  #29  
January 23rd 04, 05:00 PM
Charles Buckley
'Spirit' Communications Emergency

Sander Vesik wrote:
Henry Spencer wrote:

In article ,
Gary W. Swearingen wrote:

jumble should have reset by now, right? What hardware issues could give the
symptoms we are experiencing now?

The dude at this morning's briefing mentioned "SAU" or "SUA" (?) and
something about cosmic rays, in the same breath.


Probably SEU, Single Event Upset, where a bit gets flipped by a particle
hit on a chip. If it's an important bit, a mess can result. :-)

The good news is that an SEU is a transient error, not a permanent failure,
assuming you have some way of resetting and restarting.



Umm... you mean somebody would seriously consider having a project measured
in millions of dollars and not include trivial small things like SECDED,
memory scrubbing and restarts? You know stuff that is slowly coming even
to low end servers? I would be really shocked...



The ability of that stuff to work is usually overstated. Work
somewhere that has thousands of Suns and you will learn
that "ecache" is an evil word. There will be sections of the
CPU and RAM that can't be scrubbed, and then there is the question
of how effective a restart will be. Suns, for instance, quite
often slag their filesystems during this type of shutdown and require
manual intervention to recover.

It is usually much more cost effective to just shut down and
restart, as Henry mentioned, rather than exponentially increase the
software requirements by adding a lot of overhead. This hit a weird
boundary condition where, hypothetically, the bit-flip managed to
toggle something it should not have.
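
For anyone who hasn't run into SECDED (single error correct, double
error detect) before, here is a toy sketch of the idea in Python, using
an extended Hamming(8,4) code on 4-bit values. This is only an
illustration: the function names and the bit layout are my own
assumptions, real ECC memory typically protects much wider words, and
nothing here reflects what the rover's hardware actually does.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Layout (an assumption for this sketch): index 0 holds the overall
    parity bit, indices 1..7 are the classic Hamming positions with
    parity bits at 1, 2, 4 and data bits at 3, 5, 6, 7.
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'double'."""
    code = list(code)
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the flipped position after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_bad = sum(code) % 2 == 1

    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        # Odd number of flips: assume one and repair it. A syndrome of 0
        # here means the overall parity bit itself was hit.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with good overall parity: two bits flipped.
        return "double", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


def flip(code, *positions):
    """Simulate an upset by flipping the given bit positions."""
    hit = list(code)
    for pos in positions:
        hit[pos] ^= 1
    return hit
```

A single flipped bit anywhere in the codeword decodes back to the
original value, while two flipped bits are reported but not repaired.
A scrubber is then just a loop that periodically reads every word,
decodes it, and writes the corrected value back before a second hit can
land on the same word. None of this helps with state that has no check
bits at all, which is the point above about sections that can't be
scrubbed.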