'Spirit' Communications Emergency

#1 January 22nd 04, 07:00 PM

Ian Stirling wrote:

JimO wrote:
Problems -- give 'em a day or two to work them out...

The most important thing is that they don't become disspirited.

Assuming Spirit is dead: (I don't think so, but for speculation's sake)

What are the implications for Opportunity? I'm thinking specifically of team
resources available during the approach/landing/deploy phase of Opportunity
next week.

Roughly what percentage of the science mission has been achieved so far? The
rover has been on Mars for 1/5 of its planned lifecycle, but a lot of that
was systems checkouts.

Do we have any ideas floating around as to what could have caused the
problem? I know there were interference problems during command upload, but
the rover should reject any garbage commands. Besides, a purely software
jumble should have reset by now, right? What hardware issues could give the
symptoms we are experiencing now?

#2 January 22nd 04, 09:15 PM

Marvin writes:

Roughly what percentage of the science mission has been achieved so far? The
rover has been on Mars for 1/5 of its planned lifecycle, but a lot of that
was systems checkouts.

Do we have any ideas floating around as to what could have caused the
problem? I know there were interference problems during command upload, but
the rover should reject any garbage commands.

It has been said that the software waits for all commands uploaded and
checked before executing them. Right now Spirit is not reacting to
anything except with status signals (just a carrier without data).
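
As a rough illustration of "check everything before executing anything",
here is a minimal Python sketch. The packet framing (payload followed by a
4-byte CRC-32) and the names frame, is_valid and accept_batch are
assumptions made up for this example; nothing here describes the rover's
actual uplink protocol.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to a command payload (hypothetical framing)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def is_valid(packet: bytes) -> bool:
    """True if the trailing CRC-32 matches the payload in front of it."""
    if len(packet) < 5:
        return False
    payload, crc = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == crc

def accept_batch(packets):
    """Return the payloads to execute, or [] if any packet in the batch is bad."""
    if not packets or not all(is_valid(p) for p in packets):
        return []  # one garbled packet rejects the whole upload
    return [p[:-4] for p in packets]

good = [frame(b"DRIVE 1"), frame(b"STOP")]
garbled = [good[0], bytes([good[1][0] ^ 0x01]) + good[1][1:]]  # one bit flipped
print(accept_batch(good))
print(accept_batch(garbled))
```

The point of rejecting the whole batch is that a half-applied command
sequence can be worse than none at all.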

Besides, a purely software jumble should have reset by now, right? What
hardware issues could give the symptoms we are experiencing now?

There has been speculation about power flaws and radiation trouble. The
fault tree does not seem to be complete yet, though.

Jochem

--
"A designer knows he has arrived at perfection not when there is no
longer anything to add, but when there is no longer anything to take
away." - Antoine de Saint-Exupery

#3 January 22nd 04, 11:05 PM

On 22 Jan 2004 21:00:51 +0200, Marvin
wrote:

Assuming Spirit is dead: (I don't think so, but for speculation's sake)

Well things are not looking good.

What are the implications for Opportunity? I'm thinking specifically of team
resources available during the approach/landing/deploy phase of Opportunity
next week.

These two rovers were built the same, which means that exactly the
same thing could happen again.

Roughly what percentage of the science mission has been achieved so far? The
rover has been on Mars for 1/5 of its planned lifecycle, but a lot of that
was systems checkouts.

Compared with what was planned, not a lot. However, they have got a
large amount of data, which is surely useful, but I suspect it fails to
answer the question they came here to answer.

Had they had a few more days, their useful data would have roughly
doubled as they started using all the science instruments.

Do we have any ideas floating around as to what could have caused the
problem?

Computer crash? Or just plain frontal lobe death? I have also been
wondering about the state of the internal temperature and other
factors.

I also considered local weather issues, but that seems to have been
discounted. Still, maybe Spirit with its pointy metal eyes, metal
body and metal wheels makes for a good lightning target?

They have also been wondering about the strange surface for a long
time, where I guess falling through into a chasm is not an option.

Still, the ground in places like this is assumed to hold a lot of ice,
and you have to wonder what that can do. Imagine if all that ice
melted for a short time: things could tend to sink.

Maybe not an option, but they don't understand how Mars works yet.

I know there were interference problems during command upload, but
the rover should reject any garbage commands. Besides, a purely software
jumble should have reset by now, right? What hardware issues could give the
symptoms we are experiencing now?

They need more information in order to narrow down the possible cause.

Cardman
http://www.cardman.com
http://www.cardman.co.uk

#4 January 22nd 04, 11:32 PM

Marvin writes:

Do we have any ideas floating around as to what could have caused the
problem? I know there were interference problems during command upload, but
the rover should reject any garbage commands. Besides, a purely software
jumble should have reset by now, right? What hardware issues could give the
symptoms we are experiencing now?

The dude at this morning's briefing mentioned "SAU" or "SUA" (?) and
something about cosmic rays, in the same breath.

In normal ECC DRAM PC memory, one NASA guy reported seeing errors at a
rate of 1.4 per year per GB. Another guy at 1 km altitude reported
83. These errors are caused almost entirely by cosmic rays which are
mostly blocked by Earth's atmosphere, but rovers probably carry
shielding to lower the risk to some low level. Maybe they just got
real unlucky.
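
To put those rates in perspective, here is a back-of-envelope Python
sketch that treats upsets as a Poisson process. Only the 1.4 and 83
figures come from the reports above, and I am reading the 83 as the same
"per year per GB" unit, which is not stated. The memory size and duration
passed in below are placeholder assumptions, not rover specifications.

```python
import math

def expected_errors(rate_per_gb_year: float, gigabytes: float, days: float) -> float:
    """Mean number of memory errors for a given rate, memory size and duration."""
    return rate_per_gb_year * gigabytes * days / 365.0

def p_at_least_one(rate_per_gb_year: float, gigabytes: float, days: float) -> float:
    """Poisson probability of one or more errors in the period."""
    return 1.0 - math.exp(-expected_errors(rate_per_gb_year, gigabytes, days))

# Placeholder inputs: 0.125 GB of memory watched for 90 days (both assumed).
for label, rate in (("lower rate", 1.4), ("1 km altitude", 83.0)):
    print(label, round(p_at_least_one(rate, 0.125, 90), 3))
```

Even a small per-GB rate adds up once you multiply by memory size and
mission length, which is why the shielding and recovery design matter.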

#5 January 23rd 04, 01:08 AM

(Gary W. Swearingen) writes:

Marvin writes:

Do we have any ideas floating around as to what could have caused the
problem? I know there were interference problems during command upload, but
the rover should reject any garbage commands. Besides, a purely software
jumble should have reset by now, right? What hardware issues could give the
symptoms we are experiencing now?

The dude at this morning's briefing mentioned "SAU" or "SUA" (?) and
something about cosmic rays, in the same breath.

"SEU", probably, for "Single Event Upset". A single cosmic ray passing
through a chip can, sometimes, cause a bit flip or spuriously trip a
switch. This is no big deal if it's a low-order bit in a bitmap image
file, a very big deal if it's part of the address following a "GOTO"
in the executable code, etc.
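
A toy Python illustration of why the location of the flipped bit matters.
The pixel value and the jump address are made-up numbers, and flip_bit is
just a helper name for this sketch.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with the given bit position inverted, as an SEU might do."""
    return value ^ (1 << bit)

# Low-order bit of an 8-bit pixel: brightness moves by one step out of 255.
pixel = 200
pixel_hit = flip_bit(pixel, 0)

# One bit in a jump target: execution lands 4096 bytes away from where it should.
address = 0x1F40
address_hit = flip_bit(address, 12)

print(pixel, "->", pixel_hit)
print(hex(address), "->", hex(address_hit))
```

Same physical event in both cases; the damage depends entirely on what
the bit meant.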

Spacecraft designers know that high-density electronics are susceptible
to the odd SEU, and tend to incorporate robust fault recovery systems
to, e.g., reload everything from a backup ROM and start over. So if
Spirit suffered an SEU, there's a good chance to save it.

Somewhat more serious is an "SEL", "Single Event Latchup", where the
bit flip closes a switch that shorts out some key bit of electronics,
possibly in a physically destructive manner, possibly in a logically
irreversible manner. Spacecraft designers work real hard to make
opportunities for this sort of thing very rare, but they do happen
from time to time.

--
*John Schilling * "Anything worth doing, *
*Member:AIAA,NRA,ACLU,SAS,LP * is worth doing for money" *
*Chief Scientist & General Partner * -13th Rule of Acquisition *
*White Elephant Research, LLC * "There is no substitute *
* for success" *
*661-951-9107 or 661-275-6795 * -58th Rule of Acquisition *

#6 January 23rd 04, 12:43 AM

In article ,
Gary W. Swearingen wrote:
jumble should have reset by now, right? What hardware issues could give the
symptoms we are experiencing now?

The dude at this morning's briefing mentioned "SAU" or "SUA" (?) and
something about cosmic rays, in the same breath.

Probably SEU, Single Event Upset, where a bit gets flipped by a particle
hit on a chip. If it's an important bit, a mess can result. :-)

The good news is that an SEU is a transient error, not a permanent failure,
assuming you have some way of resetting and restarting.
--
MOST launched 30 June; science observations running | Henry Spencer
since Oct; first surprises seen; papers pending. |

#7 January 23rd 04, 03:35 AM

On Fri, 23 Jan 2004 00:43:24 GMT, (Henry Spencer)
wrote:

Probably SEU, Single Event Upset, where a bit gets flipped by a particle
hit on a chip. If it's an important bit, a mess can result. :-)

The good news is that an SEU is a transient error, not a permanent failure,
assuming you have some way of resetting and restarting.

Well, there appears to be at least some good news on this matter.
They sent a command to Spirit on the assumption that it was in safe
mode following a failure, simply saying: if you receive this, then
please respond.

And sure enough Spirit went "beep" indicating that it did.

This means that the X-band system is working, the SSPA, the
multi-space transponder and all the rest. However, Spirit has detected
a serious fault and so has gone into safe mode.

It could still be screwed, which they will have to find out for
themselves, but at least this is good news in that it lives on and
is at minimum somewhat alive.

I have been thinking a bit about this, and I expect that this system
is divided up into two layers (at least). One is the control software,
which could well have malfunctioned somehow. The second, lower level,
where this beep comes from, is a more hard-coded program.

From what I know of how such a system should be designed, at this
low level they should be able to do a lot of things, like restarting
the system or even loading in a fault detection program to check it
all out.

We will have to see what has gone on, but there is a fair chance that
it could make a good recovery.

Fingers crossed.

Just two days left now before Opportunity touches down.

Cardman
http://www.cardman.com
http://www.cardman.co.uk

#8 January 23rd 04, 04:44 PM

Henry Spencer wrote:
In article ,
Gary W. Swearingen wrote:
jumble should have reset by now, right? What hardware issues could give the
symptoms we are experiencing now?

The dude at this morning's briefing mentioned "SAU" or "SUA" (?) and
something about cosmic rays, in the same breath.

Probably SEU, Single Event Upset, where a bit gets flipped by a particle
hit on a chip. If it's an important bit, a mess can result. :-)

The good news is that an SEU is a transient error, not a permanent failure,
assuming you have some way of resetting and restarting.

Umm... you mean somebody would seriously consider having a project measured
in millions of dollars and not include trivial small things like SECDED,
memory scrubbing and restarts? You know stuff that is slowly coming even
to low end servers? I would be really shocked...
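
For anyone who hasn't met SECDED (single error correction, double error
detection), here is a minimal Python sketch using an extended Hamming
(8,4) code on one nibble. Real memory controllers use much wider codes in
hardware; the bit layout and function names here are just one textbook
arrangement chosen for the example.

```python
def encode(nibble: int) -> list:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    code = [0] * 8
    # Data bits sit at positions 3, 5, 6, 7; parity bits at 1, 2, 4.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # overall parity makes the total even
    return code

def decode(code: list):
    """Return (nibble, status); nibble is None if the word is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    odd = sum(code) % 2
    if syndrome == 0 and not odd:
        status = "ok"
    elif odd:
        code[syndrome] ^= 1  # single error; syndrome 0 means the parity bit
        status = "corrected"
    else:
        return None, "uncorrectable"  # even parity, nonzero syndrome: two flips
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status

word = encode(0b1011)
word[6] ^= 1  # simulate one upset
print(decode(word))
```

One flipped bit is repaired silently, two are at least noticed, and three
or more can slip through, which is where scrubbing and restarts come in.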

--
Sander

+++ Out of cheese error +++

#9 January 23rd 04, 05:00 PM

Sander Vesik wrote:
Henry Spencer wrote:

In article ,
Gary W. Swearingen wrote:

jumble should have reset by now, right? What hardware issues could give the
symptoms we are experiencing now?

The dude at this morning's briefing mentioned "SAU" or "SUA" (?) and
something about cosmic rays, in the same breath.

Probably SEU, Single Event Upset, where a bit gets flipped by a particle
hit on a chip. If it's an important bit, a mess can result. :-)

The good news is that an SEU is a transient error, not a permanent failure,
assuming you have some way of resetting and restarting.

Umm... you mean somebody would seriously consider having a project measured
in millions of dollars and not include trivial small things like SECDED,
memory scrubbing and restarts? You know stuff that is slowly coming even
to low end servers? I would be really shocked...

The ability of that stuff to work is usually overstated. Work
somewhere where they have thousands of Suns and you will learn
that "ecache" is an evil word. There will be sections of the
CPU and RAM that can't be scrubbed, and then there is the question
of how effective a restart will be. Suns, for instance, quite
often slag the filesystems during this type of shutdown and require
manual intervention to recover.

It is usually much more cost effective to just shut down and
restart as Henry mentioned rather than exponentially increase the
software requirements by adding a lot of overhead. This hit a weird
boundary condition where hypothetically the bit-flip managed to
toggle something it should not have.

#10 January 23rd 04, 08:00 PM

Sander Vesik writes:

Umm... you mean somebody would seriously consider having a project measured
in millions of dollars and not include trivial small things like SECDED,
memory scrubbing and restarts? You know stuff that is slowly coming even
to low end servers? I would be really shocked...

You shouldn't be; millions of dollars doesn't buy much custom
electronics, let alone all the big hardware, software, and people to
run the program. BTW, we don't know that they don't have the
features you mention (except we know they have restarts -- probably
more than 60 of them so far). Plus, they might have other ways of
keeping memory corruption risk low, like very good radiation hardening
or frequent checksumming of memory or something.
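
By "frequent checksumming of memory" I mean something along the lines of
this Python sketch: keep a CRC per block, re-check periodically, and
repair any block that no longer matches from a known-good copy. The block
size, the existence of a clean "golden" image and the function names are
all assumptions for illustration, not anything known about the rover.

```python
import zlib

BLOCK = 64  # bytes per checked block (arbitrary choice for this sketch)

def checksums(image: bytes) -> list:
    """CRC-32 of each BLOCK-sized slice of a known-good image."""
    return [zlib.crc32(image[i:i + BLOCK]) for i in range(0, len(image), BLOCK)]

def scrub(ram: bytearray, golden: bytes, sums: list) -> list:
    """Restore any block whose CRC has changed; return the repaired block indices."""
    repaired = []
    for n, expected in enumerate(sums):
        lo, hi = n * BLOCK, (n + 1) * BLOCK
        if zlib.crc32(ram[lo:hi]) != expected:
            ram[lo:hi] = golden[lo:hi]  # overwrite the corrupted block
            repaired.append(n)
    return repaired

golden = bytes(range(256))
ram = bytearray(golden)
sums = checksums(golden)
ram[70] ^= 0x10  # simulate a bit flip in the second block
print(scrub(ram, golden, sums))
```

The catch is that this only helps for memory that is not supposed to
change, such as program code; live data needs something like ECC instead.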

I did hear today that they have another copy of the main software
onboard which they can load up if they want to. But I suspect they
greatly suspect some real hardware failure and need to figure out what
that is before trying to work around it. They're even willing to let
the batteries go dead at night while waiting to gather more diagnostic
info, rather than just taking the Microsoft approach and reloading the
software.
