Operating systems used in spacecraft?

#31 November 12th 03, 11:38 PM

In article , says...
Presumably, space Shuttle
and space probe code would be written by excellent programmers, and
then thoroughly tested.

The Shuttle's flight software development process is one of the standard
case studies in software engineering. They really care about the quality
of what they produce.
--
Kevin Willoughby

Imagine that, a FROG ON-OFF switch, hardly the work
for test pilots. -- Mike Collins

#32 November 13th 03, 03:07 AM

In article ,
Keith F. Lynch wrote:
Oh. I guess that redundancy is only meant to protect against hardware
failures. The right thing to do would be to have them run completely
different code. Of course the code would have to all be written from
the same specification, and that specification could contain bugs.

Plus, experienced programmers writing code to the same spec have a
tendency -- surprise, surprise -- to write similar code with similar bugs.
It is not enough to have several guys implement to the same spec; it has
to be part of the *specs* that they use different approaches, different
tools, etc.

Occasionally there are simple and elegant ways of doing this -- e.g., a
dual-redundant display system where computer #1 accepted air-data inputs
and generated a display, and computer #2 read the display, worked backward
to what the air-data inputs should have been, and cried "foul" if they
didn't match fairly closely. But usually it's fairly difficult to enforce
sufficient diversity.
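
A minimal sketch of that inverse-check idea, for illustration only. The linear needle scale, the 400-knot range, the 5-knot tolerance, and all function names are assumptions made up here; nothing below comes from the actual display system described above.

```python
# Hypothetical display scale: 0-400 knots mapped linearly onto 0-300 degrees.
MAX_KNOTS = 400.0
MAX_DEGREES = 300.0

def render_needle(airspeed_knots):
    """Computer #1: turn an air-data input into a needle angle."""
    return airspeed_knots / MAX_KNOTS * MAX_DEGREES

def infer_airspeed(needle_degrees):
    """Computer #2: work backward from the displayed angle to the input."""
    return needle_degrees / MAX_DEGREES * MAX_KNOTS

def display_is_consistent(airspeed_knots, needle_degrees, tolerance_knots=5.0):
    """Return False ("foul") if the display disagrees with the sensed input."""
    inferred = infer_airspeed(needle_degrees)
    return abs(inferred - airspeed_knots) <= tolerance_knots

healthy = render_needle(250.0)
print(display_is_consistent(250.0, healthy))         # display matches input
print(display_is_consistent(250.0, healthy + 30.0))  # needle 30 degrees high
```

The point of the arrangement is that the checker shares no code path with the generator: it runs the mapping in the opposite direction, so a bug in one is unlikely to be mirrored in the other.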

The fifth Shuttle computer is running very different code written
from a very different specification, to minimize the chance of
common-mode bugs.

When does the Shuttle rely on that fifth computer, and ignore the
other four?

When one of the pilots pushes the "switch to backup software" button.
(It has never been used in flight.)
--
MOST launched 30 June; first light, 29 July; 5arcsec | Henry Spencer
pointing, 10 Sept; first science, early Oct; all well. |

#33 November 13th 03, 03:12 AM

In article ,
rk wrote:
On the other hand, we have a lesson from Skylab:
34. Lesson: Redundancy Design
When designing redundancies into systems, consider the use
of nonidentical approaches for backup, alternate, and
redundant items.

The guys at XCOR put it this way: "We don't consider it to be redundant
to put down two of the same thing."

(For example, the EZ-Rocket FMEA revealed that having the engines stop on
loss of electrical control power was a really bad idea: what happens if
they cut out just after takeoff? So they stay on if that happens, with a
redundant pure-mechanical cutoff system. And sure enough, on flight 11,
all electrical engine controls went out!)
--
MOST launched 30 June; first light, 29 July; 5arcsec | Henry Spencer
pointing, 10 Sept; first science, early Oct; all well. |

#34 November 13th 03, 04:19 AM

rk wrote:
For those in the DC area, there's a seminar coming up related to
this topic:

From Anonymity to Ubiquity:
A Study of Our Increasing Reliance on Fault Tolerant Computing

Elwin C. Ong
Massachusetts Institute of Technology
NASA Goddard, Office of Logic Design

December 9, 2003

. . .
http://66.113.195.245/richcontent/Tu...g_abstract.htm

I can't find the location or the time of this event.
--
Keith F. Lynch - - http://keithlynch.net/
I always welcome replies to my e-mail, postings, and web pages, but
unsolicited bulk e-mail (spam) is not acceptable. Please do not send me
HTML, "rich text," or attachments, as all such email is discarded unread.

#35 November 13th 03, 05:08 AM

rk writes:

dave schneider wrote:
[synchronization requirements] Pentium 1's may have started pushing on
picoseconds).

Pentium timing resolution is in picoseconds as controlling and being
tolerant of clock skew is critical to making the new processors work
(amazing engineering there). In fact, for the newer
microprocessors, over the last few years, the variation in process
across a single die (not even wafer or lot) is a parameter that can
not be ignored any longer. So, they have local circuits distributed
about the chip to perform clock edge placement to get the skews
under control, accounting for variation in transistor performance.

The design of modern microprocessor clocks is a big engineering project
all by itself. Here's a paper about the Intel clock design - note the 5
co-authors.

Itanium Processor Clock Design
Utpal Desai, Simon Tam, Robert Kim, Ji Zhang, Stefan Rusu
ISPD 2000

Here's an excerpt that shows the flavor of what is required:

Figure 1 shows the high level Itanium processor clock
generation and distribution scheme. The core PLL receives
differential clock inputs running at the bus clock frequency and
generates a high frequency clock at twice the core clock
frequency [3]. A divide-by-two circuit generates the high
frequency core clock and the
reference clock using the 2X-frequency clock from the PLL. Both
these clocks, the core clock and the reference clock, are routed
from the PLL via a balanced H-tree to 8 deskew clusters, each of
which contains four distinct deskew buffers (DSK). The deskew
buffer is a digitally controlled analog delay line, whose function
is to detect and eliminate the skew between any two clocks and
will be described in details later. The reference clock route stops
at the deskew clusters. The core clock, on the other hand is
routed as an input to the DSKs within a deskew cluster and
generates output clocks shown as gclk in Figure 1. The total die
area is partitioned into 30 regions, and therefore only 30 out of
the 32 gclk signals are used to generate the clocks required by a
clock region. The gclk output from a DSK is buffered by bank of
buffers called the regional clock drivers (RCD) located at the top
and bottom of a clock region and distributed over the clock
region via a clock grid. The circuits within a clock region tap
directly into the overlaying grid to generate the local clocks
required within the region.

[Pages of technical details omitted...]

In summary, the Itanium processor utilizes an active deskewing
scheme to achieve a low clock skew. This scheme compensates
for process variations which is not possible using a passive
scheme. Total measured skew between the 30 clock regions in the
Itanium processor is 28 ps. In contrast, the measured skew with
the deskew buffer architecture turned off, would have been 110
ps.
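
To give a feel for what "active deskewing" means, here is a toy software model of a digitally controlled delay line. It is not the circuit from the paper: the 8 ps step size, the arrival offsets, and the rule of aligning every region to the latest one are all assumptions chosen for illustration, and the result is not meant to reproduce the 28 ps figure.

```python
def deskew(arrival_ps, step_ps=8.0, max_iterations=100):
    """arrival_ps: uncorrected clock arrival time of each region.
    Returns the extra delay (in ps) selected for each region."""
    # A delay line can only add delay, so align everything to the latest region.
    target = max(arrival_ps)
    settings = [0] * len(arrival_ps)  # delay-line steps chosen per region
    for _ in range(max_iterations):
        changed = False
        for i, arrival in enumerate(arrival_ps):
            # Phase-detector stand-in: is this clock still early by a full step?
            if arrival + (settings[i] + 1) * step_ps <= target:
                settings[i] += 1
                changed = True
        if not changed:
            break  # every region is within one step of the target
    return [s * step_ps for s in settings]

raw = [0.0, 35.0, 110.0, 62.0]  # made-up arrival offsets in picoseconds
added = deskew(raw)
corrected = [a + d for a, d in zip(raw, added)]
print("skew before:", max(raw) - min(raw), "ps")
print("skew after: ", max(corrected) - min(corrected), "ps")
```

In this model the residual skew is always smaller than one delay step, which is why a finer delay line gives a tighter clock.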

And this is just for normal operation - there are lots of other
features for debugging....

Lou Scheffer

#36 November 13th 03, 07:12 PM

Henry Spencer wrote:
In article ,
Keith F. Lynch wrote:
Oh. I guess that redundancy is only meant to protect against hardware
failures. The right thing to do would be to have them run completely
different code. Of course the code would have to all be written from
the same specification, and that specification could contain bugs.

Plus, experienced programmers writing code to the same spec have a
tendency -- surprise, surprise -- to write similar code with similar bugs.
It is not enough to have several guys implement to the same spec; it has
to be part of the *specs* that they use different approaches, different
tools, etc.

There are ways around this, though: have one set of them use Fortran 77 and
the other use, say, Haskell. Sure, if everybody writes in Ada, chances are
very good they will use a similar design. OTOH, you would have to go way
out of your way to end up with a Fortran-like design in Haskell.

--
Sander

+++ Out of cheese error +++

#37 November 14th 03, 12:39 PM

Would it be possible to get the presentation from that talk, when it
has been held?

Jan

#38 November 14th 03, 06:14 PM

"rk" wrote in
...
Keith F. Lynch wrote:

Keith F. Lynch wrote:
Doesn't the Shuttle have three computers all running in
parallel, with majority vote ruling?

Henry Spencer wrote:
It's actually four, with elaborate arrangements for
cross-connecting things as desired. (There are only three of
some of the more important subsystems, so if one computer is
acting up, you can cross-connect to put the other three in
charge of those.) The majority-rules voting is done in
hardware.

This works only because all four are running the *same*
software, bit for bit identical, in lockstep. You couldn't
get the necessary low-level timing synchronization on machines
running different code.

Why would you need low-level timing synchronization? So what
if one computer wants to take an action a tenth of a second
before the others? Just how critical is timing? Will disaster
happen if anything happens even a whole second too early or too
late?

Each action will take place as soon as two of the three
computers decide it should happen.

Then two of the computers will decide that the computer that is 100
ms late (which is a looooooooong time) died and vote it off the
island. Then the next two will quickly decide that each other has
failed.

This seems to be a bad strategy if you have only two computers remaining. In
manned (or "piloted" if you want to be sex neutral) missions typically you start
with four computers that can vote one computer out of the pool, even in case of
Byzantine errors (one computer becoming maliciously faulty and passing different
faulty data to the others). Three remaining computers can still vote one out of
the pool, but the result after this should be that the two remaining decide that
only one of them continues, a "last survivor" strategy. Having one computer
remaining instead of two halves the probability of a fault occurring. The only
reason to have two remaining computers is if the application tolerates that a
comparison error results in both computers being restarted.
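
A rough sketch of that degradation path (four, then three, then a single survivor), assuming each computer's output can be compared for exact equality. The function name, the "keep the lowest id" tie rule, and the sample values are made up for illustration. It does not model Byzantine behavior, where a faulty unit reports different values to different peers.

```python
from collections import Counter

def vote_round(outputs):
    """outputs: dict of computer_id -> value for computers still in the pool.
    Returns (agreed_value, surviving_ids)."""
    if len(outputs) == 1:
        (cid, value), = outputs.items()
        return value, [cid]
    value, count = Counter(outputs.values()).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority; pool cannot decide")
    # Anything that disagreed with the majority is voted out of the pool.
    survivors = sorted(c for c, v in outputs.items() if v == value)
    if len(survivors) == 2:
        # "Last survivor": a pair can detect a mismatch but not resolve it,
        # so continue on one unit. Keeping the lowest id is arbitrary here.
        survivors = survivors[:1]
    return value, survivors

print(vote_round({1: 10, 2: 10, 3: 10, 4: 99}))  # computer 4 is voted out
print(vote_round({1: 20, 2: 20, 3: 7}))          # computer 3 out, 1 survives
```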

--
th

#39 November 15th 03, 06:27 PM

"rk" wrote in
...
I think that whole strategy is bad; which is why it wasn't done.
That algorithm falls apart early and continues to fall apart. Best
to just not do that rather than patch it up.

If I was down to two computers running together, in general I would
run them both. That gives you some level of error detection. But
it does of course double the probability of a failure. What is
interesting is that for long missions (long compared to MTBF) TMR has
lower reliability than simplex.

It's true that majority-voting computers are really messy, both from a hardware
and software point of view, but voting is one of the few ways to guarantee a very
high degree of error detection (something you need to ensure the very low
unreliability figures required in missions involving humans). As these missions
are quite short, a TMR concept doesn't degrade the reliability too much. For
longer missions, like the TMR computers in the Zvezda Service Module, you need
to be able to repair the computer pool while it is in operation. Unmanned
spacecraft in general very seldom use TMR configurations, although the latest
COTS technology trends seem to promote these ideas; for instance, have a look at
the new PowerPC board from Maxwell.
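
The TMR-versus-simplex point can be checked with the textbook model, under assumptions that are not stated in this thread: independent units, a constant failure rate (so R = exp(-t/MTBF) for one unit), a perfect voter, and no repair. TMR then works as long as at least two of three units work, which gives R_TMR = 3R^2 - 2R^3. The two curves cross at R = 0.5, that is, at t = ln(2) x MTBF, or about 0.69 MTBF.

```python
import math

def simplex_reliability(t, mtbf):
    """Probability that a single unit is still working at time t."""
    return math.exp(-t / mtbf)

def tmr_reliability(t, mtbf):
    """Probability that at least two of three identical units still work."""
    r = simplex_reliability(t, mtbf)
    return 3 * r**2 - 2 * r**3

mtbf = 1.0  # work in units of MTBF
for t in (0.1, 0.5, math.log(2), 1.0, 2.0):
    s = simplex_reliability(t, mtbf)
    m = tmr_reliability(t, mtbf)
    print(f"t/MTBF = {t:.3f}   simplex = {s:.4f}   TMR = {m:.4f}")
```

For missions short compared to MTBF the TMR figure is the higher of the two; past the crossover it is the lower, which is where repair of the pool starts to matter.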

An interesting application here is Ariane5 that uses a hot redundant pair of
computers in a master/slave configuration. In the early years it was planned to
launch the Hermes spaceplane by Ariane5 and a pool of four computers in Hermes
would then replace the standard dual redundant computers. This trick improved
the overall system reliability to values acceptable for manned missions.

--
th

th wrote:

"rk" wrote in
...
Keith F. Lynch wrote:

Keith F. Lynch wrote:
Doesn't the Shuttle have three computers all running in
parallel, with majority vote ruling?

Henry Spencer wrote:
It's actually four, with elaborate arrangements for
cross-connecting things as desired. (There are only three
of some of the more important subsystems, so if one
computer is acting up, you can cross-connect to put the
other three in charge of those.) The majority-rules voting
is done in hardware.

This works only because all four are running the *same*
software, bit for bit identical, in lockstep. You couldn't
get the necessary low-level timing synchronization on
machines running different code.

Why would you need low-level timing synchronization? So
what if one computer wants to take an action a tenth of a
second before the others? Just how critical is timing?
Will disaster happen if anything happens even a whole second
too early or too late?

Each action will take place as soon as two of the three
computers decide it should happen.

Then two of the computers will decide that the computer that
is 100 ms late (which is a looooooooong time) died and vote it
off the island. Then the next two will quickly decide that
each other has failed.

This seems to be a bad strategy if you have only two computers
remaining. In manned (or "piloted" if you want to be sex
neutral) missions typically you start with four computers that
can vote one computer out of the pool, even in case of
Byzantine errors (one computer becoming maliciously faulty and
passing different faulty data to the others). Three remaining
computers can still vote one out of the pool but the result
after this should be that the two remaining decide that only
one of them continues, a "last survivor" strategy. Having one
computer remaining instead of two halves the probability of a
fault occurring. The only reason to have two remaining
computers is if the application tolerates that a comparison
error results in both computers being restarted.

--
rk, Just an OldEngineer
"In God we trust, all others bring data."
-- Framed plaque from the '60s, hanging in the Mission Evaluation
Room at Johnson Space Center, downstairs from Mission Control.

#40 November 15th 03, 10:49 PM

"Keith F. Lynch" wrote:
Each action will take place as soon as two of the three computers
decide it should happen.

"rk" wrote:
Then two of the computers will decide that the computer that is 100
ms late (which is a looooooooong time) died and vote it off the island.
Then the next two will quickly decide that each other has failed.

My idea was that no computer would ever be "voted off the island".
All three would continue to vote on each action. If they're slightly
out of sync, no problem. Each action takes place when two vote to
do it. If one fails completely, and consistently votes for bogus
actions, or fails to vote at all, no problem, so long as the other
two keep working.
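
As a sketch of this scheme: each action fires at the moment the second distinct computer votes for it, and nobody is ever removed from the pool. The function name, the action names, and the timings are invented for illustration. The sketch also assumes that votes never expire and that votes from different computers can be matched by an exact action name, which is itself something the common specification would have to define.

```python
def firing_times(votes, quorum=2):
    """votes: iterable of (time_s, computer_id, action).
    Returns {action: time at which the quorum-th distinct computer voted}."""
    voters_by_action = {}
    fired = {}
    for time_s, computer, action in sorted(votes):
        voters = voters_by_action.setdefault(action, set())
        voters.add(computer)  # repeat votes from one computer count once
        if action not in fired and len(voters) >= quorum:
            fired[action] = time_s
    return fired

votes = [
    (10.00, "A", "open_valve"),
    (10.10, "B", "open_valve"),     # B runs 100 ms behind A
    (10.25, "C", "open_valve"),
    (11.00, "C", "fire_thruster"),  # bogus vote from C; nobody seconds it
]
print(firing_times(votes))
```

Here the valve opens when B's vote arrives, and C's unsupported vote never takes effect.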

The three computers would have different hardware, designed on
different principles by different people. And they'd be running
different software, designed on different principles by different
people.

Of course they have to all follow the same specification, and hence
are subject to any bugs in that one specification. I can't see a
way around that.
--
Keith F. Lynch - - http://keithlynch.net/
I always welcome replies to my e-mail, postings, and web pages, but
unsolicited bulk e-mail (spam) is not acceptable. Please do not send me
HTML, "rich text," or attachments, as all such email is discarded unread.
