#41
Operating systems used in spacecraft?
"Keith F. Lynch" wrote in message
The three computers would have different hardware, designed on different principles by different people. And they'd be running different software, designed on different principles by different people. I have worked in software development for 10 years, and the *only* to keep bugs manageable is reduce the complexity of the code. Lots of versions of the same thing is only going to make matters worse, not better. A literature review on fault tolerant software shows that n-version software is generally not very usefully or reliable. The key to good software is complete testing of the software, not lots of versions. More unit tests than you can shake a stick at, and white box testing as well (check for coverage etc.). Software is deterministic, if your test are complete then you doing need to wonder about 'what if' situations, you test them first. Hardware is a different story. But then you can redundant hardware without the need for n-versions of software. Greg |
#42
Operating systems used in spacecraft?
"Keith F. Lynch" writes:
> "Keith F. Lynch" wrote: "Each action will take place as soon as two of
> the three computers decide it should happen."
>
> "rk" wrote: "Then two of the computers will decide that the computer that
> is 100 ms late (which is a looooooooong time) died and vote it off the
> island. Then the next two will quickly decide that each other has failed."
>
> My idea was that no computer would ever be "voted off the island". All
> three would continue to vote on each action. If they're slightly out of
> sync, no problem. Each action takes place when two vote to do it. If one
> fails completely, and consistently votes for bogus actions, or fails to
> vote at all, no problem, so long as the other two keep working.
>
> The three computers would have different hardware, designed on different
> principles by different people. And they'd be running different software,
> designed on different principles by different people.

And possibly coming up with three completely different solutions to the same problem. E.g., roll control on an Ariane 4 is accomplished by thrust-vectoring the main engines, of which there are four and of which only two need be canted to null out any particular roll torque. If computer X decides, on the basis of some second-order effect, to cant engines 1 & 3 one degree each, computer Y decides to cant engines 2 & 4 one degree each, and computer Z decides to cant all four engines half a degree each, which engines move and how much?

--
* John Schilling                    * "Anything worth doing,       *
* Member: AIAA, NRA, ACLU, SAS, LP  *  is worth doing for money"   *
* Chief Scientist & General Partner *  -13th Rule of Acquisition   *
* White Elephant Research, LLC      * "There is no substitute      *
*                                   *  for success"                *
* 661-951-9107 or 661-275-6795      *  -58th Rule of Acquisition   *
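For illustration, here is a minimal sketch of the 2-of-3 voting Lynch describes (the names and command strings are invented, not from any flight system): an action fires once any two computers agree, a dead or babbling computer is simply outvoted rather than excluded, and Schilling's scenario of three different but individually valid answers produces no majority at all.

```python
# Hedged sketch of 2-of-3 majority voting. No computer is ever "voted
# off the island": a dead voter submits None and simply stops mattering.

from typing import Optional

def majority(votes: list) -> Optional[str]:
    """Return the command at least two of three voters agree on, else None."""
    assert len(votes) == 3
    for i in range(3):
        for j in range(i + 1, 3):
            if votes[i] is not None and votes[i] == votes[j]:
                return votes[i]
    return None  # no two voters agree: take no action

# One computer dead, the other two agree: the action still fires.
print(majority(["cant_engines_1_and_3", None, "cant_engines_1_and_3"]))

# Schilling's objection: three different but individually valid answers
# yield no majority, so no engine moves at all.
print(majority(["cant_1_and_3", "cant_2_and_4", "cant_all_four_half"]))
```

The second call is the crux of Schilling's point: the voter is only as good as the agreement it is handed.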
#43
Operating systems used in spacecraft?
> http://66.113.195.245/richcontent/Re...cy/tmr_simplex.jpg

I take it that the MTBF used to normalize time in that diagram is the MTBF of the simplex, right? I can understand TMR with spare - but what is TMR/simplex?

Jan
#44
Operating systems used in spacecraft?
> A literature review on fault tolerant software shows that n-version
> software is generally not very usefully or reliable.

A literature review would show that there are some people who strongly disagree with that position: http://www.leshatton.org/Documents/Nver_1297.pdf

Indeed, Airbus does, and even the FAA has certified its products for use in the USA...so the characterization as "not very useful or reliable" is likely incorrect.

Jan
#45
Operating systems used in spacecraft?
"rk" wrote in ...

> th wrote:
>> "rk" wrote in ...
>>> I think that whole strategy is bad; which is why it wasn't done. That
>>> algorithm falls apart early and continues to fall apart. Best to just
>>> not do that rather than patch it up. If I was down to two computers
>>> running together, in general I would run them both. That gives you some
>>> level of error detection.
>>
>> But it does of course double the probability of a failure. What is
>> interesting is that for long missions (long compared to the MTBF) TMR
>> has lower reliability than simplex.

It's true that majority-voting computers are really messy, both from a hardware and a software point of view, but it is one of the few ways to guarantee a very high degree of error detection (something you need to ensure the very low unreliability figures required in missions involving humans). As these missions are quite short, a TMR concept doesn't degrade the reliability too much. For longer missions, like the TMR computers in the Zvezda Service Module, you need to be able to repair the computer pool while it is in operation. Non-manned spacecraft in general very seldom use TMR configurations, although the latest COTS technology trends seem to promote these ideas; for instance, have a look at the new PowerPC board from Maxwell.

> Here's an old chart, I have older, but this one is public domain from a
> NASA report that shows the relationship between simplex, TMR, and
> variations on a theme such as TMR/simplex and TMR with switchable spare.
> There are quite a few papers on this, some on-line in the IBM Journal of
> Research and Development from either the late '50s or early '60s, I'd
> have to check my notes. Careful line wrap.
>
> http://66.113.195.245/richcontent/Re...cy/tmr_simplex.jpg

I wonder how much programming and computing work it took to calculate those curves in the old days? Now it's a couple of minutes' work with Excel to get the same chart nicely printed with different colours!
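The "couple of minutes" calculation is indeed short. A minimal sketch, assuming the standard constant-failure-rate model (simplex reliability R(t) = exp(-lambda*t), classic non-repairable TMR with a perfect voter R_tmr = 3R^2 - 2R^3), reproduces the crossover being discussed:

```python
# Hedged sketch of the simplex-vs-TMR curves. TMR beats simplex only
# while R > 0.5, i.e. for missions shorter than MTBF * ln(2); for longer
# missions TMR is *less* reliable, as noted above.

import math

def r_simplex(lt: float) -> float:
    """Reliability of one unit at normalized time lt = lambda * t."""
    return math.exp(-lt)

def r_tmr(lt: float) -> float:
    """Reliability of triple modular redundancy with a perfect voter."""
    r = r_simplex(lt)
    return 3 * r**2 - 2 * r**3

for lt in (0.1, 0.5, math.log(2), 1.0, 2.0):
    print(f"lambda*t={lt:.3f}  simplex={r_simplex(lt):.3f}  TMR={r_tmr(lt):.3f}")

# At lambda*t = ln 2 both curves pass through 0.5 and cross; beyond that
# point the simplex curve stays above TMR, matching the NASA chart.
```

Setting 3R^2 - 2R^3 = R gives R = 1/2 (besides the trivial R = 1), so the crossover sits at exactly one half-life of the simplex unit.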
> For COTS in space, an example of TMR is also present, at a lower level,
> in the SX-S and AX-S series of FPGAs, used as a method of hardening the
> flip-flops against single-event upsets.

You can always argue whether the -S series are COTS?! When using the A14100 FPGA (definitely a COTS device) you were also forced to use TMR for some applications, although the TMR had to be done by the designer.

> The logic of the Saturn V launch vehicle computer was TMR with built-in
> disagreement detectors. The memory was dual-redundant, however, with
> both memories hot and operated in parallel. Parity on each memory was
> used for error detection; the alternate memory was used for error
> correction. It automatically scrubbed the errors in hardware when
> accessed.

>> An interesting application here is Ariane 5, which uses a hot-redundant
>> pair of computers in a master/slave configuration. In the early years
>> it was planned to launch the Hermes spaceplane on Ariane 5, and a pool
>> of four computers in Hermes would then have replaced the standard
>> dual-redundant computers. This trick improved the overall system
>> reliability to values acceptable for manned missions.
>
> Do you have a reference for the Hermes? It would be interesting to look
> at that system and compare it to the Shuttle and to the X-38 work (I
> think I can dig up a good reference for that if you are interested).

I have not found anything published on the web (remember the work was done more than 10 years ago and the computer pool only reached the breadboard stage). The best officially published summary is probably:

Philippe David and Claude Guidal, "Development of a Fault Tolerant Computer System for the HERMES Space Shuttle", Digest of Papers: The Twenty-Third International Symposium on Fault-Tolerant Computing (FTCS-23), Toulouse, France, June 22-24, 1993, pp. 641-646.

Is there anything similar for the Shuttle and X-38?

--
th
#46
Operating systems used in spacecraft?
"Jan C. Vorbrüggen" wrote in message ...
>> A literature review on fault tolerant software shows that n-version
>> software is generally not very usefully or reliable.
>
> A literature review would show that there are some people who strongly
> disagree with that position:
> http://www.leshatton.org/Documents/Nver_1297.pdf
> Indeed, Airbus does, and even the FAA has certified its products for use
> in the USA...so the characterization as "not very useful or reliable" is
> likely incorrect.

Sorry, it appears my draft went through; I meant to have an IMO in there. There are a lot of papers both for and against. Everyone seems to agree that n versions (of software) aren't independent; also everyone agrees that n versions are more reliable than one, it's just a question of how much better. I have a hard time accepting more code to be maintained and developed as an answer. My experience is to *always* simplify as much as practical.

The problem is that in the real world, rather than in some experiment, n versions means money is not spent elsewhere, i.e. on more testing of the one version, or on a redesign so that there is simply less code to go wrong. Also, some suggest that n-version is cost-effective because you don't need to test the versions as much, which I think is bad.

Another point is the testing of the n versions themselves. If we don't have n-version unit tests (black and white box), are we introducing dependence? If we develop n versions of the unit tests (say 3) and n versions of the software (say 5) and test all combinations.....

My point is that we should be able to test thoroughly enough to be very confident of the software's behaviour in any situation (finite state machine and all that).

My motive is that I am currently starting construction of a large amateur rocket (with guidance etc.). I'm not up to software yet, but I want to be sure of safety and reliability. I am going to need to prove safety to the CAA (New Zealand's FAA), so I'm still considering n-version software (all hardware will have redundancy). But any real project *always* has resource constraints, so I want the time to be spent in the most effective way.

As for the FAA's approval, I just hope that each version still has to be up to the usual standard.

And since this was originally about OSes, a quote from the above paper:

> For example, this paper is being written on a Pentium PC using Word 7
> under Windows 95. We have found that a typical spread of software use on
> this system exhibits a defect every 42 minutes on average, with a
> serious failure around every 5 hours.

My Linux needs rebooting every power failure--about once every 3+ months. XFree needs a restart about once a week...I use it 10+ hours a day.

Greg
#48
Operating systems used in spacecraft?
I would say it's clearly useful and increases reliability, even if the different versions aren't completely independent, and the more so the more reliable the individual versions are (see Hatton's article). In addition, you do see in the literature that there are a substantial number of common-mode failures, both hardware and software (Ariane 501 is, at least in part, one of them), so trying to reduce them should be worth it in certain situations.

I would say that skimping on testing because one is relying on multiple versions is not the way to go, and limited resources are always a consideration. But these arguments are outside the determination of the merits of N-version programming as such.

Jan
#49
Operating systems used in spacecraft?
In article , gewi001@phy.auckland.ac.nz says...

> "Jan C. Vorbrüggen" wrote in message ...
>> Indeed, Airbus does, and even the FAA has certified its products for
>> use in the USA...so the characterization as "not very useful or
>> reliable" is likely incorrect.

There are at least two explanations that lead to a different conclusion:

(1) People don't understand the difference between hardware and software. Well-designed hardware redundancy does help deal with failures -- lots of historical examples demonstrate this. Hardware is designed / tested / built well enough that common-mode problems happen, but only rarely. At first glance, the same might seem to be true of software, but history says this just isn't so.

(2) Just because an N-version system was FAA certified does not mean that an (N-1)-version system wouldn't also have been certified.

> everyone agrees that n-versions is more reliable than 1, its just how
> much better. I have a hard time going for more code to maintained and
> developed as a answer.. My experience is to *always* simplify as much
> as practical.

Yet an N-version system is more complex than a 1-version system. A 3-version system is *much* more than 3 times as complex as a 1-version system. In addition to three separate systems, there is the system that takes three outputs and decides what the "real" output should be. This "decide the real output" logic is simple at first glance, but complex when you start looking at all the nasty details. This kind of complexity is a classic example of things that are very difficult to prove by testing, since there is a huge number of interesting test cases.

You also have to watch closely for "feature creep". If you build, say, a 5-version system, someone is going to want the system to survive a two-point failure. Then someone else will want the system to survive a four-point failure. Quickly you have a much, much more complex system.
> Also some suggest that n-version is cost effective because you don't
> need to test them as much, which i think is bad.

I can't believe any serious researcher believes that N-version systems need less testing.

> My point is that we should be able to test throughly enough to be very
> confidant of its behaviour in any situation (finite state machine and
> all that).

This is simplistic. The state of a non-trivial system requires many, many bits, and the number of permutations of those bits is astronomically large. The Apollo computer had 2048 words of state (modifiable) memory, 15 bits/word. That's 30720 bits, and no one thinks this was excessive. The number of states is 2^30720. If you can test one state per microsecond, the Universe will die of heat-death before you are done.

> And since this was originally about OS.. a quote from the above paper:
> For example, this paper is being written on a Pentium PC using Word 7
> under Windows 95. We have found that a typical spread of software use
> on this system exhibits a defect every 42 minutes on average, with a
> serious failure around every 5 hours.
> My Linux needs rebooting every power failure--about once every
> 3+ months. XFree needs a restart about once a week...I use it 10+ hours
> a day.

"The singular of 'data' isn't 'anecdote'."
--
Kevin Willoughby
Imagine that, a FROG ON-OFF switch, hardly the work for test pilots. -- Mike Collins
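The heat-death arithmetic above is easy to check. A hedged back-of-envelope (figures illustrative: one state tested per microsecond, a roughly 13.8-billion-year-old universe) shows the gap is not even close:

```python
# Back-of-envelope check of the state-explosion argument: even a tiny
# memory has vastly more states than there have been microseconds since
# the Big Bang. All figures are rough and for illustration only.

import math

BITS = 2048 * 15            # Apollo-sized erasable memory, in bits
STATES_LOG2 = BITS          # the memory has 2^BITS distinct states

MICROSECONDS_PER_YEAR = 1e6 * 60 * 60 * 24 * 365.25
UNIVERSE_AGE_YEARS = 13.8e9

# log2 of the number of microseconds elapsed since the Big Bang:
log2_budget = math.log2(UNIVERSE_AGE_YEARS * MICROSECONDS_PER_YEAR)

print(f"states: 2^{STATES_LOG2}, whole-universe test budget: ~2^{log2_budget:.1f}")
# Testing one state per microsecond for the age of the universe covers
# fewer than 2^79 states, leaving about 2^(30720 - 79) states untested.
```

The conclusion stands under any remotely plausible test rate: exhaustive state coverage is hopeless, so testing has to target behavior rather than raw state.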
#50
Operating systems used in spacecraft?
"Jan C. Vorbrüggen" wrote in message ...

>> http://66.113.195.245/richcontent/Re...cy/tmr_simplex.jpg
>
> I take it that the MTBF used to normalize time in that diagram is the
> MTBF of the simplex, right?

Yes. For a simplex system MTBF = 1/lambda, and since R = exp(-lambda*t), the reliability at t = MTBF becomes exp(-1), which is approximately 0.37.

--
th
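A quick numeric check of th's figure, under the constant-failure-rate assumption he states (the MTBF value below is arbitrary; any value gives the same result):

```python
# R(t) = exp(-lambda * t) for a constant failure rate lambda.
# At t = MTBF = 1/lambda, the exponent is exactly -1 regardless of units.

import math

mtbf = 100.0                      # arbitrary units; result is unit-free
lam = 1.0 / mtbf
r_at_mtbf = math.exp(-lam * mtbf)

print(round(r_at_mtbf, 3))        # a simplex unit has roughly a 37% chance
                                  # of surviving one full MTBF interval
```

This is why the normalized-time axis of the chart puts the simplex curve at about 0.37 when lambda*t = 1.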