#41
Operating systems used in spacecraft?
"Keith F. Lynch" wrote in message
The three computers would have different hardware, designed on different principles by different people. And they'd be running different software, designed on different principles by different people. I have worked in software development for 10 years, and the *only* to keep bugs manageable is reduce the complexity of the code. Lots of versions of the same thing is only going to make matters worse, not better. A literature review on fault tolerant software shows that n-version software is generally not very usefully or reliable. The key to good software is complete testing of the software, not lots of versions. More unit tests than you can shake a stick at, and white box testing as well (check for coverage etc.). Software is deterministic, if your test are complete then you doing need to wonder about 'what if' situations, you test them first. Hardware is a different story. But then you can redundant hardware without the need for n-versions of software. Greg |
#42
Operating systems used in spacecraft?
"Keith F. Lynch" writes:
> "Keith F. Lynch" wrote: "Each action will take place as soon as two of
> the three computers decide it should happen."
>
> "rk" wrote: "Then two of the computers will decide that the computer that
> is 100 ms late (which is a looooooooong time) died and vote it off the
> island. Then the next two will quickly decide that each other has failed."
>
> My idea was that no computer would ever be "voted off the island". All
> three would continue to vote on each action. If they're slightly out of
> sync, no problem. Each action takes place when two vote to do it. If one
> fails completely, and consistently votes for bogus actions, or fails to
> vote at all, no problem, so long as the other two keep working.
>
> The three computers would have different hardware, designed on different
> principles by different people. And they'd be running different software,
> designed on different principles by different people.

And possibly coming up with three completely different solutions to the same problem. E.g., roll control on an Ariane 4 is accomplished by thrust-vectoring the main engines, of which there are four and of which only two need be canted to null out any particular roll torque. If computer X decides, on the basis of some second-order effect, to cant engines 1 & 3 one degree each, computer Y decides to cant engines 2 & 4 one degree each, and computer Z decides to cant all four engines half a degree each, which engines move and how much?

--
* John Schilling                    * "Anything worth doing,       *
* Member: AIAA, NRA, ACLU, SAS, LP  *  is worth doing for money"   *
* Chief Scientist & General Partner *  -13th Rule of Acquisition   *
* White Elephant Research, LLC      * "There is no substitute      *
*                                   *  for success"                *
* 661-951-9107 or 661-275-6795      *  -58th Rule of Acquisition   *
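For illustration, here is a minimal sketch of the 2-of-3 voting Lynch describes (the names and command strings are invented, not from any flight system): an action fires once any two computers agree, a dead or babbling computer is simply outvoted rather than excluded, and Schilling's scenario of three different but individually valid answers produces no majority at all.

```python
# Hedged sketch of 2-of-3 majority voting. No computer is ever "voted
# off the island": a dead voter submits None and simply stops mattering.

from typing import Optional

def majority(votes: list) -> Optional[str]:
    """Return the command at least two of three voters agree on, else None."""
    assert len(votes) == 3
    for i in range(3):
        for j in range(i + 1, 3):
            if votes[i] is not None and votes[i] == votes[j]:
                return votes[i]
    return None  # no two voters agree: take no action

# One computer dead, the other two agree: the action still fires.
print(majority(["cant_engines_1_and_3", None, "cant_engines_1_and_3"]))

# Schilling's objection: three different but individually valid answers
# yield no majority, so no engine moves at all.
print(majority(["cant_1_and_3", "cant_2_and_4", "cant_all_four_half"]))
```

The second call is the crux of Schilling's point: the voter is only as good as the agreement it is handed.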
#43
Operating systems used in spacecraft?
> http://66.113.195.245/richcontent/Re...cy/tmr_simplex.jpg

I take it that the MTBF used to normalize time in that diagram is the MTBF of the simplex, right? I can understand TMR with spare - but what is TMR/simplex?

Jan
#44
Operating systems used in spacecraft?
> A literature review on fault tolerant software shows that n-version
> software is generally not very usefully or reliable.

A literature review would show that there are some people who strongly disagree with that position: http://www.leshatton.org/Documents/Nver_1297.pdf

Indeed, Airbus does, and even the FAA has certified its products for use in the USA...so the characterization as "not very useful or reliable" is likely incorrect.

Jan
#45
Operating systems used in spacecraft?
"rk" wrote in ...

> th wrote:
>> "rk" wrote in ...
>>> I think that whole strategy is bad; which is why it wasn't done. That
>>> algorithm falls apart early and continues to fall apart. Best to just
>>> not do that rather than patch it up. If I was down to two computers
>>> running together, in general I would run them both. That gives you some
>>> level of error detection.
>>
>> But it does of course double the probability of a failure. What is
>> interesting is that for long missions (long compared to the MTBF) TMR
>> has lower reliability than simplex.

It's true that majority-voting computers are really messy, both from a hardware and a software point of view, but it is one of the few ways to guarantee a very high degree of error detection (something you need to ensure the very low unreliability figures required in missions involving humans). As these missions are quite short, a TMR concept doesn't degrade the reliability too much. For longer missions, like the TMR computers in the Zvezda Service Module, you need to be able to repair the computer pool while it is in operation. Non-manned spacecraft in general very seldom use TMR configurations, although the latest COTS technology trends seem to promote these ideas; for instance, have a look at the new PowerPC board from Maxwell.

> Here's an old chart, I have older, but this one is public domain from a
> NASA report that shows the relationship between simplex, TMR, and
> variations on a theme such as TMR/simplex and TMR with switchable spare.
> There are quite a few papers on this, some on-line in the IBM Journal of
> Research and Development from either the late '50s or early '60s, I'd
> have to check my notes. Careful line wrap.
>
> http://66.113.195.245/richcontent/Re...cy/tmr_simplex.jpg

I wonder how much programming and computing work it took to calculate those curves in the old days? Now it's a couple of minutes' work with Excel to get the same chart nicely printed with different colours!
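The "couple of minutes" calculation is indeed short. A minimal sketch, assuming the standard constant-failure-rate model (simplex reliability R(t) = exp(-lambda*t), classic non-repairable TMR with a perfect voter R_tmr = 3R^2 - 2R^3), reproduces the crossover being discussed:

```python
# Hedged sketch of the simplex-vs-TMR curves. TMR beats simplex only
# while R > 0.5, i.e. for missions shorter than MTBF * ln(2); for longer
# missions TMR is *less* reliable, as noted above.

import math

def r_simplex(lt: float) -> float:
    """Reliability of one unit at normalized time lt = lambda * t."""
    return math.exp(-lt)

def r_tmr(lt: float) -> float:
    """Reliability of triple modular redundancy with a perfect voter."""
    r = r_simplex(lt)
    return 3 * r**2 - 2 * r**3

for lt in (0.1, 0.5, math.log(2), 1.0, 2.0):
    print(f"lambda*t={lt:.3f}  simplex={r_simplex(lt):.3f}  TMR={r_tmr(lt):.3f}")

# At lambda*t = ln 2 both curves pass through 0.5 and cross; beyond that
# point the simplex curve stays above TMR, matching the NASA chart.
```

Setting 3R^2 - 2R^3 = R gives R = 1/2 (besides the trivial R = 1), so the crossover sits at exactly one half-life of the simplex unit.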
> For COTS in space, an example of TMR is also present, at a lower level,
> in the SX-S and AX-S series of FPGAs, used as a method of hardening the
> flip-flops against single-event upsets.

You can always argue whether the -S series are COTS?! When using the A14100 FPGA (definitely a COTS device) you were also forced to use TMR for some applications, although the TMR had to be done by the designer.

> The logic of the Saturn V launch vehicle computer was TMR with built-in
> disagreement detectors. The memory was dual-redundant, however, with
> both memories hot and operated in parallel. Parity on each memory was
> used for error detection; the alternate memory was used for error
> correction. It automatically scrubbed the errors in hardware when
> accessed.

>> An interesting application here is Ariane 5, which uses a hot-redundant
>> pair of computers in a master/slave configuration. In the early years
>> it was planned to launch the Hermes spaceplane on Ariane 5, and a pool
>> of four computers in Hermes would then have replaced the standard
>> dual-redundant computers. This trick improved the overall system
>> reliability to values acceptable for manned missions.
>
> Do you have a reference for the Hermes? It would be interesting to look
> at that system and compare it to the Shuttle and to the X-38 work (I
> think I can dig up a good reference for that if you are interested).

I have not found anything published on the web (remember the work was done more than 10 years ago and the computer pool only reached the breadboard stage). The best officially published summary is probably:

Philippe David and Claude Guidal, "Development of a Fault Tolerant Computer System for the HERMES Space Shuttle", Digest of Papers: The Twenty-Third International Symposium on Fault-Tolerant Computing (FTCS-23), Toulouse, France, June 22-24, 1993, pp. 641-646.

Is there anything similar for the Shuttle and X-38?

--
th
#46
Operating systems used in spacecraft?
"Jan C. Vorbrüggen" wrote in message ...
>> A literature review on fault tolerant software shows that n-version
>> software is generally not very usefully or reliable.
>
> A literature review would show that there are some people who strongly
> disagree with that position:
> http://www.leshatton.org/Documents/Nver_1297.pdf
> Indeed, Airbus does, and even the FAA has certified its products for use
> in the USA...so the characterization as "not very useful or reliable" is
> likely incorrect.

Sorry, it appears my draft went through; I meant to have an IMO in there. There are a lot of papers both for and against. Everyone seems to agree that n versions (of software) aren't independent; also everyone agrees that n versions are more reliable than one, it's just a question of how much better. I have a hard time accepting more code to be maintained and developed as an answer. My experience is to *always* simplify as much as practical.

The problem is that in the real world, rather than in some experiment, n versions means money is not spent elsewhere, i.e. on more testing of the one version, or on a redesign so that there is simply less code to go wrong. Also, some suggest that n-version is cost-effective because you don't need to test the versions as much, which I think is bad.

Another point is the testing of the n versions themselves. If we don't have n-version unit tests (black and white box), are we introducing dependence? If we develop n versions of the unit tests (say 3) and n versions of the software (say 5) and test all combinations.....

My point is that we should be able to test thoroughly enough to be very confident of the software's behaviour in any situation (finite state machine and all that).

My motive is that I am currently starting construction of a large amateur rocket (with guidance etc.). I'm not up to software yet, but I want to be sure of safety and reliability. I am going to need to prove safety to the CAA (New Zealand's FAA), so I'm still considering n-version software (all hardware will have redundancy). But any real project *always* has resource constraints, so I want the time to be spent in the most effective way.

As for the FAA's approval, I just hope that each version still has to be up to the usual standard.

And since this was originally about OSes, a quote from the above paper:

> For example, this paper is being written on a Pentium PC using Word 7
> under Windows 95. We have found that a typical spread of software use on
> this system exhibits a defect every 42 minutes on average, with a
> serious failure around every 5 hours.

My Linux needs rebooting every power failure--about once every 3+ months. XFree needs a restart about once a week...I use it 10+ hours a day.

Greg
#48
Operating systems used in spacecraft?
I would say it's clearly useful and increases reliability, even if the different versions aren't completely independent, and the more so the more reliable the individual versions are (see Hatton's article). In addition, you do see in the literature that there are a substantial number of common-mode failures, both hardware and software (Ariane 501 is, at least in part, one of them), so trying to reduce them should be worth it in certain situations.

I would say that skimping on testing because one is relying on multiple versions is not the way to go, and limited resources are always a consideration. But these arguments are outside the determination of the merits of N-version programming as such.

Jan
#49
Operating systems used in spacecraft?
In article , gewi001@phy.auckland.ac.nz says...

> "Jan C. Vorbrüggen" wrote in message ...
>> Indeed, Airbus does, and even the FAA has certified its products for
>> use in the USA...so the characterization as "not very useful or
>> reliable" is likely incorrect.

There are at least two explanations that lead to a different conclusion:

(1) People don't understand the difference between hardware and software. Well-designed hardware redundancy does help deal with failures -- lots of historical examples demonstrate this. Hardware is designed / tested / built well enough that common-mode problems happen, but only rarely. At first glance, the same might seem to be true of software, but history says this just isn't so.

(2) Just because an N-version system was FAA certified does not mean that an (N-1)-version system wouldn't also have been certified.

> everyone agrees that n-versions is more reliable than 1, its just how
> much better. I have a hard time going for more code to maintained and
> developed as a answer.. My experience is to *always* simplify as much
> as practical.

Yet an N-version system is more complex than a 1-version system. A 3-version system is *much* more than 3 times as complex as a 1-version system. In addition to three separate systems, there is the system that takes three outputs and decides what the "real" output should be. This "decide the real output" logic is simple at first glance, but complex when you start looking at all the nasty details. This kind of complexity is a classic example of things that are very difficult to prove by testing, since there is a huge number of interesting test cases.

You also have to watch closely for "feature creep". If you build, say, a 5-version system, someone is going to want the system to survive a two-point failure. Then someone else will want the system to survive a four-point failure. Quickly you have a much, much more complex system.
> Also some suggest that n-version is cost effective because you don't
> need to test them as much, which i think is bad.

I can't believe any serious researcher believes that N-version systems need less testing.

> My point is that we should be able to test throughly enough to be very
> confidant of its behaviour in any situation (finite state machine and
> all that).

This is simplistic. The state of a non-trivial system requires many, many bits, and the number of permutations of those bits is astronomically large. The Apollo computer had 2048 words of state (modifiable) memory, 15 bits/word. That's 30720 bits, and no one thinks this was excessive. The number of states is 2^30720. If you can test one state per microsecond, the Universe will die of heat-death before you are done.

> And since this was originally about OS.. a quote from the above paper:
> For example, this paper is being written on a Pentium PC using Word 7
> under Windows 95. We have found that a typical spread of software use
> on this system exhibits a defect every 42 minutes on average, with a
> serious failure around every 5 hours.
> My Linux needs rebooting every power failure--about once every
> 3+ months. XFree needs a restart about once a week...I use it 10+ hours
> a day.

"The singular of 'data' isn't 'anecdote'."
--
Kevin Willoughby
Imagine that, a FROG ON-OFF switch, hardly the work for test pilots. -- Mike Collins
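The heat-death arithmetic above is easy to check. A hedged back-of-envelope (figures illustrative: one state tested per microsecond, a roughly 13.8-billion-year-old universe) shows the gap is not even close:

```python
# Back-of-envelope check of the state-explosion argument: even a tiny
# memory has vastly more states than there have been microseconds since
# the Big Bang. All figures are rough and for illustration only.

import math

BITS = 2048 * 15            # Apollo-sized erasable memory, in bits
STATES_LOG2 = BITS          # the memory has 2^BITS distinct states

MICROSECONDS_PER_YEAR = 1e6 * 60 * 60 * 24 * 365.25
UNIVERSE_AGE_YEARS = 13.8e9

# log2 of the number of microseconds elapsed since the Big Bang:
log2_budget = math.log2(UNIVERSE_AGE_YEARS * MICROSECONDS_PER_YEAR)

print(f"states: 2^{STATES_LOG2}, whole-universe test budget: ~2^{log2_budget:.1f}")
# Testing one state per microsecond for the age of the universe covers
# fewer than 2^79 states, leaving about 2^(30720 - 79) states untested.
```

The conclusion stands under any remotely plausible test rate: exhaustive state coverage is hopeless, so testing has to target behavior rather than raw state.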
#50
Operating systems used in spacecraft?
"Jan C. Vorbrüggen" wrote in message ...

>> http://66.113.195.245/richcontent/Re...cy/tmr_simplex.jpg
>
> I take it that the MTBF used to normalize time in that diagram is the
> MTBF of the simplex, right?

Yes. For a simplex system MTBF = 1/lambda, and since R = exp(-lambda*t), the reliability at t = MTBF becomes exp(-1), which is approximately 0.37.

--
th
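A quick numeric check of th's figure, under the constant-failure-rate assumption he states (the MTBF value below is arbitrary; any value gives the same result):

```python
# R(t) = exp(-lambda * t) for a constant failure rate lambda.
# At t = MTBF = 1/lambda, the exponent is exactly -1 regardless of units.

import math

mtbf = 100.0                      # arbitrary units; result is unit-free
lam = 1.0 / mtbf
r_at_mtbf = math.exp(-lam * mtbf)

print(round(r_at_mtbf, 3))        # a simplex unit has roughly a 37% chance
                                  # of surviving one full MTBF interval
```

This is why the normalized-time axis of the chart puts the simplex curve at about 0.37 when lambda*t = 1.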