#32
Operating systems used in spacecraft?
In article ,
Keith F. Lynch wrote:

Oh. I guess that redundancy is only meant to protect against hardware failures. The right thing to do would be to have them run completely different code. Of course the code would have to all be written from the same specification, and that specification could contain bugs.

Plus, experienced programmers writing code to the same spec have a tendency -- surprise, surprise -- to write similar code with similar bugs. It is not enough to have several guys implement to the same spec; it has to be part of the *specs* that they use different approaches, different tools, etc.

Occasionally there are simple and elegant ways of doing this -- e.g., a dual-redundant display system where computer #1 accepted air-data inputs and generated a display, and computer #2 read the display, worked backward to what the air-data inputs should have been, and cried "foul" if they didn't match fairly closely. But usually it's fairly difficult to enforce sufficient diversity.

The fifth Shuttle computer is running very different code written from a very different specification, to minimize the chance of common-mode bugs.

When does the Shuttle rely on that fifth computer, and ignore the other four?

When one of the pilots pushes the "switch to backup software" button. (It has never been used in flight.)

--
MOST launched 30 June; first light, 29 July; 5arcsec   |  Henry Spencer
pointing, 10 Sept; first science, early Oct; all well. |
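The display cross-check described in this post (one computer generates the display, a second works backward from the display to the inputs and complains on a mismatch) might be sketched roughly as follows. This is only an illustration: the function names, the needle scale factors, and the tolerances are all invented here and do not come from any real avionics system.

```python
# Hypothetical scale factors: needle angle (degrees) per unit of air data.
ALT_DEG_PER_FT = 0.036   # assumption: one revolution per 10,000 ft
IAS_DEG_PER_KT = 0.9     # assumption: one revolution per 400 kt


def generate_display(altitude_ft, airspeed_kt):
    """Computer #1: turn air-data inputs into display needle angles."""
    return {
        "alt_deg": altitude_ft * ALT_DEG_PER_FT,
        "ias_deg": airspeed_kt * IAS_DEG_PER_KT,
    }


def infer_air_data(display):
    """Computer #2: work backward from the display to the implied inputs."""
    return (
        display["alt_deg"] / ALT_DEG_PER_FT,
        display["ias_deg"] / IAS_DEG_PER_KT,
    )


def display_is_plausible(display, altitude_ft, airspeed_kt,
                         alt_tol_ft=50.0, ias_tol_kt=2.0):
    """Return False ("foul") if the display disagrees with the sensed inputs."""
    inferred_alt, inferred_ias = infer_air_data(display)
    return (abs(inferred_alt - altitude_ft) <= alt_tol_ft
            and abs(inferred_ias - airspeed_kt) <= ias_tol_kt)
```

The point of the arrangement is that the checker runs a different computation (the inverse mapping) rather than a second copy of the display code, so a bug in one path is less likely to be mirrored in the other.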
#33
In article ,
rk wrote:

On the other hand, we have a lesson from Skylab:

34. Lesson: Redundancy Design
When designing redundancies into systems, consider the use of nonidentical approaches for backup, alternate, and redundant items.

The guys at XCOR put it this way: "We don't consider it to be redundant to put down two of the same thing."

(For example, the EZ-Rocket FMEA revealed that having the engines stop on loss of electrical control power was a really bad idea: what happens if they cut out just after takeoff? So they stay on if that happens, with a redundant pure-mechanical cutoff system. And sure enough, on flight 11, all electrical engine controls went out!)

--
MOST launched 30 June; first light, 29 July; 5arcsec   |  Henry Spencer
pointing, 10 Sept; first science, early Oct; all well. |
#34
rk wrote:
For those in the DC area, there's a seminar coming up related to this topic:

From Anonymity to Ubiquity: A Study of Our Increasing Reliance on Fault Tolerant Computing
Elwin C. Ong
Massachusetts Institute of Technology
NASA Goddard, Office of Logic Design
December 9, 2003
. . .
http://66.113.195.245/richcontent/Tu...g_abstract.htm

I can't find the location or the time of this event.

--
Keith F. Lynch - - http://keithlynch.net/
I always welcome replies to my e-mail, postings, and web pages, but unsolicited bulk e-mail (spam) is not acceptable. Please do not send me HTML, "rich text," or attachments, as all such email is discarded unread.
#35
rk writes:
dave schneider wrote:

[synchronization requirements] Pentium 1's may have started pushing on picoseconds).

Pentium timing resolution is in picoseconds as controlling and being tolerant of clock skew is critical to making the new processors work (amazing engineering there). In fact, for the newer microprocessors, over the last few years, the variation in process across a single die (not even wafer or lot) is a parameter that can not be ignored any longer. So, they have local circuits distributed about the chip to perform clock edge placement to get the skews under control, accounting for variation in transistor performance.

The design of modern microprocessor clocks is a big engineering project all by itself. Here's a paper about the Intel clock design - note the 5 co-authors.

Itanium Processor Clock Design
Utpal Desai, Simon Tam, Robert Kim, Ji Zhang, Stefan Rusu
ISPD 2000

Here's an excerpt that shows the flavor of what is required:

Figure 1 shows the high level Itanium processor clock generation and distribution scheme. The core PLL receives differential clock inputs running at the bus clock frequency and generates a high frequency clock at twice the core clock frequency [3]. A divide-by-two circuit generates the high frequency core clock and the reference clock using the 2X-frequency clock from the PLL. Both these clocks, the core clock and the reference clock, are routed from the PLL via a balanced H-tree to 8 deskew clusters, each of which contains four distinct deskew buffers (DSK). The deskew buffer is a digitally controlled analog delay line, whose function is to detect and eliminate the skew between any two clocks and will be described in details later. The reference clock route stops at the deskew clusters. The core clock, on the other hand is routed as an input to the DSKs within a deskew cluster and generates output clocks shown as gclk in Figure 1.

The total die area is partitioned into 30 regions, and therefore only 30 out of the 32 gclk signals are used to generate the clocks required by a clock region. The gclk output from a DSK is buffered by bank of buffers called the regional clock drivers (RCD) located at the top and bottom of a clock region and distributed over the clock region via a clock grid. The circuits within a clock region tap directly into the overlaying grid to generate the local clocks required within the region.

[Pages of technical details omitted...]

In summary, the Itanium processor utilizes an active deskewing scheme to achieve a low clock skew. This scheme compensates for process variations which is not possible using a passive scheme. Total measured skew between the 30 clock regions in the Itanium processor is 28 ps. In contrast, the measured skew with the deskew buffer architecture turned off, would have been 110 ps.

And this is just for normal operation - there are lots of other features for debugging....

Lou Scheffer
#36
Henry Spencer wrote:
In article , Keith F. Lynch wrote:

Oh. I guess that redundancy is only meant to protect against hardware failures. The right thing to do would be to have them run completely different code. Of course the code would have to all be written from the same specification, and that specification could contain bugs.

Plus, experienced programmers writing code to the same spec have a tendency -- surprise, surprise -- to write similar code with similar bugs. It is not enough to have several guys implement to the same spec; it has to be part of the *specs* that they use different approaches, different tools, etc.

There are ways around this though - have one set of them use Fortran 77 and the other use, say, Haskell. Sure, if everybody writes in Ada, chances are very good they will use a similar design - OTOH you would have to go way out of your way to end up with a Fortran-like design in Haskell.

--
Sander

+++ Out of cheese error +++
#37
Would it be possible to get the presentation from that talk once it has been held?

Jan
#38
"rk" wrote in ...

Keith F. Lynch wrote:

Keith F. Lynch wrote:
Doesn't the Shuttle have three computers all running in parallel, with majority vote ruling?

Henry Spencer wrote:
It's actually four, with elaborate arrangements for cross-connecting things as desired. (There are only three of some of the more important subsystems, so if one computer is acting up, you can cross-connect to put the other three in charge of those.) The majority-rules voting is done in hardware. This works only because all four are running the *same* software, bit for bit identical, in lockstep. You couldn't get the necessary low-level timing synchronization on machines running different code.

Why would you need low-level timing synchronization? So what if one computer wants to take an action a tenth of a second before the others? Just how critical is timing? Will disaster happen if anything happens even a whole second too early or too late? Each action will take place as soon as two of the three computers decide it should happen.

Then two of the computers will decide that the computer that is 100 ms late (which is a looooooooong time) died and vote it off the island. Then the next two will quickly decide that each other has failed.

This seems to be a bad strategy if you have only two computers remaining. In manned (or "piloted" if you want to be sex neutral) missions typically you start with four computers that can vote one computer out of the pool, even in case of Byzantine errors (one computer becoming maliciously faulty and passing different faulty data to the others). Three remaining computers can still vote one out of the pool, but the result after this should be that the two remaining decide that only one of them continues, a "last survivor" strategy. Having one computer remaining instead of two halves the probability of a fault occurring. The only reason to have two remaining computers is if the application tolerates that a comparison error results in both computers being restarted.

--
th
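The pool-shrinking scheme th describes (four computers vote one out, three vote one out, then only a single "last survivor" continues) could be sketched along these lines. This is a simplified illustration under several assumptions: the function names are made up, outputs are compared for exact equality, the choice of which of the final two survives is arbitrary, and nothing here handles the Byzantine case where a faulty computer reports different values to different peers.

```python
from collections import Counter


def vote(outputs):
    """outputs maps computer id -> output value.

    Returns (majority_value, dissenter_ids), or (None, all_ids) when no
    value is held by a strict majority of the computers.
    """
    value, count = Counter(outputs.values()).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, sorted(outputs)
    return value, sorted(cid for cid, v in outputs.items() if v != value)


def reconfigure(pool, outputs):
    """Vote among the active pool and drop any dissenters.

    If exactly two computers would remain, keep only one of them
    (the "last survivor" strategy). Returns (voted_value, new_pool).
    """
    active = {cid: outputs[cid] for cid in pool}
    value, dissenters = vote(active)
    if value is None:
        return None, list(pool)  # no majority: pool left unchanged in this sketch
    survivors = [cid for cid in pool if cid not in dissenters]
    if len(survivors) == 2:
        survivors = survivors[:1]  # arbitrary choice of survivor
    return value, survivors
```

For example, starting from the pool `[1, 2, 3, 4]` with computer 3 dissenting, the pool becomes `[1, 2, 4]`; if computer 2 then dissents, the pool collapses straight to a single survivor rather than continuing as a pair.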
#39
"rk" wrote in
...

I think that whole strategy is bad; which is why it wasn't done. That algorithm falls apart early and continues to fall apart. Best to just not do that rather than patch it up. If I was down to two computers running together in general I would run them both. That gives you some level of error detection. But it does of course double the probability of a failure. What is interesting is that for long missions (long compared to MTBF) TMR has lower reliability than simplex.

It's true that majority voting computers are really messy, both from a hardware and software point of view, but it is one of the few ways to guarantee a very high degree of error detection (something you need to ensure the very low unreliability figures you need in missions involving humans). As these missions are quite short, a TMR concept doesn't degrade the reliability too much. For longer missions, like the TMR computers in the Zvezda Service Module, you need to be able to repair the computer pool while it is in operation.

Non-manned spacecraft in general very seldom use TMR configurations, although the latest COTS technology trends seem to promote these ideas; for instance, have a look at the new PowerPC board from Maxwell.

An interesting application here is Ariane5, which uses a hot redundant pair of computers in a master/slave configuration. In the early years it was planned to launch the Hermes spaceplane by Ariane5, and a pool of four computers in Hermes would then replace the standard dual redundant computers. This trick improved the overall system reliability to values acceptable for manned missions.

--
th

th wrote:

"rk" wrote in ...

Keith F. Lynch wrote:

Keith F. Lynch wrote:
Doesn't the Shuttle have three computers all running in parallel, with majority vote ruling?

Henry Spencer wrote:
It's actually four, with elaborate arrangements for cross-connecting things as desired. (There are only three of some of the more important subsystems, so if one computer is acting up, you can cross-connect to put the other three in charge of those.) The majority-rules voting is done in hardware. This works only because all four are running the *same* software, bit for bit identical, in lockstep. You couldn't get the necessary low-level timing synchronization on machines running different code.

Why would you need low-level timing synchronization? So what if one computer wants to take an action a tenth of a second before the others? Just how critical is timing? Will disaster happen if anything happens even a whole second too early or too late? Each action will take place as soon as two of the three computers decide it should happen.

Then two of the computers will decide that the computer that is 100 ms late (which is a looooooooong time) died and vote it off the island. Then the next two will quickly decide that each other has failed.

This seems to be a bad strategy if you have only two computers remaining. In manned (or "piloted" if you want to be sex neutral) missions typically you start with four computers that can vote one computer out of the pool, even in case of Byzantine errors (one computer becoming maliciously faulty and passing different faulty data to the others). Three remaining computers can still vote one out of the pool, but the result after this should be that the two remaining decide that only one of them continues, a "last survivor" strategy. Having one computer remaining instead of two halves the probability of a fault occurring. The only reason to have two remaining computers is if the application tolerates that a comparison error results in both computers being restarted.

--
rk, Just an OldEngineer
"In God we trust, all others bring data."
-- Framed plaque from the '60s, hanging in the Mission Evaluation Room at Johnson Space Center, downstairs from Mission Control.
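The remark that TMR is less reliable than simplex on missions long compared to MTBF can be checked with the usual textbook model. The sketch below assumes things the thread does not state: independent failures at a constant rate (so a single unit has reliability exp(-t/MTBF)), a perfect voter, and no repair. Under those assumptions a 2-of-3 system works with probability 3R^2 - 2R^3, which drops below R once R is less than 0.5, that is, for missions longer than about 0.69 MTBF.

```python
import math


def simplex_reliability(t, mtbf):
    """Probability that a single unit survives to time t (constant failure rate)."""
    return math.exp(-t / mtbf)


def tmr_reliability(t, mtbf):
    """Probability that at least 2 of 3 independent units survive to time t.

    Assumes a perfect voter and no repair.
    """
    r = simplex_reliability(t, mtbf)
    return 3 * r**2 - 2 * r**3


def crossover_time(mtbf):
    """Mission length beyond which TMR is less reliable than simplex."""
    return mtbf * math.log(2)


if __name__ == "__main__":
    for fraction in (0.1, 0.5, 1.0, 2.0):
        print(fraction,
              round(simplex_reliability(fraction, 1.0), 4),
              round(tmr_reliability(fraction, 1.0), 4))
```

This is consistent with the point made in the post: on short missions TMR costs little and buys error detection, while on long ones it only pays off if failed units can be repaired or replaced in operation.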
#40
"Keith F. Lynch" wrote:
Each action will take place as soon as two of the three computers decide it should happen.

"rk" wrote:

Then two of the computers will decide that the computer that is 100 ms late (which is a looooooooong time) died and vote it off the island. Then the next two will quickly decide that each other has failed.

My idea was that no computer would ever be "voted off the island". All three would continue to vote on each action. If they're slightly out of sync, no problem. Each action takes place when two vote to do it. If one fails completely, and consistently votes for bogus actions, or fails to vote at all, no problem, so long as the other two keep working.

The three computers would have different hardware, designed on different principles by different people. And they'd be running different software, designed on different principles by different people. Of course they have to all follow the same specification, and hence are subject to any bugs in that one specification. I can't see a way around that.

--
Keith F. Lynch - - http://keithlynch.net/
I always welcome replies to my e-mail, postings, and web pages, but unsolicited bulk e-mail (spam) is not acceptable. Please do not send me HTML, "rich text," or attachments, as all such email is discarded unread.
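The scheme proposed in this post (nobody is ever removed from the pool, and an action simply fires when a second computer votes for it) might look something like the sketch below. The event format, the function name, and the action names are invented for illustration, and the sketch sidesteps real issues such as how long an old vote should remain valid.

```python
def fire_times(votes, quorum=2):
    """Decide when each action fires under loose, unsynchronized voting.

    votes is an iterable of (time_s, computer_id, action) tuples.
    An action fires at the time the quorum-th distinct computer votes for it.
    No computer is ever excluded; lone or bogus votes just never reach quorum.
    Returns a dict mapping action -> firing time.
    """
    voters = {}  # action -> set of computer ids that have voted for it
    fired = {}   # action -> time at which quorum was reached
    for time_s, computer_id, action in sorted(votes):
        if action in fired:
            continue  # already executed; late votes are ignored
        supporters = voters.setdefault(action, set())
        supporters.add(computer_id)
        if len(supporters) >= quorum:
            fired[action] = time_s
    return fired
```

With three computers that are up to a quarter of a second apart, the action goes out when the second one agrees; a single computer voting for something the others never request has no effect:

```python
votes = [
    (10.00, "A", "open_valve"),
    (10.10, "B", "open_valve"),
    (10.25, "C", "open_valve"),
    (11.00, "C", "bogus_action"),
]
print(fire_times(votes))
```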