Subject: RISKS DIGEST 10.40
REPLY-TO: risks@csl.sri.com

RISKS-LIST: RISKS-FORUM Digest  Tuesday 18 September 1990  Volume 10 : Issue 40

        FORUM ON RISKS TO THE PUBLIC IN COMPUTERS AND RELATED SYSTEMS
   ACM Committee on Computers and Public Policy, Peter G. Neumann, moderator

Contents:
  Software Unreliability and Social Vulnerability (Perry Morrison)
  DTI/SERC Safety Critical Systems Research Programme (Brian Randell)
  Canadian Transportation Accident Investigation (Brian Fultz)
  Risk of Collision (Brian Fultz)
  Knight reference: `Shapes of bugs' (Pete Mellor)

  The RISKS Forum is moderated.  Contributions should be relevant, sound, in
  good taste, objective, coherent, concise, and nonrepetitious.  Diversity is
  welcome.  CONTRIBUTIONS to RISKS@CSL.SRI.COM, with relevant, substantive
  "Subject:" line (otherwise they may be ignored).  REQUESTS to
  RISKS-Request@CSL.SRI.COM.
  TO FTP VOL i ISSUE j:  ftp CRVAX.sri.com, login anonymous, AnyNonNullPW,
  cd sys$user2:[risks], GET RISKS-i.j ; j is TWO digits.  Vol summaries in
  risks-i.00 (j=0); "dir risks-*.*" gives directory; bye logs out.
  ALL CONTRIBUTIONS ARE CONSIDERED AS PERSONAL COMMENTS; USUAL DISCLAIMERS
  APPLY.  The most relevant contributions may appear in the RISKS section of
  regular issues of ACM SIGSOFT's SOFTWARE ENGINEERING NOTES, unless you
  state otherwise.

----------------------------------------------------------------------

Date: 15 Sep 90 07:30:18 GMT
From: pmorriso@gara.une.oz.au (Perry Morrison MATH)
Subject: Software Unreliability and Social Vulnerability (Re: RISKS-10.31)

Many thanks to all those who have contributed their views on issues raised
in my article with Tom Forester, "Software Unreliability and Social
Vulnerability", Futures, June 1990.  I don't have time or space to respond
in detail to all of the comments made, but I would like to pick up on some
of the main threads while they are still salient for most RISKS readers.

First, I was never completely happy with the blanket conclusion of removing
digital systems from life-critical applications, but it has served a
purpose in bringing software unreliability to the attention of the general
public in a fashion that has more rigour than the tabloids would provide
(I hope!).  In the process, it has brought what I believe are increasingly
important issues back to the arena of software/systems specialists.

Although the paper restricts itself to software/hardware failures, it is
highly influenced by Charles Perrow and the "Normal Accidents" argument
about complex systems.  As others have already stated, the central issue is
really system complexity -- the limits it places upon our capacity to
predict system behaviour.  Unfortunately, digital systems have two other
properties that exacerbate their unreliability: (a) it is easier to build
much more complex digital systems than analogue ones (or at least we have
so far), and (b) digital representation and computation provide more scope
for catastrophic/unpredictable behaviour than analogue techniques --
essentially the "untestable number of system states" and "each state is a
possible catastrophic discontinuity" arguments that we have (hopefully
correctly) borrowed from Dave Parnas and others.

Our motivation stemmed from our concern that we (others, I mean) have
already built systems of such complexity that we can no longer understand,
or predict in any adequate sense, their full range of possible behaviours.
This concern is exacerbated by the tendency of complex systems to also
control complex responsibilities or awesome energies and to involve huge
monetary investment.
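To make the "untestable number of system states" argument concrete, here is
a back-of-the-envelope sketch (in Python, and not drawn from the paper).
It assumes, purely for illustration, that a system can be idealised as n
independent binary state variables and that a distinct state could somehow
be exercised at a rate of one million states per second:

    # Back-of-the-envelope arithmetic only; the modelling assumptions
    # (independent binary state variables, one million states tested
    # per second) are invented for illustration.
    SECONDS_PER_YEAR = 60 * 60 * 24 * 365
    STATES_PER_SECOND = 1_000_000          # assumed, generously fast

    for n in (32, 64, 128):                # bits of internal state
        states = 2 ** n
        years = states / STATES_PER_SECOND / SECONDS_PER_YEAR
        print(f"{n:3d} bits of state -> {states:.2e} states, "
              f"about {years:.2e} years to visit each state once")

Even under these charitable assumptions, exhaustive coverage stops being
feasible well below a hundred bits of state -- a trivial amount compared
with any real digital controller -- which is the sense in which the state
space is untestable.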
Complex systems of this kind -- things like the Shuttle, some particle
accelerator conglomerates, and others -- exhibit very high levels of
unreliability.  Note, however, that it is not only the digital components
of such "hybrid" systems that are unreliable, but their analogue components
as well -- valves, etc.  Again, it is enormous complexity, and our
inability to get adequate intellectual handles on it, that is the root
problem.

However, for better or worse, society is steaming down the road toward
greater application of complex digital techniques... to everything --
financial systems, control of basic services, defence, communications....
In our view, this will lead to more complex digital systems with
concomitant levels of unreliability and sometimes quite disastrous results.
Our major purpose was to point out that in creating such powerful systems
we cannot have our cake and eat it too -- we should not be surprised that
when they fail (as they occasionally will, being human-created artifacts)
the consequences will be catastrophic.  The unreliability of complex
systems (which, as outlined above, tend to be digital) and the
power/energies bound up within them should lead us to expect such
consequences.

Whether this scenario is any better than other alternatives (whatever they
are) is a value judgement of some complexity itself.  It is a value
judgement because it relies upon subjective assessment of what constitutes
an acceptable level of risk for a given system and the responsibilities it
is tasked with.  Note also that we tend to ignore the notion that there is
more than one legitimate assessor of risk -- not just the developer or
client, but the users and those whom the system affects in direct or
indirect ways.  This is essentially the PR problem that nuclear reactors
and fly-by-wire systems have -- perhaps the psychological processes of
individuals are often awry, but they clearly assess the risks of FBW as
worrying (and the owners/operators merely mirror the commercial
consequences of this concern).  In this sense, successful design is as much
an educational/consultative process as an abstract discussion about
reliability.

I need to take some time to address the issue of risk in a way that I don't
believe any commentator has to this point.

1. It is important to note that most commentators in this discussion have
assumed that most system design/implementation happens in a relatively
controlled, client-expert relationship, where the expert has a fair degree
of control over the qualities that the system will have.  Often this is not
the case -- increasingly, subsystems are being designed and plugged into
ad hoc mega-systems (like the communications networks of the international
financial system), thereby adding to overall complexity without the
advantage of any overriding design to guide the development of the
subsystem (apart from protocols, say).  I believe that this is a dangerous
trend, and it is a characteristic that many communications nets and other
systems have -- they just grow like Topsy, and "design", if it can be
called that, is blatantly ad hoc.  Clearly, even conventional systems are
added to and expanded throughout their useful lifetime, but generally some
design documents are available and eventual limits to the system can be
reliably assessed.  Assessment of risks in the former situation, and
judgements of required levels of reliability, can become very difficult
indeed.
2. The application or non-application of digital techniques, and the
sub-issue of whether analogue systems are better in some applications, need
to be considered against some systemic cost/benefit analysis or analysis of
risk (how appropriate!).  Phil Maker of the Northern Territory Uni called
and pointed out to me that he designed defibrillators (things that
re-establish the normal electrical activity of the heart, I believe) and
that the chance of the people who need them dying without a defibrillator
is 50% in any given year.  Clearly, software unreliability is an
insignificant risk in that situation -- small numbers of people with high
existing risk and enormous potential benefits -- using systems of
comparatively low complexity.

On the other hand, a complex digital system controlling a nuclear reactor
may (possibly) provide efficiency benefits that are insignificant compared
to the consequences of a nuclear accident stemming from software
unreliability, i.e., a situation in which marginal benefits and high
complexity exist, coupled with catastrophic consequences in the (perhaps
low-probability) event of systemic complexity bringing about an accident.
**Perhaps** a less complex analogue system could be applied in this case,
if at all possible.  Note, however, that reducing complexity in a situation
with high background risk and/or potentially catastrophic consequences
(subjective, I know) is what I would (now) advocate.  (Note that the
example of a computer-based intensive care unit has all the qualities of a
good application of complex systems -- high existing risk, good potential
benefits, localised failure consequences.)  It just so happens (I believe)
that analogue systems, by their nature, limit how complex a system can
become.

On this point, shortly after the article was published, we received a call
from someone in authority at the Shell oil company.  They mentioned that
they were investigating the possibility of going back to analogue systems
for their drilling rigs because of their concerns about software
unreliability.

3. In answer to the implicit question -- "what else can we do but apply
digital systems to life-critical (or potentially catastrophic)
situations?" -- one should be clear that many "needs" or areas of
application are driven only by the possibilities that computers provide.
They are often not genuine, pre-existing needs.  For example, my father
hardly knew what cheques were for until later in life.  Only rich people
had a need for such things, and banks wouldn't wear the manual processing
costs for large numbers of cheque accounts.  Computers eventually allowed
that to happen, but I'm not sure that there was a screaming demand for
cheque books at the time.  Yes, it's convenient, and yes, the whole
electronic funds basis of our economies allows us to do wonderful things
with enormous ease.  At the same time it has created a system that is
subject to massive international abuse, theft and fraud, and (perhaps)
great risk of rapid systemic collapse given a sufficiently powerful event.

Given these benefits and arguably large risks, is the electronic financial
system a good idea?  The major point, however, is that NO-ONE ever had the
chance to evaluate the risks involved, because the system grew
incrementally without any central design base.  It just happened.  Some of
us may ask -- "how else could the international financial system be
controlled?"
Clearly, it can't be controlled by anything but digital techniques, since
they spawned it in the first place, but the point is that we now have a
whole new class of problems to deal with due to the development of the
system.

This may all sound like old hat and recycled Luddism, but as our article
pointed out, there's no substitute for technologists being concerned about
the implications of the systems they design and implement.  For example, I
imagine that Antarctic mining would create some really interesting
temperature/electromagnetic reliability problems for software and hardware
designers.  But for my money, Antarctic mining is irresponsible, with large
potential for dramatic environmental damage.  Obviously, we shouldn't
restrict ourselves to the mere technical problems involved in our work.  It
might sound quaint, but too many systems are simply technology driven
rather than need driven.

Stop the clock, go back to the caves, you ask?  No, but we shouldn't
complain when complexity makes unreliability a problem for implementors and
an issue for the general public.  If we actually EXPLAINED to the public
the design constraints we face and INVOLVED them in the design, they might
come to accept levels of unreliability that we think they won't tolerate.
On the other hand, they might throw up their hands and say "what the hell
do you think you're doing?".  Either way it would be very interesting.

4. Where does this leave us?  Merely with the ideology that lots of
technology commentators have peddled for a long time -- the notion that
technologists must properly evaluate the implications of what they do and
actively involve in the design process the people they affect.  This
includes, amongst other things, a consensus of opinion on what is a
sufficient level of reliability, given the array of risks.  Too often we
want prefabricated solutions/methodologies, like "what are the best methods
for providing an error-minimized system at minimal cost for applications of
type x?".  That is the engineer talking to his colleagues and designing for
an abstract user group or (real) corporate owner.  It is meaningless unless
the designer/implementor understands that such prescriptions must be
modified in the face of existing circumstances.  Those circumstances
include what the users and others affected by the system think.

In this same vein, some commentators seem to believe that we are hostile to
probabilistic measures of software reliability/quality.  This is not so --
it is just that probabilities by themselves are not enough.  They are
meaningless without some understanding of the reliability/quality required
in the intended application by the intended users.  That is, as already
explained, low levels of reliability may be an unacceptable source of risk
in some situations (reactors?) but not in others (defibrillators), even
when the software has an equivalent probability of failure.

How do you know how reliable the software must be?  How do we know if
digital systems are best, or analogue?  The answer is that, as a
generalization, law or principle, we probably don't.  But in the end we
should remember that we are designing for people, and if they are included,
fully informed, and satisfied with the safety/reliability/performance
provisions in a design, then that should be enough.  As always, the
customer is always right.

Perry Morrison
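A purely illustrative aside on that last point: the difference between the
defibrillator and the reactor can be put in crude expected-value terms.
The little Python sketch below is not from the Forester/Morrison article,
and every number in it (the failure probability, the benefit and harm
figures, and the arbitrary "lives at risk" units) is invented for the sake
of the example:

    # Crude expected-value comparison for one year of operation.
    # All figures are assumed, in arbitrary "lives at risk" units.

    def expected_net_benefit(p_fail, benefit_if_ok, harm_if_fail):
        return (1 - p_fail) * benefit_if_ok - p_fail * harm_if_fail

    P_FAIL = 1e-3   # identical assumed annual failure probability

    # Defibrillator: large benefit (the patient already faces ~50%
    # annual risk), and a failure harms only that one patient.
    defib = expected_net_benefit(P_FAIL, benefit_if_ok=0.5,
                                 harm_if_fail=1.0)

    # Reactor control: marginal efficiency benefit, but catastrophic,
    # widely distributed harm if the software contributes to an accident.
    reactor = expected_net_benefit(P_FAIL, benefit_if_ok=0.01,
                                   harm_if_fail=10_000.0)

    print(f"defibrillator: {defib:+.4f}")   # clearly positive
    print(f"reactor:       {reactor:+.4f}") # clearly negative

The same failure probability produces a clearly positive expected benefit
in one case and a clearly negative one in the other -- which is all that
"probabilities by themselves are not enough" amounts to.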
------------------------------

Date: Mon, 17 Sep 90 9:38:19 BST
From: Brian Randell
Subject: DTI/SERC Safety Critical Systems Research Programme

The following article is reprinted in its entirety from the British
Computer Society's Computer Bulletin (vol. 2, part 7, Sept. 1990, p.2).
It marks the launching of a new joint initiative by the Department of
Trade and Industry and the Science and Engineering Research Council to
fund industry/university collaborative research projects relating to
safety-critical systems.  I personally very much like the tone of the
article, and hope that it can be matched by the research projects that
get selected.

Brian Randell

Brian Randell, Computing Laboratory, University of Newcastle upon Tyne, UK
EMAIL = Brian.Randell@newcastle.ac.uk   PHONE = +44 91 222 7923
FAX = +44 91 222 8232

: Safety Critical Systems - Bob Malcolm
:
: On the stairs and landing of my cottage I have four light-bulbs.  A few
: weeks ago the landing went dark.  No, it was not a blown fuse, but the
: fourth light bulb gone.  As technical co-ordinator for the DTI-SERC
: Safety Critical Systems Research Programme I smiled - wanly, as they
: say.
:
: I had failed to replace the three bulbs which had failed previously.  I
: had fallen prey to one of the classic safety critical system problems
: - failure to maintain redundant channels.  Why did I do it?  Why do
: people store rubbish in fire-escapes?  It is no good saying that 'They
: shouldn't do it'.  They do.  What are designers to do about it?
:
: The challenge in building safety critical systems is to achieve safety
: in the real world.  We may not understand the world and may not be able
: to describe it properly, populated as it is by real and really
: perverse people - designers and managers as well as users and vandals.
:
: It is not enough to think only of 'meeting a specification',
: especially when the majority of failures arise because of
: specification errors.  Nor is it sufficient to add something about
: getting the specification right, since we can never be sure that we
: have succeeded.
:
: One view of safety critical systems is that all failures have their
: origins in human error of some kind - whether a design or operational
: mistake, a failure to understand either a system or its failure modes,
: or a failure at a higher level to appreciate and accommodate the
: likelihood of these errors.
:
: To produce safety arguments for systems based on reasoning about a
: wide range of technological and human factors requires not only that
: we compare chalk and cheese, but that we add them together in some way.
: We will need to balance arguments about the style of design with
: arguments about its correct construction; arguments about proof with
: arguments about likelihood; arguments about comprehension with
: arguments about rigorous notation.  Who knows - maybe if we succeed in
: general then we will succeed in the particular of reconciling
: formalists and 'the rest'.  To do that we would also need to make
: explicit the rigour which is presently only implicit in so-called
: 'informal' methods and tools.
:
: I doubt whether we can do all these things without the help of
: psychologists, operational researchers, even philosophers.  Computer
: technologists and experts in application domains must work with them
: to devise better global solutions and, as far as is possible and
: sensible, common solutions.
: We must avoid the 'displacement' problem, where fixing one problem
: simply creates a different and perhaps worse problem elsewhere.  And we
: must find ways of quantifying the effectiveness of what we do - whether
: as a contribution to a 'safety argument' or to a probabilistic risk
: prediction.
:
: But we need the cooperation of all these different disciplines not
: just for the particular skills which they individually bring to the
: party.  We need them because of their very diversity - to enrich our
: intellectual stock.
:
: In the new DTI-SERC programme we must produce results which will be of
: commercial value in the medium term.  To do this we must have a firmer,
: better-reasoned basis for what we do.  We might stand a chance if we
: are prepared to open our minds.
:
: Hard scientists must stop treating soft science as non-science: soft
: scientists must not feel disenfranchised.  Formalists must seek the
: value of what they call informal, and help in its formalisation;
: informalists must find the justification for what they do.
:
: Safety critical systems pose problems which push us toward
: co-operation between disciplines.  I hope that industry and academia
: seize the opportunity to do this within the new programme.
:
: [Bob Malcolm is an independent consultant in the field of systems and
: software engineering.  He currently holds a number of contracts for
: consultancy in research and technology transfer, and is technical
: co-ordinator for the new DTI-SERC Safety Critical Systems Research
: Programme.  He is a visiting professor at City University, London.  BR]

------------------------------

Date: Sat, 15 Sep 90 07:31:35 EDT
From: Brian_Fultz@carleton.ca
Subject: Canadian Transportation Accident Investigation

The above body in Canada is responsible for making reports on "accidents"
and "aviation occurrences".  They produce reports on safety-related
problems.  With reference to report 87-A74947, "Risk of collision between
Delta Air Lines Lockheed L-1011 N1739D and Continental Air Lines Boeing
747-200 N608PE, North Atlantic, 52 degrees 14 North, 34 degrees 00 West
Long