Tuesday, April 28, 2009

Why do we test?

Altimeter and autopilot possible cause of plane crash near Schiphol

I read an interesting article from the BBC, headlined "Altimeter 'had role' in air crash".  In reporting a news conference conducted by Dutch Safety Board chairman Pieter van Vollenhoven, the article reads in part:
...the plane had been at an altitude of 595m (1950ft) when making its landing approach to Schiphol airport. But the altimeter recorded an altitude of around ground level.  The plane was on autopilot and its systems believed the plane was already touching down, he said.

The automatic throttle controlling the two engines was closed and they powered down. This led to the plane losing speed, and stalling.

I am surprised that an autopilot would throttle back engines based on only one instrument, the altimeter.  I would have assumed additional criteria would need to be met - perhaps having weight on the landing gear.

The article raises other interesting points, and can be found here:
http://news.bbc.co.uk/2/hi/europe/7923782.stm

Turkish Airlines disaster and the Altimeter

As you probably know, a Boeing 737-800 with 127 passengers and seven crew crashed near Schiphol airport in the Netherlands, killing nine and injuring many others.  The details starting to emerge indicate that the left altimeter was faulty: at about 2000 feet it told the autopilot that the aircraft was suddenly at -8 feet.  The autopilot immediately cut power to the engines, stalling the aircraft in mid-air.  Because of the weather, the pilots had to rely on their instruments and could not see what was wrong until the stall indicators came on.

What I would like to know is how software testing is done at Boeing.  I fail to see how the software could fail to spot a problem and still carry out the landing (a hypothetical sketch of such checks appears after the list):

1) If the two altimeters give very different readings,
2) If one of the altimeters switches instantly from reading 2000 feet to -8 feet,
3) If one of the altimeters reads a negative number.

If the software had warned them, I'm sure these pilots would not have died, along with several passengers.
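
It would presumably take only a small amount of code to cross-check the readings.  Purely as a hypothetical sketch (I have no knowledge of Boeing's actual avionics software; every name and threshold below is invented for illustration), checks along the lines of the three above might look something like this in C:

    #include <stdbool.h>
    #include <math.h>

    /* Hypothetical plausibility checks on two redundant altimeter readings
       before the autothrottle is allowed to act on them.  All names and
       thresholds are invented for illustration only.  dt_sec is the time
       since the previous sample and is assumed to be greater than zero.    */
    #define MAX_DISAGREEMENT_FT  100.0   /* 1) the two readings must roughly agree   */
    #define MAX_RATE_FT_PER_SEC  500.0   /* 2) no instantaneous 2000 ft -> -8 ft jump */
    #define MIN_PLAUSIBLE_FT      -5.0   /* 3) clearly negative altitude is absurd    */

    bool altitude_is_trustworthy(double left_ft, double right_ft,
                                 double prev_left_ft, double dt_sec)
    {
        if (fabs(left_ft - right_ft) > MAX_DISAGREEMENT_FT)
            return false;                 /* the two altimeters disagree          */
        if (fabs(left_ft - prev_left_ft) / dt_sec > MAX_RATE_FT_PER_SEC)
            return false;                 /* the reading changed impossibly fast  */
        if (left_ft < MIN_PLAUSIBLE_FT || right_ft < MIN_PLAUSIBLE_FT)
            return false;                 /* negative altitude while airborne     */
        return true;
    }

A reading that fails such a check could trigger a warning and hand control back to the pilots, rather than silently retarding the throttles.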

  [Somewhat similar comment from Ben Blout.  Also, there has been extensive discussion on this topic around the Net.  Having two of anything always raises the problem of what to do when they disagree.  (Leslie Lamport's paper on Buridan's Ass comes to mind.)  That problem suggests that having THREE, and seeking consensus, might be a better strategy.  But sanity checking is also a good idea, and trusting absurd readings is not wise.  Perhaps the biggest problem is again that neither autopilots nor people are infallible, but the lack of synergy between the two can be even more debilitating. ]

Google Calendar as a single point of failure?

I keep most of my scheduled appointments on Google Calendar (although it is a pain since I have to be careful not to put proprietary information in my appointment descriptions).  I missed a teleconference today, because Google Calendar said it was at 4pm, and when I showed up no one was there.  Going back through my notes I discovered the meeting was supposed to be at 3pm, and I had set it up as a recurring event.  I then discovered from reading Google Help forums that there are known problems - when Daylight Savings Time started, recurring events got moved either an hour earlier or an hour later, depending on whether you were the originator of the meeting or an invitee.

There was another symptom I noticed - my meeting was at 3pm (Standard time) which got shifted to 4pm (Savings) time.  When I tried to change the time back to 3pm, it had no effect - presumably because it thought the meeting was already scheduled for 3pm (Standard).  So to make it show up on my calendar at 3pm (Savings), I had to change the schedule to 2pm (Standard).

Once Google fixes the problem, I have to remember to move the scheduled time back again, so I show up at the right time!

RISK?  When there's a shared calendar infrastructure and it's buggy, everyone ends up at the wrong time.  Something similar happened in 2007, when the US switched to Savings time earlier than in previous years - Microsoft and other vendors rushed out patches to handle the time change.

  [Among the many other problems discussed here, this one seems to recur.]

A firmware glitch in router software: 32-bit integer handling

I am not sure whether this is a computer risk in the general sense, but I feel we will see more of this type of problem (32-bit signed integer vs. unsigned integer) in embedded devices for consumer electronics and elsewhere for some time to come, so I am reporting it here.

First, the public facts.

NEC, a large electronics company, and its subsidiary NEC Access Technica have announced that their line of DSL routers with an IP-phone feature, used by many Japanese ISPs including the giants NTT East and NTT West, has a software bug: after 2485 days of continuous use, the router no longer allows telephone functions.  (Internet access remains usable, though.)

The problem can be fixed by a firmware upgrade; alternatively, power-cycling the router pushes the problem another 2485 days into the future.

(Actually, I saw somewhere that NEC and NEC Access Technica were looking into problems reported on a different line of routers when they learned of the potential for similar problems, and when they checked the firmware of other products they found the newly reported problem.  I am not sure where I read it, and I can't locate it any more.  Maybe it was in the letter an ISP sent to notify me of the problem and urge me to update the firmware of the affected router.)

The cause of the bug?

The various reports I read didn't mention the real cause of the "software" problem, but I guessed that it must be related to the use of a 32-bit integer counter inside the firmware.

To wit, the problem interval is T = 2485 [days] * 24 [hours/day] * 3600 [s/hour] = 214704000 seconds.

One day longer, T' = 2486 [days] = 214790400 seconds.

Also, 2 ** 32 = 4294967296
      2 ** 31 = 2147483648

We can see that the following holds:

     (T * 10)  < 2**31 < (T' * 10) < 2 ** 32

My conclusion:

A certain counter, kept as 32-bit data, is incremented every 1/10 second within the software, starting from 0 at power-up.

Internally, the firmware regards this data as "signed", so somewhere during the 2486th day of continuous operation (about half a day after the 2485-day mark, matching the reported figure) the counter suddenly becomes NEGATIVE, wreaking havoc within the code and rendering the phone function useless.
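
Here is a minimal sketch of the failure mode I am guessing at, assuming a counter that ticks ten times per second; the variable names and the demonstration itself are mine, not NEC's:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* 32-bit uptime counter, ticking 10 times per second since power-up. */
        uint32_t ticks = 2147040000u;    /* 2485 days * 86400 s/day * 10 ticks/s */
        ticks += 600000u;                /* roughly 16.7 more hours of uptime    */

        /* If the firmware then interprets the same 32 bits as signed ...        */
        int32_t signed_view = (int32_t)ticks;   /* implementation-defined, but it
                                                   comes out negative on the usual
                                                   two's-complement hardware      */
        printf("unsigned view: %u\n", (unsigned)ticks);   /* 2147640000            */
        printf("signed view:   %d\n", (int)signed_view);  /* a large negative value */
        return 0;
    }

Any comparison of this now-negative "uptime" against timers or timeouts in the telephony code would then misbehave, while a power cycle resets the counter to 0 and buys another 2485 days.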

Observation:

When I was checking web pages to write this submission, I noticed a bug report that a different line of routers, used for IP phones over optical fiber, had a similar problem after 249 days.  This was found last summer.  Obviously 249 days is much shorter than 2485 days, and some people suffered from that bug last year.  In this case, I think the counter is incremented every 1/100 second (2**31 ticks at 100 per second is about 21474836 seconds, or roughly 248.5 days).  Maybe this discovery led to the massive review of the router firmware.

I noticed that similar problems concerning integer size and signedness have occurred when:

  * file sizes began exceeding the 2GB limit (again the 31/32-bit boundary),
  * address spaces were extended from 32 bits to 64 bits.

I noticed the first kind of problem, with file sizes, starting around the time Solaris and other POSIX-based systems began offering large-file support.  To this day, unmaintained shareware on Windows often has problems handling a large file (over 2GB), such as an ISO image.  The symptoms vary, but one symptom that suggests a signed integer is being used to check the remaining disk space is a message saying there is not enough space even though there is more than enough (actually more than 2GB free).  I have checked that in many cases, if I create a very large dummy file to shrink the remaining free space to less than 2GB, these unmaintained programs proceed without a hitch, or else run into other size-related problems later.
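
As a guess at the kind of check that produces that symptom (the names and numbers below are invented; I have not seen the source of any of these programs), storing free disk space in a signed 32-bit variable is enough:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t free_bytes = 3ULL * 1024 * 1024 * 1024;   /* 3 GB actually free      */
        uint64_t needed     = 700ULL * 1024 * 1024;        /* e.g. a 700 MB ISO image */

        /* Truncating the free space into a signed 32-bit variable makes any
           value of 2 GB or more appear negative (or wrongly small).          */
        int32_t free_32 = (int32_t)free_bytes;

        if (free_32 < (int32_t)needed)
            printf("Error: not enough disk space\n");      /* the bogus message */
        else
            printf("OK, proceeding\n");
        return 0;
    }

Shrinking the real free space below 2GB (for example with a large dummy file) brings the truncated value back into a positive range, which is why that workaround makes the error message go away.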

I noticed the second kind of problem, with address spaces, when Linux was ported to the 64-bit x86 architecture: many device drivers as well as applications began failing.  Solaris for x86 also saw many third-party drivers face similar problems when the 64-bit address space was supported on that architecture.  (Solaris for SPARC has supported a 64-bit address space for many years, and I don't remember particular problems there.)

We can now add the use of timers/counters to the list of causes that may trigger careless errors in applications.  We have already seen cases where a counter exceeds its allocated bits and wraps back to 0, causing software problems: I think early versions of Windows NT had a problem requiring a reboot every 49 days or so for certain applications.  (A 32-bit millisecond tick counter wraps after 2**32 ms, which is about 49.7 days.)
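
Here is a sketch of how such a wrap typically bites, and the usual defensive idiom; this is a generic illustration, not actual Windows code:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* A 32-bit millisecond tick counter wraps after 2**32 ms (~49.7 days).     */
        uint32_t start    = 4294900000u;        /* about 67 seconds before the wrap */
        uint32_t deadline = start + 50000u;     /* 50-second timeout, not yet wrapped */
        uint32_t now      = start + 200000u;    /* 200 seconds later; counter wrapped */

        if (now > deadline)                     /* naive comparison breaks across the wrap */
            printf("naive check: timeout detected\n");
        else
            printf("naive check: timeout MISSED because the counter wrapped\n");

        if ((uint32_t)(now - start) >= 50000u)  /* unsigned difference survives the wrap */
            printf("robust check: timeout detected\n");
        return 0;
    }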

As more consumer devices (as well as industrial machines) are equipped with 32-bit CPUs, and more programmers accustomed to the luxury of 32-bit CPU programming under non-embedded OSes begin to develop software for embedded devices, we may see similar problems in the embedded-system space more often.  We have seen many already, and I am afraid this trend will continue.

BTW, routers are complex devices, and some literally have Linux inside.  I was somewhat surprised recently to see the GNU General Public License reproduced verbatim in the printed manual of my Toshiba hard-disk recorder.  Obviously, certain code used inside is based on GPL'ed code.  When I think about the growing pains that drivers and the OS itself went through when larger file sizes and extended address spaces were introduced, I am uncomfortable trusting the complex operations of such products unless software patches are readily available.  But how are we supposed to "patch" hard-disk-recorder software?  If the power cord is accidentally removed during patching, what happens?  Does the hard-disk recorder store the "patch" program in a separate place, preferably a ROM or something, so that glitches during patching cannot corrupt that software?
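
One common answer in embedded practice, and only a guess as to what any particular recorder actually does, is a dual-bank ("A/B") firmware layout: the new image is written into the bank that is not currently running, and a small boot loader in protected storage picks whichever image still verifies.  A sketch:

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Generic dual-bank firmware selection; not a description of any real product. */
    struct image {
        uint32_t version;
        bool     verifies;      /* stands in for a real CRC/signature check */
    };

    /* The boot loader itself lives in ROM or a write-protected sector, so it
       survives even if a power cut corrupts the bank being updated.          */
    const struct image *boot_select(const struct image *a, const struct image *b)
    {
        if (a->verifies && b->verifies)
            return (a->version >= b->version) ? a : b;   /* prefer the newer image  */
        if (a->verifies) return a;
        if (b->verifies) return b;
        return NULL;                                     /* both corrupt: recovery  */
    }

    int main(void)
    {
        struct image bank_a = { 12, true  };   /* old image, still intact            */
        struct image bank_b = { 13, false };   /* new image, update was interrupted  */
        const struct image *chosen = boot_select(&bank_a, &bank_b);
        if (chosen)
            printf("booting firmware version %u\n", (unsigned)chosen->version);
        else
            printf("no valid image: entering recovery mode\n");
        return 0;
    }

With such a layout, an interrupted patch can only leave the inactive bank corrupt, and the device still boots the previous firmware.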

Embedded-system programming requires a somewhat different mindset, but I am afraid that not many programmers are trained to develop such a mindset, either in the educational system or in industry in general.  I feel this way because the lessons of the NT problems do not seem to have been learned by would-be developers today.

I am reporting the problem here today so that, when a similar problem happens again in the future, someone can at least point out that it has been public knowledge for a long time.