Friday 29 June 2012

So what went wrong at NatWest?


A lot of questions need to be asked over RBS’s computer problems – but if we want to stop this happening again, we need to listen to the answers.

An easy answer. But not a useful one.
So there we have it. For anyone who questions the value of software testing, here is a prime example of what happens when you let a bug slip through. I know we’ve already moved on to another banking scandal, but in case you’ve forgotten: many NatWest customers failed to get paid owing to a botched system upgrade. This has led to all sorts of consequences, and to the obvious question of how this could be allowed to happen.
Except that when people ask this question, I fear most of them have already decided on the answer, which is that RBS is a bank and therefore Big and Evil and responsible for everything bad in the world from Rabies to Satan to Geordie Shore. That answer might make people feel better but does little to stop this happening again. In practice, what went wrong is likely to have little to do with the credit crunch or banking practices and a lot to do with the boring old fact that any bank – no matter how responsibly it borrows and lends – runs on a highly business-critical IT system where any fault can be disastrous.

An easy claim from a software tester would be that RBS, as NatWest’s owner, must have gone cheap on the testing. I suspect it won’t be that simple. By its very nature, a banking IT system is going to be very complex – it has to be capable of handling thousands of transactions every second whilst keeping itself totally secure from hackers – so it would benefit from as much testing as possible. But, as any ISEB-qualified tester can tell you, exhaustive testing is impossible. There is always a trade-off between testing and cost, and testing has to be prioritised and targeted. This is taken for granted all the time, and it’s only when things go wrong that we ask why.
The fact remains, however, that something went seriously wrong. The Treasury Select Committee is already asking what happened, as is the FSA, so we should get more details soon. But how much we learn will depend on whether the right questions are asked. So here are my suggestions:
  • Was the upgrade necessary? Chances are, it was. Security loopholes are uncovered all the time, and a security update for a banking system can’t wait. But if it was an update for the sake of updating, that would be a different matter.
  • Were they using out-of-date software? I can’t comment on what banking software is and isn’t used, but I know of numerous systems that doggedly stick to Windows XP or Internet Explorer 6 in spite of being horribly error-prone in a modern IT environment. A business that becomes dependent on out-of-date components, and fails to bite the bullet and upgrade when it needs to, only has itself to blame when the testing can’t keep up with the bugs.
  • Was enough time allowed for testing? As a rule of thumb, every day of development should be matched by at least one day of testing. A common mistake, when software uses commercial off-the-shelf products as back-end components, is to do little testing in the belief that the commercial product is bound to work fine. In my experience, that gamble usually backfires.
  • Was everything tested that should have been tested? This might seem obvious, but it’s not unusual to concentrate on easy feature tests without paying much attention to more problematic areas such as performance or integration.
  • Was the timescale realistic? I ask this only because a common response to a software project overrunning is to cut the testing time. That is a stupid thing to do, but if the budget and timescale have been set in stone the project manager might have had no other option.
  • Did they carry on monitoring the update after it was implemented? Software that worked perfectly in the test environment can still fail in the live environment. Since it took them three days to identify the cause of the problem, they have some explaining to do here.
  • Was the testing correctly prioritised by risk? To state the obvious, when an area of the software is known to be likely to break, or the consequences of a component going wrong will be severe, you need to concentrate testing on this area (and not spend your time doing endless repetitive tests of low-risk areas). What’s not so obvious is identifying the high-risk areas in the first place (there’s a rough sketch of what risk-based prioritisation looks like after this list). And this brings me to a pertinent question.
  • Did the people in charge of the testing properly understand the job? This is where RBS may have a case to answer. The Unite union has suggested that RBS outsourcing its IT work abroad was to blame. I don’t believe in assuming off-shored work is cheaper, more expensive, sloppier, better quality, faster, slower or any other silly generalisation. But when you suddenly outsource your IT work to another country, you lose most of your in-house expertise – quite possibly the people who knew what the risks were and how to avoid them. In the worst-case scenario, the work may have ended up with people whose idea of testing is telling you everything’s fine.
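To make the risk-prioritisation point a little more concrete, here is a rough sketch of how a tester might weight effort by risk. Everything in it – the areas, the scores and the 20-day budget – is invented for illustration, and has nothing to do with RBS’s actual systems.

```python
# A rough sketch of risk-based test prioritisation.
# All areas, scores and the day budget are invented for illustration only.

from dataclasses import dataclass


@dataclass
class TestArea:
    name: str
    likelihood: int  # how likely the area is to break, 1 (low) to 5 (high)
    impact: int      # how severe the consequences would be, 1 (low) to 5 (high)

    @property
    def risk(self) -> int:
        # Simple risk score: likelihood multiplied by impact
        return self.likelihood * self.impact


# Hypothetical areas touched by a batch-payment upgrade
areas = [
    TestArea("overnight batch scheduler", likelihood=4, impact=5),
    TestArea("payment posting", likelihood=3, impact=5),
    TestArea("customer balance display", likelihood=2, impact=3),
    TestArea("internal admin reports", likelihood=2, impact=1),
]

# Share a fixed testing budget out in proportion to each area's risk,
# riskiest areas first.
test_days_available = 20
total_risk = sum(a.risk for a in areas)

for area in sorted(areas, key=lambda a: a.risk, reverse=True):
    days = test_days_available * area.risk / total_risk
    print(f"{area.name:30} risk={area.risk:2}  ~{days:.1f} test days")
```

The hard part, of course, is not the arithmetic but deciding the likelihood and impact numbers in the first place – which is exactly where in-house knowledge of the system earns its keep.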
However, it might be that RBS has perfect answers to all of the above. That would still not guarantee that nothing can go wrong. As exhaustive testing is impossible, there is always a chance that an untested area thought to be low-risk goes disastrously wrong anyway, and there is no foolproof way of stopping this. So I have two final very important questions:
  • Did they have a fall-back plan for a fault making it into the live environment? No matter how good your test plan is, you always have to think “What’s the worst that could happen?” The wrong answer is “But it definitely won’t happen.” The #1 mistake of the Titanic was not the design flaws that allowed the ship to sink, but the foolish assumption that as the ship was unsinkable there was no need to provide enough lifeboats. Did RBS do a Titanic and assume their tested upgrade couldn’t possibly go wrong? I doubt they would have been stupid enough to have no plan at all, but this leads me on to the other important question.
  • If they had a contingency plan, was it credible? In far too many cases, contingency plans are made to be reviewed, signed off and shelved, but never actually implemented. When the sole purpose of a contingency plan is to allow you to say “Yes, we have a contingency plan,” … well, you can imagine the rest. (For contrast, a rough sketch of a fall-back that could actually be executed follows this list.)
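By way of contrast, here is a minimal sketch of what an executable fall-back might look like: a post-release monitoring loop that rolls back and alerts a human when a health check fails. Every function name, check and timing below is a hypothetical placeholder, not a description of anything RBS actually runs.

```python
# A minimal sketch of an executable contingency plan: keep monitoring after
# go-live and roll back automatically if a health check fails.
# Every function and number here is a hypothetical placeholder.

import time

MONITORING_PERIOD_HOURS = 72      # keep watching well beyond go-live day
CHECK_INTERVAL_SECONDS = 15 * 60  # re-check every 15 minutes


def health_check() -> bool:
    """Placeholder: in real life this would confirm, say, that the overnight
    batch completed and that account postings reconcile."""
    return True  # the sketch assumes a healthy system


def roll_back_to_previous_release() -> None:
    """Placeholder: restore the last known-good version of the software."""


def page_on_call_team(message: str) -> None:
    """Placeholder: alert a human; a rollback should never happen silently."""
    print(message)


def monitor_release() -> None:
    """Watch the live system for a fixed period and fall back on failure."""
    deadline = time.time() + MONITORING_PERIOD_HOURS * 3600
    while time.time() < deadline:
        if not health_check():
            page_on_call_team("Post-release health check failed; rolling back.")
            roll_back_to_previous_release()
            return
        time.sleep(CHECK_INTERVAL_SECONDS)
```

The point is not the code itself but the attitude it encodes: the fall-back is wired into the release, rehearsed and ready to run, rather than sitting in a signed-off document on a shelf.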
But all of these questions rely on an attitude of “What went wrong?” first, and “Who went wrong?” a long way second. Unfortunately, there are already signs of the latter option being favoured. I’ve seen what happens when people blame each other for IT problems, and it’s not a pretty sight. Whatever story RBS offers, there are valuable lessons to be learned. I only hope someone’s interested in learning them.
