Monday, 9 July 2012

Is it a Glitch or a Meltdown?

Whether the most recent problem faced by the RBS group gets described as a mere “glitch” or a catastrophic “meltdown” depends on who you talk to. As you might expect, Stephen Hester, CEO of RBS, firmly sticks to the former.

We still don’t know the full details of what went wrong at RBS, NatWest and Ulster Bank. What we do know is that a change to an overnight batch caused a problem which meant millions of payments were never made and some people could not log in to online banking systems. The transactions then backed up, so by the time the problem was fixed there was far too much load for the system to cope with. Millions of people were affected for days on end, with some very serious consequences such as being unable to pay mortgages, bills and medical costs. The unfortunate customers of Ulster Bank were shoved to the back of the queue for a fix, and reports today say that an estimated 100,000 people are still unable to access their money.

I don’t understand how a change to a payment batch process would have any impact on the login functionality of a web interface. Does this mean there were two separate problems that occurred simultaneously, or perhaps that RBS purposely locked people out of their accounts to minimise the number of transactions they needed to process?

I don’t want this posting to be (only) about my conspiracy theory or just an easy swipe at the banks, so I wanted to suggest a solution to what I see as a deep-seated problem, based on my experience of working in banks and specifically at RBS. The kind of problem that RBS is dealing with could have happened many times over in any number of systems, many of which I used to be responsible for the quality of. It was just a matter of time, and I think there are three root causes.
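Before digging into those causes, it is worth a quick illustration of why a stalled overnight batch snowballs so badly. Here is a toy sketch in Python; every figure in it is invented purely for the example:

    ARRIVALS_PER_HOUR = 100_000    # payments arriving per hour (invented figure)
    CAPACITY_PER_HOUR = 120_000    # normal processing capacity (invented figure)

    backlog = 0
    for hour in range(72):             # batch stalled for three days
        backlog += ARRIVALS_PER_HOUR   # arrivals keep coming; nothing drains

    print(backlog)                     # 7200000 payments waiting

    headroom = CAPACITY_PER_HOUR - ARRIVALS_PER_HOUR
    print(backlog / headroom)          # 360.0 hours of catch-up at normal headroom

With only twenty per cent headroom over normal arrivals, a three-day stoppage takes roughly a fortnight to drain, which fits the days of disruption customers reported.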

1. The risk of regression has changed and is not fully appreciated

All financial institutions make money out of taking risks, whether it is a trader taking a risk on a share price rising, a mortgage manager taking a risk on whether someone will default on their loan, or an insurer taking the risk that a policyholder will claim on their car insurance. There are analysts who help determine the market risk of these things, and this in turn helps make money. The same attitude, that taking risk is a good thing, is applied to the management of their IT systems. Unfortunately, the same detailed risk analysis is not carried out when a change is made, and as RBS has just found out, things can go badly wrong.

The problem facing banks is that things have changed in the past few years. Since Twitter and Facebook gave everybody a soapbox and a global reach, it has become impossible to contain the fallout of a problem like this. Couple this with the zeal with which the media and general public pounce on any opportunity to bash the bankers since the Financial Crisis, and the reputational and financial impact of problems increases exponentially. If we accept that:

Risk = Probability of Occurrence x Impact of Occurrence

then, since risk is directly proportional to the impact of occurrence, it becomes clear that something must be done to mitigate the exponentially increased risk. Unfortunately, in too many areas of a large organisation such as a bank, the decision makers have not recognised this change, have not updated their risk management strategy, and are still managing IT with outdated beliefs and techniques.
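To put numbers on that formula, here is a trivial sketch; the probability and impact figures are made up, and only the ratio between them matters:

    def risk(probability, impact):
        # Risk = Probability of Occurrence x Impact of Occurrence
        return probability * impact

    p_failure = 0.001                        # chance a routine change goes wrong (invented)
    impact_pre_social_media = 1_000_000      # contained fallout (invented)
    impact_post_social_media = 100_000_000   # amplified fallout (invented)

    print(risk(p_failure, impact_pre_social_media))   # 1000.0
    print(risk(p_failure, impact_post_social_media))  # 100000.0 -- same change, 100x the risk

The probability has not changed at all; the hundredfold increase in impact alone makes the very same change a hundred times riskier, which is why the spend on mitigation needs to rise with it.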

2. The cost benefit of automated regression testing and performance testing doesn’t stack up

When changes are made to a system, the best and easiest way to regression test and performance test it is to run a set of automated tests. This isn’t always easy: maintaining the scripts takes time and money, the required effort comes in high peaks and low troughs, and in a fast-paced environment like a bank, system changes are rarely documented. Skills can be difficult to find and very difficult to mobilise. As a result, releases often go out having had fewer automated regression and performance tests than there were bankers in this year’s Honours List. However, it is a fallacy to say that the cost benefit does not stack up. As Stephen Hester has just found out, the benefit of running a regression test on the recent change, and a performance test to ensure that the batch payment system in question could handle a 24-hour backlog of transactions, would have far outweighed the costs he has just incurred.
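To make that concrete, here is a minimal sketch of the two kinds of check, written as pytest-style tests. The functions make_payment and process_batch are hypothetical stand-ins for a real batch payment system, and the volumes and thresholds are invented for the example:

    import time

    # Hypothetical stand-ins for the real batch payment system.
    def make_payment(account, amount):
        return {"account": account, "amount": amount}

    def process_batch(payments):
        # Settle every payment with a positive amount; return the count settled.
        return sum(1 for p in payments if p["amount"] > 0)

    def test_regression_batch_settles_all_valid_payments():
        # Regression check: a routine change must not break normal settlement.
        batch = [make_payment("A1", 100), make_payment("A2", 250)]
        assert process_batch(batch) == len(batch)

    def test_performance_backlog_clears_within_window():
        # Performance check: a 24-hour backlog (scaled down here) must clear
        # well inside the overnight processing window.
        backlog = [make_payment("A%d" % i, 10) for i in range(100_000)]
        start = time.perf_counter()
        settled = process_batch(backlog)
        elapsed = time.perf_counter() - start
        assert settled == len(backlog)
        assert elapsed < 5.0  # invented threshold for this toy example

Tests like these take minutes to run on every release; set against the bill RBS is now footing, the cost benefit argument looks rather different.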

3. There is a disconnect between Run the Bank and Change the Bank

Many organisations have an operational expenditure (opex) budget and a capital expenditure (capex) budget, and the banks are no exception, often calling these Run the Bank and Change the Bank. In my experience this leads to a disconnect between the project delivery teams and the maintenance team once a release goes live. For example, it is not in the project manager’s interests to create a regression pack as part of the project, as this absorbs resource and budget. However, once the project goes live, the knowledge and skills to create the regression pack are no longer available. In most large banks this problem is exacerbated because multi-vendor outsourcing has led to completely different companies, with completely different loyalties, doing the delivery and the maintenance.

There is also a lack of accountability for poor quality code. Poorly written contracts and badly managed SLAs mean that development teams are not held to account for bugs or production incidents, or, even worse, get paid to fix their own poor quality code. When cost per head is the benchmark by which CTOs and their IT organisations are judged, they have neither the responsibility, the incentive nor the capability to build quality into a product.

What is the Answer?

So, if the question is “How can we ensure that system changes don’t cause huge embarrassment to organisations, especially banks, if and when they go wrong?” what is the answer?

I think the answer is managing an application from cradle to grave, not just managing the delivery lifecycle and throwing a problem over the fence. It’s understanding the Total Cost of Ownership of a system, including the cost of thorough testing, good practices, maintenance and future releases, even if that means delivering fewer changes and less functionality. It’s accepting that things can and will go wrong if too many risks are taken. It’s seeing the bigger picture and understanding that upfront investment in better practices and quality controls can provide future cost savings and a return on that investment. It’s making use of tools to do more testing, more cost effectively. It’s creating efficiencies between each and every person involved in the lifetime of a system. It’s sharing information between relevant stakeholders and making sure that everyone involved is a stakeholder. It’s understanding that a larger proportion of the budget needs to be spent on quality assurance activity. It’s investing in the right tools, ones that handle the interfaces between teams and help them collaborate. It’s being able to easily create and share relevant information. It’s Application Lifecycle Management.

As this round of bank bashing starts to run out of steam and Bob Diamond and Marcus Agius take the heat off Stephen Hester with their amusing and heart-warming “I’m Spartacus” routine following the Barclays Libor scandal, I urge IT professionals everywhere to learn from the mistakes made at RBS. None of us can claim that we have worked in organisations that don’t have some (if not all) of the same problems. Let’s improve the way we deliver software with a better understanding of the risk of getting it wrong and greater focus on quality and stability. Isn’t that what users want after all?
