Risk-Based Testing - How testing can help you make the right decision
There are massive pressures associated with any IT project:
- Reputations (individuals & the company)
- Spend already committed
- Old systems failing & not fit for purpose
- Cost of maintaining existing systems
It comes down to the pressure to 'get it right' versus the pressure to 'get it in', and cost versus risk versus reward.
There are pros and cons on both sides. There’s no right answer. Getting a solution 100% right could take an eternity and cost a fortune - way in excess of the benefits the system is expected to deliver. Maybe the risk of anything going wrong is minimal and the impact negligible. Then again, just chucking it in could be disastrous. The risk of something going wrong could be great and the impact catastrophic.
So, how do we make sure we are making the right call?
Testing can help - or rather, properly planned testing that is fully supported can help. How? Well, where a project got it right, the value of testing is difficult to quantify and difficult to visualise. Sometimes the impact is far more real when you see how badly things can go wrong, as in the case of TSB / Sabadell. Briefly, I will outline what happened and the reasons, then how testing could have helped prevent the outcome.
All of the information in this blog is currently widely available through different news outlets. I will be using the TSB case study as an example. All the links are available in the footer.
When Banco Sabadell bought TSB from Lloyds in 2015, they instigated a £450 million, three-year project to move 1.3 billion TSB customer records from Lloyds' old legacy system onto a version of their Proteo system.
But a migration expected to save £160 million a year turned into a disaster that left up to 1.9 million customers locked out of their accounts, mortgage accounts vanishing, small businesses unable to pay staff and debit cards ceasing to work.
So what happened?
Banco Sabadell put into motion a plan it had successfully executed many times in the past for other banks it had acquired. The trouble was that Lloyds ran a vast legacy system, unlike anything the plan had been applied to before.
The time period to develop the new system and migrate TSB’s customers over to it was 18 months. Sabadell was warned that this timeline was high risk and was likely to cost far more than the £450m ‘budget’.
Indeed, the switchover was pushed out to April 2018, which incurred (aside from the extra project costs) a payment of approximately £70m to allow TSB to continue operating on Lloyds' legacy system.
By April 2018, the trial migrations had still not been fully tested, and senior TSB management faced a dilemma: accept the risk that testing was not complete, or postpone go-live by several weeks, report another slippage and incur a further cost overrun that could run into tens of millions.
They decided to gamble...
Outcome & Aftermath
On April 23 2018, Sabadell announced that the development of Proteo4UK was complete and that 5.4m clients had been successfully migrated over to the new system.
However, only hours later, problems began to emerge and by the end of 2019, the failure had cost TSB approx £370m in post-migration charges. The ramifications in terms of lost business were incalculable and eventually, their CEO was forced to resign.
What went wrong?
An independent inquiry from IBM concluded that 'a catalogue of failures' at TSB and Banco Sabadell had led to the disaster. Specifically, a failure to conduct the right testing, which made it impossible to identify the problems causing the downtime when the system began to fail.
So, how could testing have helped?
When you are deciding if a project is ready to go live, the answers to key questions can be much better informed by the right testing. Have you committed the right amount of resource (time and money) to planning and testing? Careful planning and control are essential, and budgets and timescales must be realistic for the effort required. Early involvement of testing can help scope the required effort and identify the support infrastructure that needs to be in place to determine whether the solution is fit for purpose (environments, data, phases and personnel).
The larger and more complex the solution, the greater the requirement for testing. At a minimum, testing should include functional testing, user acceptance testing, non-functional testing and performance testing.
Was the right type of testing done? Was the testing comprehensive enough? For instance, functional testing can vary in depth between an existing system enhancement and a newly developed piece of software. As a general rule, points of change or interfaces between systems are usually where issues will occur.
In the case we've been looking at, the migration involved moving accounts from one mature and well-tested system to another, so TSB perhaps thought testing was relatively low risk. That element of it probably was. However, the two systems had very different code bases, meaning the amount of functional testing required should have been similar to that required for new software.
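The general rule above - that points of change and interfaces carry the most risk - is the heart of risk-based testing, and it can be made concrete with a simple likelihood-times-impact score. A minimal sketch, where every area name and score is invented for illustration:

```python
# Hypothetical risk-based test prioritisation: score each area of the
# solution by likelihood of failure x business impact, then test the
# highest-scoring areas first and most deeply. Names and scores are
# invented for illustration, not taken from the TSB project.
from dataclasses import dataclass

@dataclass
class TestArea:
    name: str
    likelihood: int  # 1 (unlikely) .. 5 (almost certain)
    impact: int      # 1 (negligible) .. 5 (catastrophic)

    @property
    def risk_score(self) -> int:
        return self.likelihood * self.impact

areas = [
    TestArea("Account balance migration", likelihood=4, impact=5),
    TestArea("Online banking login", likelihood=3, impact=5),
    TestArea("Marketing preference flags", likelihood=2, impact=1),
]

# Highest-risk areas get the deepest testing and the earliest slots.
for area in sorted(areas, key=lambda a: a.risk_score, reverse=True):
    print(f"{area.risk_score:>2}  {area.name}")
```

The scores are crude, but even a crude ranking forces the conversation about where the testing budget should go.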
Also, if a system is likely to have heavy user traffic or spikes in usage, then performance testing is a necessity. In the TSB / Sabadell case, IBM specifically pointed to insufficient performance testing as being a contributing factor to the disaster.
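To illustrate the kind of check performance testing provides, here is a minimal load-test sketch. The handler, request count and latency target are all invented stand-ins for a real system under test:

```python
# Minimal performance-test sketch: fire N concurrent "requests" at a
# handler and check the ~95th-percentile latency against a target.
# `handle_request` simulates the system under test; in reality it would
# be a real call to the service.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request() -> float:
    start = time.perf_counter()
    time.sleep(0.01)  # simulate work done by the system under test
    return time.perf_counter() - start

def run_load_test(requests: int = 100, workers: int = 20) -> float:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(lambda _: handle_request(), range(requests)))
    # quantiles(n=20) returns 19 cut points; index 18 is ~the 95th percentile
    return statistics.quantiles(latencies, n=20)[18]

p95 = run_load_test()
print(f"p95 latency: {p95 * 1000:.1f} ms")
assert p95 < 0.5, "performance target missed - do not go live"
```

Spikes in usage are simulated by raising `workers` relative to the service's capacity; the point is that the pass/fail target is agreed before go-live, not after.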
What don't you know about the solution? What can you say with real certainty works, and what else that will go live with it are you unsure about? The correct types and levels of testing - if supported and listened to - will identify what works and what doesn't, how much testing effort will be required to give full confidence, and the risks of going live in the current state.
Have the risks been fully investigated? Was the risk policy appropriate? There must be a clear policy on risk, and the policy must be stuck to. What criteria must be met for go-live? Once these criteria have been agreed, the amount of testing required can be determined. If the tests are not passed, there must be the discipline not to attempt go-live, even if it will cost more to carry on testing.
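That go-live discipline can be sketched as a simple gate: every agreed criterion must pass, or the release is blocked. The criteria and thresholds below are illustrative, not TSB's:

```python
# Hypothetical go-live gate. Criterion names and thresholds are invented
# for illustration; real ones come from the project's agreed risk policy.
checks = [
    ("Functional pass rate >= 98%", 0.992 >= 0.98),
    ("Zero open severity-1 defects", True),
    ("Performance test targets met", True),
]

def go_live_approved(checks) -> bool:
    # The discipline: if any agreed criterion is unmet, do not go live -
    # even if continuing to test costs more.
    return all(passed for _, passed in checks)

print("GO" if go_live_approved(checks) else "NO-GO")
```

The value is not in the code but in the rule it encodes: the criteria are fixed before the pressure to ship arrives, so a slipped deadline cannot quietly lower the bar.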
Do you have control over or clear ‘lines of sight’ to ALL suppliers involved in the final solution?
The right types and levels of testing will unearth all steps in the process and the people/roles that are involved. Testing will work with key personnel (including third-party suppliers) to ensure the solution is fully operational, that all possible paths through it have been identified and executed, that any known issues are communicated to the wider project and that key contacts are in place.
Have you got the right people involved?
If UAT is properly supported and executed, the key personnel will have been involved. They will have had the chance to flag any concerns and to ensure that the solution is not signed off for live unless it is ready.
If required, do we have clearly defined rescue workarounds?
A workaround may seem clunky and not in tune with 'transformative programmes', but it is usually rather cheaper than a failure! Equally, working out full solutions for scenarios that rarely occur can add unnecessary time and expense to a project.
Comprehensive testing will identify edge cases or negative cases that may not fit the ‘critical path’ but could occur. From there, workarounds can be put in place to ensure that they do not cause issues when the new system is switched on.
Is a phased deployment practical?
Where a phased approach is being considered, the more comprehensive the testing, the more likely you are to identify whether it can be safely implemented, to spot natural break points, and to define the workarounds needed to cover scenarios not included in the first go-live.
If required, do we have a robust roll-back plan?
Any roll-back plan will require confidence in the solution and its ability to be undone, if required, without causing further issues. OAT, roll-out dress rehearsals and disaster-recovery non-functional testing can all help identify key points in the go-live process where roll-backs could occur, when they should occur, how they are to be executed and their effect on the live system.
Could TSB / Sabadell answer those key questions?
The independent IBM inquiry would suggest not. Insufficient planning and testing were cited as major factors in the disaster. IBM also implied that TSB should have given more consideration to a phased migration.
Some people struggle to be convinced of the benefits of testing: it costs time and resources, and the benefits can seem intangible. But I wonder how much it would have cost TSB to do enough testing to answer those questions?
I suspect a lot less than £370m...
PYMNTS (2019, November 19). Report: TSB Bank IT Crash Caused By Inadequate Testing. Retrieved March 03, 2022, from (Website Link).
Henrico Dolfing (2019, March 13). Case Study 2: The Epic Meltdown of TSB Bank. Retrieved March 03, 2022, from (Website Link).
Ruth Sunderland & Alex Hawkes, The Mail on Sunday / This is MONEY.co.uk (2018, April 28). Beleaguered TSB is heading for legal showdown with Sabadell over botched tech upgrade. Retrieved March 03, 2022, from (Website Link).
Angela Monaghan, The Guardian (2018, June 06). Timeline of trouble: how the TSB IT meltdown unfolded. Retrieved March 03, 2022, from (Website Link).
BBC News (2019, November 19). TSB lacked common sense before IT meltdown, says report. Retrieved March 03, 2022, from (Website Link).
Lawrence White & Iain Withers (2019, November 19). TSB and parent Sabadell heavily criticized for IT crash that locked two million out of accounts. Retrieved March 03, 2022, from (Website Link).