Wednesday, March 13, 2019

Case Study: The epic meltdown of TSB Bank

Case Study: The epic meltdown of TSB Bank
With clients locked out of their bank accounts, mortgage accounts vanishing, small businesses reporting that they could not pay their staff and reports of debit cards ceasing to work, the TSB Bank computer crisis of April 2018 has been one of the worst in recent memory. The bank’s CEO, Paul Pester, admitted in public that the bank was “on its knees” and that it faces a compensation bill likely to run to tens of millions of pounds.

But let’s start from the beginning. First, we’ll examine the background of what led to TSB’s ill-fated system migration. Then, we’ll look at what went wrong and how it could have been prevented.

September 2013

When TSB split from Lloyds Banking Group (LBG) in September 2013, a move forced by the EU as a condition of its taxpayer bailout in 2008, a clone of the original group’s computer system was created and rented to TSB for £100m a year.

That banking system was a combination of many old systems for TSB, BOS, Halifax, Cheltenham & Gloucester, and others that had resulted from the integration of HBOS with Lloyds as a result of the banking crisis.

Under this arrangement, LBG held all the cards. It controlled the system and offered it as a costly service to TSB when it was spun off from LBG.

March 2015

When the Spanish Banco Sabadell bought TSB for £1.7bn in March 2015, it put into motion a plan it had successfully executed in the past for several other smaller banks it had acquired: merge the bank’s IT systems with its own Proteo banking software and, in doing so, save millions in IT costs.

Sabadell was warned in 2015 that its ambitious plan was high risk and that it was likely to cost far more than the £450m Lloyds was contributing to the effort.

“It is not overly generous as a budget for that scale of migration,” John Harvie, a director of the global consultancy firm Protiviti, told the Financial Times in July 2015. But the Proteo system was designed in 2000 specifically to handle mergers such as that of TSB into the Spanish group, and Sabadell pressed ahead.

Summer 2016

By the summer of 2016, work on developing the new system was meant to be well underway and December 2017 was set as a hard-and-fast deadline for delivery.

The time period to develop the new system and migrate TSB over to it was just 18 months. TSB people were saying that Sabadell had done this many times in Spain. But tiny Spanish local banks are not sprawling LBG legacy systems.

To make matters worse, the Sabadell development team did not have full control—and therefore a full understanding—of the system they were trying to migrate client data and systems from because LBG was still the supplier.

Autumn 2017

By the autumn the system was not ready. TSB announced a delay, blaming the possibility of a UK interest rate rise—which did materialize—and the risk that the bank might leave itself unable to offer mortgage quotes over a crucial weekend.

Sabadell pushed back the switchover to April 2018 to try to get the system working. It was an expensive delay because the fees TSB had to pay to LBG to keep using the old IT system were still clocking up: CEO Pester put the bill at £70m.

April 2018

On April 23, Sabadell announced that Proteo4UK—the name given to the TSB version of the Spanish bank’s IT system—was complete, and that 5.4m clients had been “successfully” migrated over to the new system.

Josep Oliu, the chairman of Sabadell, said: “With this migration, Sabadell has proven its technological management capacity, not only in national migrations but also on an international scale.”

The team behind the development were celebrating. In a LinkedIn post that has since been removed, those involved in the migration were describing themselves as “champions,” a “hell of a team,” and were pictured raising glasses of bubbly to cheers of “TSB transfer done and dusted.”

However, only hours after the switch was flicked, systems crumpled and up to 1.9m TSB clients who use internet and mobile banking were locked out.

Twitter had a field day as clients frustrated by the inability to access their accounts or get through to the bank’s call centers started to vent their anger.

Clients reported receiving texts saying their cards had been used abroad; that they had discovered thousands of pounds in their accounts they did not have; or that mortgage accounts had vanished, multiplied or changed currency.

One bemused account holder showed his TSB banking app recording a direct debit paid to Sky Digital 81 years from now. Some saw details of other people’s accounts, and holidaymakers complained that they had been left unable to pay restaurant and hotel bills.

TSB, to clients’ fury, at first insisted the problems were only intermittent. At 3:40 a.m. on Wednesday, April 25, Pester tweeted that the system was “up and running,” only to be forced to apologize the next day and admit it was actually only running at 50 percent capacity.

On Thursday he admitted the bank was on its knees, announced that he was personally seizing control of the attempts to fix the problem from his Spanish masters, and had hired a team from IBM to do the job. Sabadell said it would probably be next week before normal service returned.

The financial ombudsman and the Financial Conduct Authority have launched investigations. The bank has been forced to cancel all overdraft fees for April and raise the interest rate it pays on its classic current account in a bid to stop disillusioned clients from taking their business elsewhere.

The software Pester had boasted about in September of being 2,500 man-years in the making, with more than 1,000 people involved, has been a client service disaster that will cost the bank millions and tarnish its reputation for years.

The basic principles of a system migration

The two main things to avoid in a system migration are an unplanned outage of the service for users and loss of data, either in the sense that unauthorized users have access to data, or in the sense that data is destroyed.

In most cases, outages cannot be justified during business hours, so migrations must typically take place within the limited timeframe of a weekend. To be sure that a migration over a weekend will run smoothly, it is normally necessary to perform one or more trial migrations in non-production environments, that is, migrations to a copy of the live system which is not used by or accessible to real users. The trial migration will expose any problems with the migration process, and these problems can be fixed without any risk of affecting the service to users.

Once the trial migration is complete, has been tested, and any problems with it have been fixed, the live migration can be attempted. For a system of any complexity, the go-live weekend must be carefully pre-planned hour by hour, ensuring that all the correct people are available and know their roles.

As part of the plan, a rollback plan should be put in place. The rollback plan is a planned, rapid way to return to the old system in case anything should go wrong during the live migration. One hopes not to have to use it because the live migration should not normally be attempted unless there has been a successful trial migration and the team is confident that all the problems have been ironed out.

On the go-live weekend, the live system is taken offline, and a period of intense, often round-the-clock, activity begins, following the previously made plan. At a certain point, while there is still time to trigger the rollback plan, a meeting will be held to decide whether to go live with the migration or not (a “go/no go” meeting).

If the migration work has gone well, and the migrated system is passing basic tests (there is no time at that point for full testing; full testing should have been done on the trial migration), the decision will be to go live. If not, the rollback plan will be triggered and the system returned to its previous state, that which was obtained before the go-live weekend.

If the task of migration is so great that it is difficult to fit it into a weekend, even with very good planning and preparation, it may be necessary to break it into phases. The data or applications are broken down into groups which are migrated separately.

This approach reduces the complexity of each group migration compared to one big one, but it also has disadvantages. If the data or applications are interdependent, it may cause performance issues or other technical problems if some are migrated while others remain, especially if the source and destination are physically far apart.

A phased migration will also normally take longer than a single large migration, which will add cost, and it will be necessary to run two data centers in parallel for an extended period, which may add further cost. In TSB’s case, it may have been possible to migrate the clients across in groups, but it is hard to be sure without knowing its systems in detail.

Testing a system migration

Migrations can be expensive because it can take a great deal of time to plan and perform the trial migration(s). With complex migrations, several trial migrations may be necessary before all the problems are ironed out. If the timing of the go-live weekend is tight, which is very likely in a complex migration, it will be necessary to stage some timed trial migrations—“dress rehearsals.” Dress rehearsals are to ensure that all the activities required for the go-live can be performed within the timeframe of a weekend.

Trial migrations should be tested. In other words, once a trial migration has been performed, the migrated system, which will be hosted in a non-production environment, should be tested. The larger and more complex the migration, the greater the requirement for testing. Testing should include functional testing, user acceptance testing and performance testing.

Functional testing of a migration is somewhat different from functional testing of a newly developed piece of software. In a migration, the code itself may be unchanged, and if so there is little value in testing code which is known to work. Instead, it is important to focus the testing on the points of change between the source environment and the target. The points of change typically include the interfaces between each application and whatever other systems it connects to.

In a migration, there is often change in interface parameters used by one system to connect to another, such as IP addresses, database connection strings, and security credentials. The normal way to test the interfaces is to exercise whatever functionality of the application uses the interfaces. Of course, if code changes are necessary as part of a migration, the affected systems should be tested as new software.

In the case of TSB, the migration involved moving client bank accounts from one banking system to another. Although both the source and target systems were mature and well-tested, they had different code bases, and it is likely that the amount of functional testing required would have approached that required for new software.

User acceptance testing is functional testing performed by users. Users know their application well and therefore have an ability to spot errors quickly, or see problems that IT professionals might miss. If users test a trial migration and express themselves satisfied, it is a good sign, but not adequate on its own because, amongst other things, a handful of user acceptance testers will not test performance.

Performance testing checks that the system will work fast enough to satisfy its requirements. In a migration the normal requirement is for there to be little or no performance degradation as a result of the migration. Performance testing is expensive because it requires a full-size simulation of the systems under test, including a full data set.

If the data is sensitive, and in TSB’s case it was, it will be necessary, at significant time and cost, to protect the data by security measures as stringent as those protecting the live data, and sometimes by anonymizing the data. In the case of TSB, the IBM inquiry into what went wrong identified insufficient performance testing as one of the problems.

What went wrong?

Where did it go wrong for TSB? The bank was attempting a very complex operation. There would have been a team of thousands drawn from internal staff, staff from IT service companies, and independent contractors. Their activities would have had to be carefully coordinated, so that they performed the complex set of tasks in the right order to the right standard. Many of them would have been rare specialists. If one such specialist is off sick, it can block the work of hundreds of others. One can imagine that, as the project approached go-live, having been delayed several times before, the trial migrations were largely successful but not perfect.

The senior TSB management would have been faced with a dilemma of whether to accept the risks of doing the live migration without complete testing in the trials, or to postpone go-live by several weeks and report to the board another slippage, and several tens of millions of pounds of further cost overrun. They gambled and lost.

How could TSB have done things differently?

Firstly, a migration should have senior management backing. TSB clearly had it, but with smaller migrations, it is not uncommon for the migration to be some way down senior managers’ priorities. This can lead to system administrators or other actors, whose reporting lines lead elsewhere from those doing the migration, frustrating key parts of the migration because their managers are not ordering them or paying them to cooperate.

Secondly, careful planning and control is essential. It hardly needs saying that it is not possible to manage a complex migration without careful planning and those managing the migration must have an appropriate level of experience and skill. In addition, however, the planning must follow a sound basic approach that includes trial migrations, testing, and rollback plans as described above. While the work is going on, close control is important. Senior management must stay close to what is happening on the ground and be able to react quickly, for example by fast-tracking authorizations, if delays or blockages occur.

Thirdly, there must be a clear policy on risk, and the policy should be stuck to. What criteria must be met for go-live? Once this has been determined, the amount of testing required can be determined. If the tests are not passed, there must be the discipline not to attempt the migration, even if it will cost much more.

Finally, in complex migrations, a phased approach should be considered.

Conclusion

In the case of TSB Bank, the problems that occurred after the live migration were either not spotted in testing, or they were spotted but the management decided to accept the risk and go live anyway. If they were not spotted, it would indicate that testing was not comprehensive enough—IBM specifically pointed to insufficient performance testing. That could be due to a lack of experience among the key managers. If the problems were spotted in testing, it implies weak go-live criteria and/or an inappropriate risk policy. IBM also implied that TSB should have performed a phased migration.

It may be that the public will never fully know what caused TSB’s migration to go wrong, but it sounds like insufficient planning and testing were major factors. Sensitive client data was put at risk, and clients suffered long unplanned outages, resulting in CEO Paul Pester being summoned to the Treasury select committee and the Financial Conduct Authority launching an investigation into the bank. Ultimately Pester lost his job.

When migrating IT systems in the financial sector, cutting corners is dangerous. Ultimately, TSB’s case goes to show that the consequences can be dire. For success one needs to follow some basic principles, use the right people, and be prepared to allocate sufficient time and money to planning and testing. Only then can it be ensured a successful system migration will take place.
Posted on Wednesday, March 13, 2019 by Henrico Dolfing

0 comments: