Monday, March 25, 2019

Cloud computing threats, vulnerabilities and risks

As pointed out in a previous article on cloud computing project management, one thing that has changed a lot with the rise of cloud usage is security.

At a high level, your cloud computing environment experiences the same threats as your traditional data center environment. Your threat picture is more or less the same.

Both environments run software, software has vulnerabilities, and adversaries try to exploit those vulnerabilities.

But unlike your systems in a traditional data center, in cloud computing, responsibility for mitigating the risks that result from these software vulnerabilities is shared between the provider and you, the customer.

For that reason, you must understand the division of responsibilities and trust that the provider will hold up their end of the bargain.

This article discusses the 12 biggest threats and vulnerabilities for a cloud computing environment. It splits these into a set of cloud-unique and a set of shared cloud/on-premises vulnerabilities and threats. But before we start, we have to clarify some definitions, because some of the most commonly mixed-up security terms are actually threat, vulnerability, and risk.

Assets, Threats, Vulnerabilities, and Risk

While it might be unreasonable to expect those outside the security industry to understand the differences, more often than not, many in the business use terms such as “asset,” “threat,” “vulnerability,” and “risk” incorrectly or interchangeably. So maybe providing some definitions for those terms will help to make the rest of the article clearer.

Asset – People, property, and information. People may include employees and customers along with other invited persons such as contractors or guests. Property assets consist of both tangible and intangible items that can be assigned a value. Intangible assets include reputation and proprietary information. Information may include databases, software code, critical company records, and many other intangible items. An asset is what we’re trying to protect.

Threat – Anything that can exploit a vulnerability, intentionally or accidentally, and obtain, damage, or destroy an asset. A threat is what we’re trying to protect against.

Vulnerability – Weaknesses or gaps in a security program that can be exploited by threats to gain unauthorized access to an asset. A vulnerability is a weakness or gap in our protection efforts.

Risk – The potential for loss, damage or destruction of an asset as a result of a threat exploiting a vulnerability.

Why is it important to understand the difference between these terms? If you don’t understand the difference, you’ll never understand the true risk to assets. You see, when conducting a risk assessment, the formula used to determine risk is:

Asset + Threat + Vulnerability = Risk
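The formula is best read as conceptual rather than arithmetic: a risk exists only where an asset, a threat, and a vulnerability coincide. The following is a minimal Python sketch of that reading. The class names, the function `find_risks`, and the sample entries are all hypothetical and purely illustrative; they are not taken from any real assessment method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vulnerability:
    name: str
    asset: str  # the asset this weakness exposes


@dataclass(frozen=True)
class Threat:
    name: str
    exploits: str  # name of the vulnerability this threat can exploit


def find_risks(assets, threats, vulnerabilities):
    """Return (asset, threat, vulnerability) triples where all three line up."""
    risks = []
    for vuln in vulnerabilities:
        if vuln.asset not in assets:
            continue  # no asset we care about is exposed, so no risk
        for threat in threats:
            if threat.exploits == vuln.name:
                risks.append((vuln.asset, threat.name, vuln.name))
    return risks


# Hypothetical sample data, for illustration only.
assets = {"customer database"}
vulnerabilities = [
    Vulnerability("weak admin password", asset="customer database"),
    Vulnerability("unpatched web server", asset="marketing site"),
]
threats = [
    Threat("credential stuffing", exploits="weak admin password"),
    Threat("ransomware", exploits="open file share"),
]

risks = find_risks(assets, threats, vulnerabilities)
for asset, threat, vuln in risks:
    print(f"Risk: {threat} could exploit '{vuln}' to reach {asset}")
```

In this sample only one combination qualifies. The unpatched web server exposes something that is not on the asset list, and the ransomware threat has no matching vulnerability, so neither produces a risk.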

Cloud characteristics

While we're defining terms, let’s define cloud computing as well. The most meaningful way to do so in a security context is, in my opinion, by the five cloud computing characteristics published by the National Institute of Standards and Technology (NIST). They are:

1) On-demand self-service: A customer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service provider.

2) Broad network access: Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, tablets, laptops and workstations).

3) Resource pooling: The provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to customer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state or datacenter). Examples of resources include storage, processing, memory and network bandwidth.

4) Rapid elasticity: Capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand. To the customer, the capabilities available for provisioning often appear to be unlimited and can be appropriated in any quantity at any time.

5) Measured service: Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth and active user accounts). Resource usage can be monitored, controlled and reported, providing transparency for the provider and customer.

Cloud-specific threats and vulnerabilities

The following vulnerabilities are a result of a cloud service provider’s implementation of the five cloud computing characteristics described above. These vulnerabilities do not exist in classic IT data centers.

#1 Reduced visibility and control

When transitioning your assets/operations to the cloud, your organization loses some visibility and control over those assets/operations. When using external cloud services, the responsibility for some of the policies and infrastructure moves to the provider.

The actual shift of responsibility depends on the cloud service model(s) used, leading to a paradigm shift for customers in relation to security monitoring and logging. Your organization needs to perform monitoring and analysis of information about applications, services, data, and users, without using network-based monitoring and logging, which is available for your on-premises IT.

#2 On-demand self-service

Providers make it very easy to provision new services. The on-demand self-service provisioning features of the cloud enable your organization's employees to provision additional services from the provider without IT consent. This practice of using software in an organization that is not supported by the organization's IT department is commonly referred to as shadow IT.

Due to the lower costs and ease of implementing platform as a service (PaaS) and software as a service (SaaS) products, the probability of unauthorized use of cloud services increases. Services provisioned or used without IT's knowledge present risks to an organization. The use of unauthorized cloud services could result in an increase in malware infections or data exfiltration since your organization is unable to protect resources it does not know about. The use of unauthorized cloud services also decreases your organization's visibility and control of network and data.
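One simple way to surface shadow IT is to compare the services you can observe in use against the list IT has approved. The sketch below assumes you already have both lists as plain service names, for example from expense reports or proxy logs; that data source, the function name `find_unapproved_services`, and the sample names are assumptions for illustration.

```python
def find_unapproved_services(observed, approved):
    """Return observed service names missing from the approved list, sorted."""
    approved_normalized = {name.strip().lower() for name in approved}
    observed_normalized = {name.strip().lower() for name in observed}
    return sorted(observed_normalized - approved_normalized)


# Hypothetical sample data.
approved = ["Office 365", "Salesforce"]
observed = ["salesforce", "Dropbox", "office 365", "Trello", "dropbox"]

unapproved = find_unapproved_services(observed, approved)
print(unapproved)
```

Names are lower-cased and de-duplicated before comparison, so spelling variants of the same service are reported only once. A real inventory would need a more robust way of matching service identities than string comparison.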

#3 Internet-accessible management APIs 

Providers expose a set of application programming interfaces (APIs) that customers use to manage and interact with cloud services (also known as the management plane). Organizations use these APIs to provision, manage, orchestrate, and monitor their assets and users. These APIs can contain the same software vulnerabilities as an API for an operating system, library, etc. Unlike management APIs for on-premises computing, provider APIs are accessible via the Internet, exposing them more broadly to potential exploitation.

Threat actors look for vulnerabilities in management APIs. If discovered, these vulnerabilities can be targeted for successful attacks, and an organization’s cloud assets can be compromised. From there, attackers can use organization assets to perpetrate further attacks against other customers of the provider.

#4 Multi-tenancy 

Exploitation of system and software vulnerabilities within a provider's infrastructure, platforms, or applications that support multi-tenancy can lead to a failure to maintain separation among tenants. This failure can be used by an attacker to gain access from one organization's resource to another user's or organization's assets or data. Multi-tenancy increases the attack surface, leading to an increased chance of data leakage if the separation controls fail.

This attack can be accomplished by exploiting vulnerabilities in the provider's applications, hypervisor, or hardware, by subverting logical isolation controls, or by attacking the provider's management API.

No reports of an attack based on logical separation failure have been identified; however, proof-of-concept exploits have been demonstrated.

#5 Data deletion 

Threats associated with data deletion exist because the consumer has reduced visibility into where their data is physically stored in the cloud and a reduced ability to verify the secure deletion of their data. This risk is concerning because the data is spread over a number of different storage devices within the provider's infrastructure in a multi-tenancy environment. In addition, deletion procedures may differ from provider to provider. Organizations may not be able to verify that their data was securely deleted and that remnants of the data are not available to attackers. This threat increases as a customer uses more provider services.

Cloud and on-premises threats and vulnerabilities

The following are threats and vulnerabilities that apply to both cloud and on-premises IT data centers that organizations need to address.

#6 Credentials are stolen

If an attacker gains access to one of your user's cloud credentials, the attacker can have access to the provider's services to provision additional resources (if credentials allowed access to provisioning), as well as target your organization's assets. The attacker could leverage cloud computing resources to target your organization's administrative users, other organizations using the same provider, or the provider's administrators. An attacker who gains access to a provider administrator's cloud credentials may be able to use those credentials to access the customers’ systems and data.

Administrator roles vary between a provider and an organization. The provider administrator has access to the provider network, systems, and applications (depending on the service) of the provider's infrastructure, whereas the customer's administrators have access only to the organization's cloud implementations. In essence, the provider administrator has administration rights over more than one customer and supports multiple services.

#7 Vendor lock-in 

Vendor lock-in becomes an issue when your organization considers moving its assets/operations from one provider to another. Your organization will probably discover that the cost/effort/schedule time necessary for the move is much higher than initially considered due to factors such as non-standard data formats, non-standard APIs, and reliance on one provider's proprietary tools and unique APIs.

This issue increases in service models where the provider takes more responsibility. As a customer uses more features, services, or APIs, the exposure to a provider's unique implementations increases. These unique implementations require changes when a capability is moved to a different provider. If a selected provider goes out of business, it becomes a major problem since data can be lost or may not be able to be transferred to another provider in a timely manner.

#8 Increased complexity 

Migrating to the cloud can introduce complexity into IT operations. Managing, integrating, and operating in the cloud may require that the organization's existing IT staff learn a new model. IT staff must have the capacity and skill level to manage, integrate, and maintain the migration of assets and data to the cloud in addition to their current responsibilities for on-premises IT.

Key management and encryption services become more complex in the cloud. The services, techniques, and tools available to log and monitor cloud services typically vary across providers, further increasing complexity. There may also be emergent threats/risks in hybrid cloud implementations due to technology, policies, and implementation methods, which add complexity.

This added complexity leads to an increased potential for security gaps in an organization's cloud and on-premises implementations.

#9 Insider abuse 

Insiders, such as staff and administrators for both organizations and providers, who abuse their authorized access to the organization's or provider's networks, systems, and data are uniquely positioned to cause damage or exfiltrate information.

The impact is most likely worse when using infrastructure as a service (IaaS) due to an insider's ability to provision resources or perform nefarious activities that require forensics for detection. These forensic capabilities may not be available with cloud resources.

#10 Lost data 

Data stored in the cloud can be lost for reasons other than malicious attacks. Accidental deletion of data by the cloud service provider or a physical catastrophe, such as a fire or earthquake, can lead to the permanent loss of customer data. The burden of avoiding data loss does not fall solely on the provider's shoulders. If a customer encrypts its data before uploading it to the cloud but loses the encryption key, the data will be lost. In addition, inadequate understanding of a provider's storage model may result in data loss. Organizations must consider data recovery and be prepared for the possibility of their provider being acquired, changing service offerings, or going bankrupt.

This threat increases as an organization uses more provider services. Recovering data from a provider may be easier than recovering it on premises because a service level agreement (SLA) designates availability/uptime percentages. These percentages should be investigated when your organization selects a provider.
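When comparing SLAs, it helps to translate an uptime percentage into the downtime it still permits. The following is a small sketch of that arithmetic; the function name `allowed_downtime_hours` and the sample percentages are assumptions chosen for illustration, not figures from any particular provider.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime a given availability percentage still permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_hours * (1 - availability_percent / 100)


for sla in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% availability permits about {hours:.2f} hours of downtime per year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten. Keep in mind that an availability figure describes uptime only; it says nothing by itself about whether lost data can be recovered.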

#11 Provider supply chain 

If your provider outsources parts of its infrastructure, operations, or maintenance, these third parties may not satisfy/support the requirements that the provider is contracted to provide for your organization. Your organization needs to evaluate how the provider enforces compliance and check to see if the provider flows its own requirements down to third parties. If the requirements are not being levied on the supply chain, then the threat to your organization increases.

This threat increases as your organization uses more provider services and is dependent on individual providers and their supply chain policies.

#12 Insufficient due diligence 

Organizations migrating to the cloud often perform insufficient due diligence. They move data to the cloud without understanding the full scope of doing so, the security measures used by the provider, and their own responsibility to provide security measures. They make decisions to use cloud services without fully understanding how those services must be secured.

Conclusion

Although the level of threat in a cloud computing environment is similar to that of a traditional data center, there is a key difference in who is responsible for mitigating the risk. It is important to remember that cloud service providers use a shared responsibility model for security. Your provider accepts responsibility for some aspects of security. Other aspects of security are shared between your provider and you, the customer. And some aspects of security remain the sole responsibility of the customer. Successful cloud security depends on both parties knowing and meeting all their responsibilities effectively. The failure of organizations to understand or meet their responsibilities is a leading cause of security incidents in cloud computing environments.

Monday, March 18, 2019

Project Portfolio Management: Theory vs. Practice

If you are responsible for managing portfolios of technology programs and projects, your success in maximizing business outcomes with finite resources is vital to your company’s future in a fast-changing and digital world.

Project portfolio management is the art and science of making decisions about investment mix, operational constraints, resource allocation, project priority and schedule. It is about understanding the strengths and weaknesses of the portfolio, predicting opportunities and threats, matching investments to objectives, and optimizing trade-offs encountered in the attempt to maximize return (i.e., outcomes over investments) at a given appetite for risk (i.e., uncertainty about return).

Most large companies have a project portfolio management process in place, and they mostly follow the traditional project portfolio management process as put on paper by PMI. This process is comprehensible and stable by nature.

Even better, it has the appearance of a marvelous mechanical system that can be followed in a plannable, stable, and reproducible manner. In the end, the project with the greatest strategic contributions always wins the battle for the valuable resources.

Unfortunately, this process does not work well in the real world, despite its apparent elegance. Ultimately, the real world is characterized by uncertainty, difficulties, ever-changing market environments, and, of course, people, and people do not function like machines.

When we look at technology projects, the primary goal of portfolio executives is to maximize the delivery of technology outputs within budget and schedule. This IT-centric mandate emphasizes output over outcome, and risk over return.

On top of this, the traditional IT financial framework is essentially a cost-recovery model that isn’t suitable for portfolio executives to articulate how to maximize business outcomes on technology investments.

As a result, portfolio management is marginalized to a bureaucratic overhead and a nice-to-have extension of the program and project management function.

So yes, in theory most large organizations have a project portfolio management function in place, but in practice it is far from effective.

Below are 11 key observations I have made in the last few years regarding effective project portfolio management:

1) No data and visibility.

The first theoretical benefit of effective project portfolio management concerns its ability to drive better business decisions. To make good decisions you need good data, and that’s why visibility is so crucial, both from a strategic, top-down perspective and from a tactical, bottom-up perspective.

Anything that can be measured can be improved. However, organizations don’t always do sufficient monitoring. Few organizations actually track project and portfolio performance against their own benchmarks, nor do they track dependencies.

Worse, strategic multiyear initiatives are the least likely to be tracked in a quantitative, objective manner. For smaller organizations, the absence of such a process might be understandable, but for a large organization, tracking is a must.

Not monitoring project results creates a vicious circle: If results are not tracked, then how can the portfolio management and strategic planning process have credibility? It is likely that it doesn’t, and over time, the risk is that estimates are used more as a means of making a project appear worthy of funding than as a mechanism for robust estimation of future results. Without tracking, there is no mechanism to make sure initial estimates of costs and benefits are realistic.

When you have a good handle on past project metrics, it makes it much easier to predict future factors like complexity, duration, risks, expected value, etc. And when you have a good handle on what is happening in your current project portfolio, you can find out which projects are not contributing to your strategy, are hindering other more important projects, or are not contributing enough value.

And once you have this data, don’t keep it in a silo only visible for a select group. All people involved in projects should be able to use this data for their own projects.

2) Many technology projects should not have been started at all.

Big data, blockchain, artificial intelligence, virtual reality, augmented reality, robotics, 5G, machine learning... Billions and billions are poured into projects around these technologies, and for most organizations, not much is coming out of it.

And this is not because these projects are badly managed. Quite simply, it is because they should not have been started in the first place.

I believe that one of the main reasons that many innovative technology projects are started comes down to a fear of missing out, or FOMO.

You may find the deceptively simple but powerful questions in “Stop wasting money on FOMO technology innovation projects” quite useful in testing and refining technology project proposals, clarifying the business case, building support, and ultimately persuading others why they should invest scarce resources in an idea or not.

3) Many projects should have been killed much earlier.

Knowing when to kill a project and how to kill it is important for the success of organizations, project managers and sponsors.

Not every project makes its way to the finish line, and not every project should. As a project manager or sponsor, you’re almost certain to find yourself, at some point in your career, running a project that has no chance of success, or that should never have been initiated in the first place.

The reasons why you should kill a project may vary. It could be the complexity involved, staff resource limitations, unrealistic project expectations, a naive and underdeveloped project plan, the loss of key stakeholders, higher priorities elsewhere, market changes, or some other element. Likely, it will be a combination of some or many of these possibilities.

What’s important is that you do it on time: 17 percent of IT projects fail so badly they can threaten the existence of a company (Calleam).

Keep an eye out for warning signs, ask yourself tough questions, and set aside your ego. By doing so, you can easily identify projects that need to be abandoned right away. You might find “Why killing projects is so hard (and how to do it anyway)” helpful in this process.

4) Project selection is rarely complete and neutral.

This is often because the organization’s strategy is not known, not developed, or cannot be applied to the project (see Observation 10).

But besides this there is the “principal-agent problem.” This means that your managers already know the criteria on which projects will be selected, and so they “optimize” their details accordingly. Even when these details are not “optimized,” this data is collected in an entirely incomplete and inconsistent manner.

And did you ever encounter the situation where projects were already decided on in other rooms than in the one where the decision should have been made? I sure have.

5) Organizations do far too many projects in parallel.

Traditional project portfolio management is all about value optimization and optimizing resource allocation. Both are designed in such a way that, in my opinion, they produce the opposite result. As I (and probably you too) have seen time and again, running projects in an organization at 100 percent utilization is an economic disaster.

Any small amount of unplanned work will cause delays, which will become even worse because of time spent on re-planning, and value is only created when it is delivered and not when it is planned. Hence, we should focus on delivering value as quickly as possible within our given constraints. See “Doing the right number of projects” for more details.
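A rough way to see why full utilization hurts is to borrow a result from queueing theory. The sketch below uses the textbook single-server (M/M/1) queue, which is an assumption brought in for illustration and not a model of any specific organization: in that model, the average time work waits in the queue equals the average processing time multiplied by utilization / (1 - utilization). The function name `queue_delay_factor` is hypothetical.

```python
def queue_delay_factor(utilization):
    """Mean waiting time in queue as a multiple of mean processing time (M/M/1)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1); delay is unbounded at 1")
    return utilization / (1 - utilization)


for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
    factor = queue_delay_factor(utilization)
    print(f"{utilization:.0%} utilization: work waits about {factor:.0f}x its processing time")
```

Under these assumptions, waiting time grows from about 1x the processing time at 50 percent utilization to about 99x at 99 percent, and it has no upper bound as utilization approaches 100 percent. Real project organizations are messier than this model, but the shape of the curve is the point.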

6) Projects are done too slowly.

Too many organizations try to save money on projects (cost efficiency) when the benefits of completing the project earlier far outweigh the potential cost savings. You might, for example, be able to complete a project with perfect resource management (all staff is busy) in 12 months for $1 million. Alternatively, you could hire some extra people and have them sitting around occasionally at a total cost of $1.5 million, but the project would be completed in only six months.

What's that six-month difference worth? Well, if the project is strategic in nature, it could be worth everything. It could mean being first to market with a new product or possessing a required capability for an upcoming bid that you don't even know about yet. It could mean impressing the heck out of some skeptical new client or being prepared for an external audit. There are many scenarios where the benefits outweigh the cost savings (see "Cost of delay" for more details).

On top of delivering the project faster, when you are done after six months instead of 12 months you can use the existing team for a different project, delivering even more benefits for your organization. So not only do you get your benefits for your original project sooner and/or longer, you will get those for your next project sooner as well because it starts earlier and is staffed with an experienced team.
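The trade-off in the example above can be made concrete with a break-even calculation: how much benefit per month must the finished project deliver for the faster, more expensive option to pay for itself? The sketch below is a simplification under stated assumptions: benefits accrue evenly per month, there is no discounting, and the extra value of freeing the team for the next project is ignored. The function name is hypothetical.

```python
def break_even_monthly_benefit(slow_cost, slow_months, fast_cost, fast_months):
    """Monthly benefit at which the extra cost of finishing earlier is recovered."""
    months_gained = slow_months - fast_months
    if months_gained <= 0:
        raise ValueError("the fast option must finish earlier than the slow one")
    extra_cost = fast_cost - slow_cost
    return extra_cost / months_gained


# Figures from the example: $1 million in 12 months vs. $1.5 million in 6 months.
threshold = break_even_monthly_benefit(1_000_000, 12, 1_500_000, 6)
print(f"Faster option pays off if the project is worth more than ${threshold:,.0f} per month")
```

With these figures the extra $500,000 buys six months, so the faster option wins whenever the delivered project is worth more than roughly $83,000 per month. Counting the benefits of the next project starting earlier would lower that threshold further.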

An important goal of your project portfolio management strategy should be to have a high throughput. It’s vital to get projects delivered fast so you start reaping your benefits, and your organization is freed up for new projects to deliver additional benefits.

7) The right projects should have gotten more money, talent and senior management attention.

Partly as a result of observations 5 and 6, but also because organizations fail to focus and agree on which projects are really important, many of those projects are spread too thin.

The method of always selecting “the next project on the list, from top to bottom, until the budget runs out” does not work as a selection method for the project portfolio. The problem here is that the right resources often receive far too little consideration. Even a rough consideration according to the principle “it looks good overall” can lead to bad bottlenecks in the current year.

Unlike money, people and management attention cannot be moved and scaled at will. This means that bottlenecks quickly become determining factors and conflict with strategic priority and feasibility. In addition, external capacities are not available in the desired quantity. Also, the process of phasing in new employees creates friction, costs time, and temporarily reduces the capacity of the existing team instead of increasing it.

8) Project success is neither defined nor measured.

Defining project success is actually one of the largest contributors to project success and I have written many times about it (see here, and here). When starting any project, it's essential to work actively with the organization that owns the project to define success across three levels:

i) Project delivery
ii) Product or service
iii) Business

The process of "success definition" should also cover how the different criteria will be measured (targets, measurements, time, responsible, etc.). Project success may be identified as all points within a certain range of these defined measurements. Success is not just a single point.

The hard part is identifying the criteria, importance, and boundaries of the different success areas. But only when you have done this are you able to manage and identify your projects as a success.
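The idea that success is a range rather than a single point can be sketched in a few lines of Python. Everything here is hypothetical: the class `SuccessCriterion`, the criteria names, and the boundaries are illustrative assumptions, and a real definition would come out of the discussion with the organization that owns the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessCriterion:
    level: str  # "project delivery", "product or service", or "business"
    name: str
    minimum: float
    maximum: float

    def is_met(self, measured):
        """A criterion is met when the measurement falls inside the agreed range."""
        return self.minimum <= measured <= self.maximum


def evaluate(criteria, measurements):
    """Map each criterion name to True/False; a missing measurement counts as not met."""
    return {
        c.name: c.name in measurements and c.is_met(measurements[c.name])
        for c in criteria
    }


# Hypothetical criteria, one per level.
criteria = [
    SuccessCriterion("project delivery", "budget overrun %", float("-inf"), 10),
    SuccessCriterion("product or service", "uptime %", 99.5, 100),
    SuccessCriterion("business", "new customers per month", 200, float("inf")),
]
measurements = {
    "budget overrun %": 7.5,
    "uptime %": 99.2,
    "new customers per month": 260,
}

results = evaluate(criteria, measurements)
project_successful = all(results.values())
print(results, project_successful)
```

In this sample the delivery and business criteria fall inside their ranges, but uptime falls just below its lower bound, so the project as a whole is not counted as a success. Whether all criteria must be met, or whether some weigh more than others, is itself part of the success definition.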

9) Critical assumptions are not validated.

For large or high-risk projects (what is large depends on your organization) it should be mandatory to do an assumption validation before you dive headfirst into executing the project. In this phase you should do a business case validation and/or a technical validation in the form of a proof of concept.

Even if you do this, your project isn’t guaranteed to succeed. The process of validation is just the start. But if you’ve worked through the relevant validations, you’ll be in a far better position to judge if you should stop, continue or change your project.

The goal of the validation phase is to delay the expensive and time-consuming work of a project until as late as possible in the process. It’s the best way to keep yourself focused, to minimize costs and to maximize your chance of a successful project. See “No validation? No project!” for more details on this.

10) Your organization has no clear strategy.

Without having a strategy defined and communicated in your organization it is impossible to do effective project portfolio management. I like the definitions of Mintzberg and De Flander regarding this.

“Strategy is a pattern in a stream of decisions.” – Henry Mintzberg            

First, there’s the overall decision—the big choice—that guides all other decisions. To make a big choice, we need to decide who we focus on—our target client segment—and we need to decide how we offer unique value to the customers in our chosen segment. That’s basic business strategy stuff.

But by formulating it this way, it helps us to better understand the second part: the day-to-day decisions—the small choices—that get us closer to the finish line. When these small choices are in line with the big choice, you get a Mintzberg pattern. So if strategy is a decision pattern, strategy execution is enabling people to create a decision pattern. In other words:

“Strategy execution is helping people make small choices in line with a big choice.” – Jeroen De Flander

This notion requires a big shift in the way we typically think about execution. Looking at strategy execution, we should imagine a decision tree rather than an action plan. Decision patterns are at the core of successful strategy journeys, not to-do lists.

To improve the quality of strategy implementation, we should shift our energy from asking people to make action plans to helping them make better decisions.

11) Ideas are not captured.

Although there is clearly no shortage of ideas within organizations, most organizations unfortunately seldom capture these ideas, except in the few cases where a handful of employees are sufficiently entrepreneurial to drive their own ideas through to implementation. This can happen in spite of the organization, rather than because of it.

Organizations are effective at focusing employees on their daily tasks, roles, and responsibilities. However, they are far less effective at capturing the other output of that process: the ideas and observations that result from it. It is important to remember that these ideas can be more valuable than an employee’s routine work. Putting in an effective process for capturing ideas provides an opportunity for organizations to leverage a resource they already have, already pay for, but fail to capture the full benefit of—namely, employee creativity.

To assume that the best ideas will somehow rise to the top, without formal means to capture them in the first place, is too optimistic. Providing a simplified, streamlined process for idea submission can increase project proposals and result in a better portfolio of projects. Simplification is not about reducing the quality of ideas, but about reducing the bureaucracy associated with producing them. Simplification is not easy, as it involves defining what is really needed before further due diligence is conducted on the project. It also means making the submission process easy to follow and locate, and driving awareness of it.

Conclusion

In the digital age, an effective project portfolio management function is a strategic necessity.

The dilemma of traditional project portfolio management is in granting too little relevance to the actual feasibility at the expense of strategic weighting. In actuality, it is more important to produce a portfolio that, in its entirety, has a real chance of succeeding. It should also be regarded not in terms of a fiscal year, but ideally in much smaller time segments with constant review and the possibility of reprioritization.

Therefore, the question should no longer be “what can we get for this fixed amount of money in the upcoming year,” but rather, “what is the order of priority for us today?”

Here, the perspective moves away from an annually recurring budget process and toward a periodic social exchange of results, knowledge, and modified framework conditions. In the best-case scenario, this penetrates the entire organization, from portfolio to project to daily work.

What do you think?


Wednesday, March 13, 2019

Case Study: The epic meltdown of TSB Bank

With clients locked out of their bank accounts, mortgage accounts vanishing, small businesses reporting that they could not pay their staff and reports of debit cards ceasing to work, the TSB Bank computer crisis of April 2018 has been one of the worst in recent memory. The bank’s CEO, Paul Pester, admitted in public that the bank was “on its knees” and that it faces a compensation bill likely to run to tens of millions of pounds.

But let’s start from the beginning. First, we’ll examine the background of what led to TSB’s ill-fated system migration. Then, we’ll look at what went wrong and how it could have been prevented.

September 2013

When TSB split from Lloyds Banking Group (LBG) in September 2013, a move forced by the EU as a condition of LBG’s taxpayer bailout in 2008, a clone of the original group’s computer system was created and rented to TSB for £100m a year.

That banking system was a combination of many old systems for TSB, BOS, Halifax, Cheltenham & Gloucester, and others, the product of integrating HBOS with Lloyds during the banking crisis.

Under this arrangement, LBG held all the cards. It controlled the system and offered it as a costly service to TSB when it was spun off from LBG.

March 2015

When the Spanish Banco Sabadell bought TSB for £1.7bn in March 2015, it put into motion a plan it had successfully executed in the past for several other smaller banks it had acquired: merge the bank’s IT systems with its own Proteo banking software and, in doing so, save millions in IT costs.

Sabadell was warned in 2015 that its ambitious plan was high risk and that it was likely to cost far more than the £450m Lloyds was contributing to the effort.

“It is not overly generous as a budget for that scale of migration,” John Harvie, a director of the global consultancy firm Protiviti, told the Financial Times in July 2015. But the Proteo system was designed in 2000 specifically to handle mergers such as that of TSB into the Spanish group, and Sabadell pressed ahead.

Summer 2016

By the summer of 2016, work on developing the new system was meant to be well underway and December 2017 was set as a hard-and-fast deadline for delivery.

The time period to develop the new system and migrate TSB over to it was just 18 months. People at TSB were saying that Sabadell had done this many times in Spain. But migrating a small Spanish local bank is not the same as migrating off sprawling LBG legacy systems.

To make matters worse, the Sabadell development team did not have full control—and therefore a full understanding—of the system they were trying to migrate client data and systems from because LBG was still the supplier.

Autumn 2017

By the autumn the system was not ready. TSB announced a delay, blaming the possibility of a UK interest rate rise—which did materialize—and the risk that the bank might leave itself unable to offer mortgage quotes over a crucial weekend.

Sabadell pushed back the switchover to April 2018 to try to get the system working. It was an expensive delay because the fees TSB had to pay to LBG to keep using the old IT system were still clocking up: CEO Pester put the bill at £70m.

April 2018

On April 23, Sabadell announced that Proteo4UK—the name given to the TSB version of the Spanish bank’s IT system—was complete, and that 5.4m clients had been “successfully” migrated over to the new system.

Josep Oliu, the chairman of Sabadell, said: “With this migration, Sabadell has proven its technological management capacity, not only in national migrations but also on an international scale.”

The team behind the development were celebrating. In a LinkedIn post that has since been removed, those involved in the migration were describing themselves as “champions,” a “hell of a team,” and were pictured raising glasses of bubbly to cheers of “TSB transfer done and dusted.”

However, only hours after the switch was flicked, systems crumpled and up to 1.9m TSB clients who use internet and mobile banking were locked out.

Twitter had a field day as clients frustrated by the inability to access their accounts or get through to the bank’s call centers started to vent their anger.

Clients reported receiving texts saying their cards had been used abroad; that they had discovered thousands of pounds in their accounts they did not have; or that mortgage accounts had vanished, multiplied or changed currency.

One bemused account holder showed his TSB banking app recording a direct debit paid to Sky Digital 81 years from now. Some saw details of other people’s accounts, and holidaymakers complained that they had been left unable to pay restaurant and hotel bills.

TSB, to clients’ fury, at first insisted the problems were only intermittent. At 3:40 a.m. on Wednesday, April 25, Pester tweeted that the system was “up and running,” only to be forced to apologize the next day and admit it was actually only running at 50 percent capacity.

On Thursday he admitted the bank was on its knees, announced that he was personally seizing control of the attempts to fix the problem from his Spanish masters, and said he had hired a team from IBM to do the job. Sabadell said it would probably be the following week before normal service returned.

The financial ombudsman and the Financial Conduct Authority have launched investigations. The bank has been forced to cancel all overdraft fees for April and raise the interest rate it pays on its classic current account in a bid to stop disillusioned clients from taking their business elsewhere.

The software, which Pester had boasted in September was 2,500 man-years in the making with more than 1,000 people involved, has been a client service disaster that will cost the bank millions and tarnish its reputation for years.

The basic principles of a system migration

The two main things to avoid in a system migration are an unplanned outage of the service for users and loss of data, either in the sense that unauthorized users have access to data, or in the sense that data is destroyed.

In most cases, outages cannot be justified during business hours, so migrations must typically take place within the limited timeframe of a weekend. To be sure that a migration over a weekend will run smoothly, it is normally necessary to perform one or more trial migrations in non-production environments, that is, migrations to a copy of the live system which is not used by or accessible to real users. The trial migration will expose any problems with the migration process, and these problems can be fixed without any risk of affecting the service to users.

Once the trial migration is complete, has been tested, and any problems with it have been fixed, the live migration can be attempted. For a system of any complexity, the go-live weekend must be carefully pre-planned hour by hour, ensuring that all the correct people are available and know their roles.

As part of the plan, a rollback plan should be put in place. The rollback plan is a planned, rapid way to return to the old system in case anything should go wrong during the live migration. One hopes not to have to use it because the live migration should not normally be attempted unless there has been a successful trial migration and the team is confident that all the problems have been ironed out.

On the go-live weekend, the live system is taken offline, and a period of intense, often round-the-clock, activity begins, following the previously made plan. At a certain point, while there is still time to trigger the rollback plan, a meeting will be held to decide whether to go live with the migration or not (a “go/no go” meeting).

If the migration work has gone well, and the migrated system is passing basic tests (there is no time at that point for full testing; full testing should have been done on the trial migration), the decision will be to go live. If not, the rollback plan will be triggered and the system returned to the state it was in before the go-live weekend.
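
The go/no go logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `CheckResult` and `go_no_go`, the smoke-test names, and the timings are all hypothetical and are not taken from TSB's actual process.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CheckResult:
    name: str     # hypothetical name of a basic (smoke) test
    passed: bool


def go_no_go(checks, now, service_due_back, rollback_duration):
    """Decide "go" or "rollback" at the go/no go meeting.

    The meeting is only meaningful while a rollback would still
    finish before the service is due back online.
    """
    rollback_deadline = service_due_back - rollback_duration
    if now > rollback_deadline:
        raise ValueError("meeting held too late: rollback no longer fits")
    failed = [c.name for c in checks if not c.passed]
    decision = "go" if not failed else "rollback"
    return decision, failed


# Hypothetical example: service must be back Monday 06:00, rollback takes 8 hours.
checks = [CheckResult("login", True), CheckResult("balance matches source", False)]
decision, failed = go_no_go(
    checks,
    now=datetime(2024, 1, 7, 18, 0),
    service_due_back=datetime(2024, 1, 8, 6, 0),
    rollback_duration=timedelta(hours=8),
)
print(decision, failed)
```

The point of the deadline check is that a go/no go meeting held after the rollback window has closed is no longer a real decision.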

If the task of migration is so great that it is difficult to fit it into a weekend, even with very good planning and preparation, it may be necessary to break it into phases. The data or applications are broken down into groups which are migrated separately.

This approach reduces the complexity of each group migration compared to one big one, but it also has disadvantages. If the data or applications are interdependent, it may cause performance issues or other technical problems if some are migrated while others remain, especially if the source and destination are physically far apart.

A phased migration will also normally take longer than a single large migration, which will add cost, and it will be necessary to run two data centers in parallel for an extended period, which may add further cost. In TSB’s case, it may have been possible to migrate the clients across in groups, but it is hard to be sure without knowing its systems in detail.
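
As a rough illustration of a phased approach, the sketch below packs clients into migration waves while keeping interdependent clients (for example, holders of a joint account) in the same wave. The function name `plan_waves`, the grouping, and the wave size are assumptions made for this example; nothing here reflects how TSB's data was actually structured.

```python
def plan_waves(groups, max_wave_size):
    """Pack interdependent client groups into waves without splitting a group.

    groups: iterable of sets of client ids that must migrate together.
    max_wave_size: the most clients that fit into one migration window.
    """
    waves, current = [], []
    for group in groups:
        members = sorted(group)
        if len(members) > max_wave_size:
            raise ValueError("group is larger than a wave and cannot stay together")
        # Start a new wave when this group would overflow the current one.
        if len(current) + len(members) > max_wave_size:
            waves.append(current)
            current = []
        current.extend(members)
    if current:
        waves.append(current)
    return waves


# Hypothetical client ids; {"a", "b"} share an account and must move together.
print(plan_waves([{"a", "b"}, {"c"}, {"d", "e"}], max_wave_size=3))
```

A greedy packing like this is simple rather than optimal; the hard part in practice is discovering the interdependencies in the first place.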

Testing a system migration

Migrations can be expensive because it can take a great deal of time to plan and perform the trial migration(s). With complex migrations, several trial migrations may be necessary before all the problems are ironed out. If the timing of the go-live weekend is tight, which is very likely in a complex migration, it will be necessary to stage some timed trial migrations—“dress rehearsals.” Dress rehearsals are to ensure that all the activities required for the go-live can be performed within the timeframe of a weekend.
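
The question a dress rehearsal answers is simple arithmetic: do the measured step durations fit inside the weekend, with enough time left over to roll back? A minimal sketch, in which the step names, durations, and the function name `fits_window` are all hypothetical:

```python
from datetime import timedelta


def fits_window(step_durations, window, rollback_reserve):
    """Check whether timed rehearsal steps fit the go-live window.

    Returns (fits, slack), where slack is the time left after all steps
    once the rollback reserve has been set aside (negative if over budget).
    """
    total = sum(step_durations.values(), timedelta())
    budget = window - rollback_reserve
    return total <= budget, budget - total


# Hypothetical timings measured during a dress rehearsal.
steps = {
    "export": timedelta(hours=10),
    "transform": timedelta(hours=14),
    "load": timedelta(hours=12),
    "smoke tests": timedelta(hours=4),
}
fits, slack = fits_window(steps, window=timedelta(hours=48),
                          rollback_reserve=timedelta(hours=6))
print(fits, slack)
```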

Trial migrations should be tested. In other words, once a trial migration has been performed, the migrated system, which will be hosted in a non-production environment, should be tested. The larger and more complex the migration, the greater the requirement for testing. Testing should include functional testing, user acceptance testing and performance testing.

Functional testing of a migration is somewhat different from functional testing of a newly developed piece of software. In a migration, the code itself may be unchanged, and if so there is little value in testing code which is known to work. Instead, it is important to focus the testing on the points of change between the source environment and the target. The points of change typically include the interfaces between each application and whatever other systems it connects to.

In a migration, there is often change in interface parameters used by one system to connect to another, such as IP addresses, database connection strings, and security credentials. The normal way to test the interfaces is to exercise whatever functionality of the application uses the interfaces. Of course, if code changes are necessary as part of a migration, the affected systems should be tested as new software.
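
One way to find the points of change is to compare the interface parameters of the source and target environments and focus testing on whatever differs. The sketch below assumes each environment's parameters are available as a flat dictionary; the keys and values shown are made up for illustration.

```python
def changed_interfaces(source, target):
    """Return {parameter: (source_value, target_value)} for every
    parameter that differs or exists in only one environment."""
    keys = source.keys() | target.keys()
    return {
        key: (source.get(key), target.get(key))
        for key in sorted(keys)
        if source.get(key) != target.get(key)
    }


# Hypothetical interface parameters for one application.
source_env = {"payments.host": "192.0.2.10", "ledger.db": "db-old/ledger", "smtp.port": 25}
target_env = {"payments.host": "198.51.100.7", "ledger.db": "db-new/ledger", "smtp.port": 25}

for name, (old, new) in changed_interfaces(source_env, target_env).items():
    print(f"{name}: {old} -> {new}")
```

Each parameter listed in the result points at an interface whose functionality should be exercised in the trial migration.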

In the case of TSB, the migration involved moving client bank accounts from one banking system to another. Although both the source and target systems were mature and well-tested, they had different code bases, and it is likely that the amount of functional testing required would have approached that required for new software.

User acceptance testing is functional testing performed by users. Users know their application well and therefore have an ability to spot errors quickly, or see problems that IT professionals might miss. If users test a trial migration and express themselves satisfied, it is a good sign, but not adequate on its own because, amongst other things, a handful of user acceptance testers will not test performance.

Performance testing checks that the system will work fast enough to satisfy its requirements. In a migration the normal requirement is for there to be little or no performance degradation as a result of the migration. Performance testing is expensive because it requires a full-size simulation of the systems under test, including a full data set.

If the data is sensitive, and in TSB’s case it was, it will be necessary, at significant time and cost, to protect the data by security measures as stringent as those protecting the live data, and sometimes by anonymizing the data. In the case of TSB, the IBM inquiry into what went wrong identified insufficient performance testing as one of the problems.
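
A "little or no degradation" requirement can be made testable by comparing a latency percentile before and after the migration. The following is a sketch only: the 95th percentile, the 10 percent tolerance, and the sample figures are assumptions chosen for the example, not values from the TSB project or the IBM findings.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def degradation_ok(baseline_ms, migrated_ms, pct=95, tolerance=0.10):
    """True if the migrated system's percentile latency is within
    `tolerance` (as a fraction) of the baseline's."""
    base = percentile(baseline_ms, pct)
    new = percentile(migrated_ms, pct)
    return new <= base * (1 + tolerance)


# Hypothetical response times in milliseconds from two test runs.
baseline = [100, 110, 120, 130, 140]
migrated = [105, 115, 125, 135, 150]
print(degradation_ok(baseline, migrated))
```

A check like this is only meaningful if both runs use a realistic load and a full-size data set, which is exactly what makes performance testing expensive.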

What went wrong?

Where did it go wrong for TSB? The bank was attempting a very complex operation. There would have been a team of thousands drawn from internal staff, staff from IT service companies, and independent contractors. Their activities would have had to be carefully coordinated, so that they performed the complex set of tasks in the right order to the right standard. Many of them would have been rare specialists. If one such specialist is off sick, it can block the work of hundreds of others. One can imagine that, as the project approached go-live, having been delayed several times before, the trial migrations were largely successful but not perfect.

The senior TSB management would have been faced with a dilemma of whether to accept the risks of doing the live migration without complete testing in the trials, or to postpone go-live by several weeks and report to the board another slippage, and several tens of millions of pounds of further cost overrun. They gambled and lost.

How could TSB have done things differently?

Firstly, a migration should have senior management backing. TSB clearly had it, but with smaller migrations, it is not uncommon for the migration to be some way down senior managers’ priorities. This can lead to system administrators or other actors, whose reporting lines lead elsewhere from those doing the migration, frustrating key parts of the migration because their managers are not ordering them or paying them to cooperate.

Secondly, careful planning and control is essential. It hardly needs saying that it is not possible to manage a complex migration without careful planning and those managing the migration must have an appropriate level of experience and skill. In addition, however, the planning must follow a sound basic approach that includes trial migrations, testing, and rollback plans as described above. While the work is going on, close control is important. Senior management must stay close to what is happening on the ground and be able to react quickly, for example by fast-tracking authorizations, if delays or blockages occur.

Thirdly, there must be a clear policy on risk, and the policy should be stuck to. What criteria must be met for go-live? Once this has been determined, the amount of testing required can be determined. If the tests are not passed, there must be the discipline not to attempt the migration, even if it will cost much more.

Finally, in complex migrations, a phased approach should be considered.

Conclusion

In the case of TSB Bank, the problems that occurred after the live migration were either not spotted in testing, or they were spotted but the management decided to accept the risk and go live anyway. If they were not spotted, it would indicate that testing was not comprehensive enough—IBM specifically pointed to insufficient performance testing. That could be due to a lack of experience among the key managers. If the problems were spotted in testing, it implies weak go-live criteria and/or an inappropriate risk policy. IBM also implied that TSB should have performed a phased migration.

It may be that the public will never fully know what caused TSB’s migration to go wrong, but it sounds like insufficient planning and testing were major factors. Sensitive client data was put at risk, and clients suffered long unplanned outages, resulting in CEO Paul Pester being summoned to the Treasury select committee and the Financial Conduct Authority launching an investigation into the bank. Ultimately Pester lost his job.

When migrating IT systems in the financial sector, cutting corners is dangerous. Ultimately, TSB’s case goes to show that the consequences can be dire. For success one needs to follow some basic principles, use the right people, and be prepared to allocate sufficient time and money to planning and testing. Only then is there a realistic chance that the system migration will succeed.



Sunday, March 10, 2019

10 Questions to ask before signing your cloud computing contract

As pointed out in a previous article on cloud computing project management, two things that have changed a lot with the rise of cloud usage are vendor relationships and contracts.

Contracts for cloud computing are rather inflexible by nature. In a cloud computing arrangement, what's negotiable and what's not? Cloud computing may be highly virtualized and digitized, but it is still based on a relationship between two parties consisting of human beings.

Below you will find 10 questions you should have answered before you sign your cloud computing contract. In my experience, these are also the biggest discussion points between a cloud provider and you as a cloud customer when negotiating such a contract.

1) How can you exit if needed? 

The very first question you should ask is, how do you get out when you need to? Exit strategies need to be carefully thought out before committing to a cloud engagement.

Vendor lock-in typically results from long-term initial contracts. Some providers want early termination fees (which may be huge) if customers terminate a fixed-term contract early for convenience, because the provider's fixed setup costs were designed to be recovered over the full term.

Often, contracts require "notice of non-renewal within a set period before expiry," causing customers to miss the window to exit the arrangement. Such onerous automatic renewal provisions can be negotiated out up front.
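
To avoid missing the window, it helps to work out the last possible notice date as soon as the contract is signed. A minimal sketch, assuming the notice period is expressed in calendar days; the dates, the 90-day period, and the function names are illustrative only, and the contract wording always takes precedence.

```python
from datetime import date, timedelta


def last_notice_date(expiry, notice_days):
    """Latest date on which notice of non-renewal can still be given."""
    return expiry - timedelta(days=notice_days)


def can_still_exit(today, expiry, notice_days):
    """True if notice given today would still fall inside the window."""
    return today <= last_notice_date(expiry, notice_days)


# Hypothetical contract: expires 31 December 2019, 90 days' notice required.
expiry = date(2019, 12, 31)
print(last_notice_date(expiry, 90))
print(can_still_exit(date(2019, 11, 1), expiry, 90))
```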

One other very important aspect of your exit strategy is the next question.

2) Who maintains your data for legal or compliance purposes, and what happens to it when contracts are terminated?

I have not seen a lot of negotiation yet around data retention for legally required purposes, such as litigation e-discovery or preservation as evidence upon law enforcement request. I think this issue will become more important in the future. One area that is being negotiated with increasing urgency is the ability to have your data returned upon contract termination. There are several aspects here: data format, what assistance (if any) providers will give users, what (if anything) providers charge for such assistance, and data retention period.

Another question that comes up is how long after termination users have to recover data before deletion. Many providers delete all data immediately or after a short period (often 30 days), but some users obtain longer grace periods, for example two months, perhaps requiring notice to users before deletion.

3) Who is liable for your damages from interruptions in service? 

For the most part, cloud providers refuse to accept liability for service interruption issues. Providers state liability is non-negotiable, and “everyone else accepts it.” Even large organizations have difficulty getting providers to accept any monetary liability. This can be a deal-breaker.

4) What about service level agreements (SLAs)? 

Service level agreements are another important piece of a cloud contract, and come in many flavors, since standards are lacking in this area. SLAs are often highly negotiable, as they can be adjusted through pricing: the more you pay, the better performance you are guaranteed. If SLAs are not kept, payment in the form of a service credit is normal. But how much?
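
To make "how much?" concrete, it helps to translate an availability percentage into allowed downtime and to model the credit tiers explicitly. The tiers, percentages, and function names below are hypothetical and do not come from any particular provider's SLA.

```python
def allowed_downtime_minutes(availability_pct, days=30):
    """Downtime per period that still meets the availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


def service_credit_pct(measured_pct, tiers):
    """Credit (as a percentage of the fee) for the measured availability.

    tiers: list of (threshold_pct, credit_pct); the credit applies when
    measured availability falls below the threshold. Lowest threshold wins.
    """
    for threshold, credit in sorted(tiers):
        if measured_pct < threshold:
            return credit
    return 0


# Hypothetical tiers: below 99.9% earns 10% credit, below 99.0% earns 25%.
tiers = [(99.9, 10), (99.0, 25)]
monthly_fee = 20_000
credit = service_credit_pct(99.5, tiers)
print(round(allowed_downtime_minutes(99.9), 1), "minutes allowed per 30 days")
print(monthly_fee * credit / 100, "credited for a month at 99.5% availability")
```

Running the numbers like this often shows that the credit is small compared with the business cost of the outage, which is worth knowing before you sign.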

5) Does availability extend to your data? 

Cloud providers tend to emphasize how redundant and fault-tolerant their clouds are, but cloud customers still need to do their due diligence. The situation resembles fire insurance for an apartment building: the provider will rebuild the structure but not compensate the renter for the damaged contents. While some providers will undertake to make the necessary number of backups, most will not take steps to ensure data integrity, or accept liability for data loss.

6) What about the privacy and residency of your data?

GDPR is an important piece of data privacy legislation that regulates how the personal data of individuals in the EU must be secured and protected. It restricts transferring such data outside the EU unless additional safeguards are in place.

With the European Court of Justice’s ruling in 2015 that the Safe Harbor framework is inadequate to protect the privacy rights of EU citizens when their data is processed in the United States, it’s important to check if your U.S. provider is a member of the Privacy Shield Framework.

Some providers will not disclose data center locations. Verifying that data are actually residing and processed in the data centers claimed by providers is technically difficult.

7) What happens when your provider decides to change their service?

Many standard terms allow providers to change certain or all contract terms unilaterally. Enterprise cloud providers are more likely to negotiate these provisions up front, as are infrastructure providers. But for the bulk of businesses using more commoditized Software as a Service (SaaS) applications, you might have to accept providers’ rights to change features.

Customers are able to negotiate advance notifications of changes to Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) engagements; however, as these reach deeper into your organizational systems, these changes could result in you having to rewrite application code created to integrate with proprietary provider application programming interfaces.

8) How do you manage your intellectual property rights? 

Intellectual property rights are frequently debated. Providers’ terms may specify they own deliverables, for example, documentation. However, the line is sometimes unclear between a customer’s application and the provider’s platform and integration tools. Where integrators develop applications for their own customers, customers might require intellectual property rights ownership, or at least rights to use the software free after contract termination or transfer.

Another issue of contention concerns ownership rights to service improvements arising from customer suggestions or bug fixes. Providers may require customers to assign such rights. Yet customers may not want their suggested improvements to be made available to competitors.

9) What are the reasons for your service termination?

Non-payment is the leading reason providers terminate contracts with customers, but there are many other issues that crop up, which may or may not be the customer's fault. Other reasons providers pull their services include material breach, breach of acceptable use policies, or upon receiving third-party complaints regarding breach of their intellectual property rights.

The main issue is that the actions of a single user at a customer may trigger rights to terminate the whole service, because many services lack granularity. For instance, an IaaS provider may not be able to locate and terminate the offending VM instance, and therefore needs to terminate the entire service.

Providers, while acknowledging this deficiency, still refuse to change terms, but state they would take a commercial approach to discussions should issues arise.

10) When was your provider’s last independent audit?

Most cloud providers boast their compliance with the regulatory scheme du jour. But any cloud customer—especially one working in a highly regulated industry—should ask a provider: "How long ago was your last independent audit against the latest [relevant] regulatory protocols?"

Even for cloud customers that don't operate within a highly regulated sector, it might be a plus to know that a selected provider can pass a stringent regulatory audit.

Conclusion

When cloud customers seek to negotiate important data security and data privacy provisions, a common response from cloud providers is that the terms and conditions with which the customer has been presented are a "standard contract", implying that they are, as such, non-negotiable.

A good counter-response is: "I understand—and these are my standard amendments to the standard contract."

Try asking a cloud provider if they have ever added, waived, or modified a contentious provision for other customers. See how they respond.

An organization's data represents its crown jewels. As such, no cloud customer should just lie down for a disadvantageous, and potentially harmful, cloud contract.

A cloud contract is just that: a contract. As such, it carries with it all of the normal pitfalls of a contractual relationship—and a few specialized ones. By asking the right questions, you’ll ensure your rights are protected.
