Monday, January 06, 2025

Case Study 21: The Australian Securities Exchange (ASX) $250 Million CHESS Blunder

The Australian Securities Exchange (ASX) embarked on an ambitious journey to replace its 25-year-old Clearing House Electronic Subregister System (CHESS) with a state-of-the-art, blockchain-based platform. 

Initially envisioned as a groundbreaking project to enhance efficiency, security, and scalability, the CHESS replacement project quickly turned into a cautionary tale. 

The initiative faced repeated delays and escalating costs before its ultimate suspension in November 2022. Stakeholders, including market participants and regulators, expressed frustration with the project's mismanagement, questioning the feasibility of such ambitious undertakings.

Despite being heralded as a world-first use of distributed ledger technology (DLT) in a financial market, the ASX's CHESS replacement project encountered numerous challenges. The ripple effects of the failure impacted Australia's financial ecosystem, as trust in ASX's ability to manage critical infrastructure took a significant hit. This case study examines the series of missteps, governance issues, and technological challenges that led to the demise of one of the most ambitious financial infrastructure projects of its time.

In total, the project's failure is estimated to have cost the ASX and its stakeholders upwards of AUD 250 million in direct expenses, with additional indirect costs stemming from lost time, diminished trust, and delayed market enhancements. ASIC Chair Joe Longo described the situation as "a watershed moment for governance in financial infrastructure." The failure also dealt a blow to the broader narrative around blockchain's transformative potential in finance. This case study highlights the lessons other organizations can learn from the ASX's missteps.

Background

The Clearing House Electronic Subregister System (CHESS) has served as the backbone of Australia's financial market infrastructure since 1994. Operating as the primary platform for clearing, settlement, and record-keeping of share transactions, CHESS has been critical to ensuring the efficiency and integrity of the market. However, as financial markets grew more complex, the aging CHESS system began to show limitations, including scalability issues and difficulty integrating with modern technologies.

In 2015, ASX initiated a strategic review of its market infrastructure. The review highlighted the need for a modern system that could support increased trading volumes, enhanced data capabilities, and real-time reporting. Blockchain technology emerged as an appealing solution, promising transparency, immutability, and efficiency. ASX partnered with Digital Asset, a New York-based fintech firm specializing in distributed ledger technology, to design and implement the new system. ASX CEO Dominic Stevens stated at the time: "Blockchain technology offers unprecedented opportunities to transform the way markets operate."

Stakeholders initially greeted the project with optimism. ASX promised significant benefits, including reduced reconciliation processes, enhanced market efficiency, and lower operational costs. The project was envisioned to be completed by 2020, with a transparent and collaborative approach involving market participants and regulators. However, these early promises soon gave way to skepticism as challenges mounted.

The scope of the project extended far beyond simply replicating the functionalities of CHESS. It sought to reimagine the entire post-trade process, embedding blockchain technology into critical financial infrastructure. This level of ambition introduced complexity, requiring extensive customizations, thorough testing, and close coordination among stakeholders. The ambitious scope, combined with technological and governance challenges, sowed the seeds of its eventual failure.

Don’t let your project fail like this one!

Discover here how I can help you turn it into a success.

For a list of all my project failure case studies just click here.

Timeline of Events

2015: Strategic Review and Vision

ASX began a review of its aging CHESS infrastructure to identify a replacement. Blockchain technology was identified as a promising solution, leading to the selection of Digital Asset as the primary technology partner.

2017: Project Announcement

ASX formally announced the CHESS replacement project, promising implementation by 2020 and widespread benefits for market participants. Initial enthusiasm was tempered by questions about blockchain's suitability for such a critical system. Market analyst Sarah Klein noted, "The industry was excited but cautious about the risks of untested technology."

2018: Early Development and Testing

Development efforts commenced, with ASX emphasizing collaboration with industry participants. Early testing revealed scalability issues, prompting adjustments to project timelines.

2020: First Delays Announced

The COVID-19 pandemic disrupted timelines, with ASX announcing a revised implementation date of 2022. Stakeholders raised concerns about insufficient transparency in the project's progress.

2021: Mounting Challenges

Reports surfaced that Digital Asset’s blockchain platform struggled to meet performance benchmarks. Additional delays were announced, pushing the go-live date to 2023. ASX cited the complexity of integrating blockchain technology into existing workflows. "The timelines were ambitious from the outset," said finance professor Alan Morrison.

2022: Project Suspension

An independent review commissioned by ASX highlighted significant gaps in project management and governance. ASX officially suspended the project after further testing revealed that the platform was not fit for purpose. The total sunk cost reached AUD 250 million. ASX CEO Helen Lofthouse acknowledged, "This outcome is deeply disappointing and a stark reminder of the need for governance at every level."

What Went Wrong?

Underestimation of Complexity

ASX underestimated the technical and operational complexities of integrating blockchain technology into critical market infrastructure. Blockchain, while promising, required significant adaptations to meet the high-performance standards of financial markets. 

Early limited feasibility studies failed to fully capture these challenges, leading to overconfidence in project timelines and deliverables. As technology consultant Mark Connors noted, "Blockchain was treated as a silver bullet without fully understanding the nuances of its integration."

This lack of understanding was evident in scalability tests that revealed major bottlenecks. Developers struggled to balance the decentralized nature of blockchain with the speed and efficiency demands of financial transactions. These challenges were compounded by the need to integrate the new system with legacy infrastructure.
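The kind of bottleneck those tests surfaced can be made concrete with a back-of-the-envelope throughput check. All numbers below are hypothetical and purely illustrative; they are not ASX's or Digital Asset's actual figures:

```python
# Toy throughput model of a block-committing ledger.
# All numbers are hypothetical, chosen only to illustrate the mismatch
# between block-based commits and peak market demand.

block_interval_s = 2.0   # seconds between committed blocks (assumed)
txs_per_block = 1_000    # transactions per block (assumed)

ledger_tps = txs_per_block / block_interval_s  # sustained throughput: 500 tx/s

peak_market_tps = 5_000  # hypothetical peak settlement demand

print(f"ledger sustains {ledger_tps:.0f} tx/s, peak demand is {peak_market_tps} tx/s")
print(ledger_tps >= peak_market_tps)  # False: the ledger is the bottleneck
```

Raising `txs_per_block` or shortening `block_interval_s` closes the gap on paper, but in a replicated ledger both moves increase the data every node must validate and store, which is exactly the decentralization-versus-throughput trade-off developers struggled with.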

Stakeholder Misalignment

The project suffered from inadequate communication and alignment with key stakeholders. Market participants expressed frustration over a lack of transparency and insufficient opportunities to provide input. "The ASX’s approach alienated many of us," said James Porter, a broker with over 20 years of experience. "We felt sidelined during critical phases of the project."

As a result, critical operational needs were overlooked, further complicating the implementation process. This misalignment created friction between ASX and its stakeholders, eroding trust and delaying progress.

Over-Reliance on Emerging Technology

Blockchain technology, though innovative, was still in its infancy when ASX committed to the project. Relying on an unproven technology for such a critical system introduced significant risks, including performance bottlenecks and integration challenges. "The decision to go all-in on blockchain was premature," said independent analyst Fiona Wong. "The technology wasn’t ready for the scale required."

Insufficient Risk Management

ASX failed to implement robust risk management practices, particularly in identifying and mitigating risks associated with scalability and performance. Testing protocols revealed issues late in the development cycle, compounding delays and costs. "By the time problems were identified, it was often too late to course-correct," observed consultant Ethan Harris.

Governance and Oversight Failures

Weak governance structures allowed issues to persist unaddressed. The independent review commissioned in 2022 highlighted a lack of clear accountability and ineffective oversight mechanisms. Decision-making processes were often slow and reactive, exacerbating project delays. ASIC Chair Joe Longo remarked, "Governance failures were at the heart of this project’s downfall."

Limited Independent Assurance 

EY, contracted to provide assurance over the CHESS replacement project, failed to identify and escalate critical risks early in the development cycle. Their reviews often focused on procedural compliance rather than probing the feasibility and scalability of the proposed solution. 

"Assurance without substantive scrutiny is a missed safeguard," said corporate governance expert Dr. Olivia Marks. The absence of deeper interrogation into the project's technical risks meant that systemic issues, such as blockchain's scalability challenges, were not flagged until significant resources had already been spent.

Reliance on a Single Supplier

ASX’s decision to rely exclusively on Digital Asset as the sole technology provider created significant dependencies and risks. With no alternative suppliers in place, ASX was unable to pivot when Digital Asset’s blockchain solution encountered performance and scalability issues. 

"Diversity in supplier relationships is critical for mitigating risks," said IT procurement specialist Andrew Carter. The lack of competitive bidding also limited opportunities for ASX to benchmark costs or explore other technical solutions that might have been more suited to the scale and complexity of the CHESS replacement.

How ASX Could Have Done Things Differently

Conducting Comprehensive Feasibility Studies with Pilot Testing

ASX could have invested more time in understanding the practical implications of implementing blockchain technology at scale. Comprehensive feasibility studies combined with phased pilot testing would have provided crucial insights into technical and operational hurdles. 

Diversifying Supplier Relationships

Relying on a single supplier limited ASX’s ability to pivot when issues with Digital Asset arose. Engaging multiple suppliers would have introduced healthy competition, fostered innovation, and mitigated the risks of over-dependence. 

IT procurement specialist Andrew Carter noted, "Supplier diversity is key to building resilient systems. It ensures flexibility and access to alternative solutions when challenges emerge." A multi-vendor approach could have provided ASX with backup options during critical phases.

Enhancing Stakeholder Engagement

Throughout the CHESS replacement project, communication gaps between ASX and its stakeholders contributed to misaligned expectations and operational oversights. Greater stakeholder involvement, particularly from brokers and institutional investors, would have ensured that the system's design aligned with real-world needs. Regular workshops, feedback loops, and transparency around project milestones would have also helped build trust and resolve conflicts early.

James Porter, a veteran broker, emphasized, "Early and consistent engagement would have made a world of difference. We felt sidelined, which only added to frustrations as issues emerged." Greater collaboration would have ensured that critical user requirements were accounted for, reducing resistance and easing eventual adoption.

Establishing Independent Project Assessments

ASX could have benefited from appointing an independent body with the expertise and authority to oversee the project. This body should have had the remit to assess technical decisions, validate risk mitigation strategies, and ensure accountability across all project phases. Independent oversight helps flag early warning signs, provide actionable recommendations, and maintain transparency with regulators and stakeholders.

Dr. Olivia Marks, a corporate governance expert, noted, "Independent assessments bring objectivity and rigor to complex projects. They can challenge assumptions and prevent tunnel vision among project leaders." A well-structured independent review process would have provided additional scrutiny, particularly during critical milestones like vendor selection and scalability testing.

Strengthening Governance and Oversight

Effective governance structures are critical for large-scale projects like the CHESS replacement. ASX's governance approach, described as reactive and fragmented, left key risks unaddressed for too long. Strengthening governance frameworks with clear accountability, decision-making protocols, and escalation mechanisms could have prevented many of the delays and inefficiencies observed.

ASIC Chair Joe Longo remarked, "Proactive oversight and clear accountability are essential in projects of this magnitude. Weak governance structures create an environment where small issues can snowball into systemic failures." Implementing a robust governance framework would have fostered better coordination among teams, enabling timely responses to challenges.

Closing Thoughts

The failure of ASX’s CHESS replacement project serves as a sobering reminder of the complexities and risks involved in large-scale technological transformations. While blockchain technology holds significant potential, its integration into critical infrastructure demands rigorous planning, stakeholder alignment, and adaptive management.

This case illustrates the importance of balancing ambition with practical execution. Organizations must ensure that emerging technologies are validated through thorough testing and phased implementation before full-scale deployment. Equally crucial is the need for robust governance structures, transparent communication, and independent oversight to mitigate risks and ensure accountability.

The lessons from ASX’s experience resonate across industries undergoing digital transformation. By embracing a disciplined and collaborative approach, organizations can unlock the transformative potential of technology while safeguarding against avoidable failures.

Sources

> "ASX Abandons CHESS Replacement Project," Financial Times, 2022.

> "Independent Review of the ASX CHESS Replacement Project," Accenture, 2022. 

> "EY CHESS Assurance Program Review," 2022-2024.

> "Special Report on CHESS Replacement," 2023.

> "Statutory Inquiry into ASIC and CHESS," 2024.

> "Challenges in Blockchain Adoption for Financial Systems," CIO Magazine, 2022. [https://www.cio.com]

> "Lessons from ERP and Blockchain Failures," TechRepublic, 2023.

Saturday, December 07, 2024

Case Study 20: The $4 Billion AI Failure of IBM Watson for Oncology

In 2011, IBM’s Watson took the world by storm when it won the television game show Jeopardy!, showcasing the power of artificial intelligence (AI). Emboldened by this success, IBM sought to extend Watson’s capabilities beyond trivia to address real-world challenges. 

Healthcare, with its complex data and critical decision-making needs, became a primary focus. Among its flagship initiatives was Watson for Oncology, a system designed to assist doctors in diagnosing and treating cancer through AI-driven insights.

Cancer treatment represents one of the most intricate and rapidly evolving domains in medicine. With over 18 million new cases diagnosed globally each year, oncologists face an overwhelming amount of medical literature, treatment protocols, and emerging research. Watson for Oncology aimed to address this challenge by analyzing vast amounts of data to recommend evidence-based treatment plans, all in a matter of seconds.

IBM marketed Watson for Oncology as a revolutionary tool that could bridge the gap between cutting-edge research and clinical practice. Its promise was to assist oncologists in identifying personalized treatment options for patients, thereby improving outcomes and reducing variability in care. 

However, this ambitious vision quickly collided with the complex realities of cancer care, resulting in widespread criticism and eventual failure.

Background

At the start, the project had five lofty objectives:

1) Streamlining Clinical Decision-Making: Watson for Oncology aimed to provide oncologists with AI-generated insights, synthesizing vast amounts of data into actionable treatment recommendations.

2) Bridging Knowledge Gaps: With the rapid pace of medical advancements, Watson sought to keep clinicians updated on the latest evidence, clinical trials, and treatment protocols.

3) Improving Patient Outcomes: The system was designed to support personalized care by tailoring treatment recommendations to each patient’s unique genetic and clinical profile.

4) Expanding Access to Expertise: IBM envisioned Watson as a tool to democratize high-quality oncology care, particularly in resource-limited settings where access to specialists is constrained.

5) Establishing Market Leadership: Beyond healthcare, IBM sought to position Watson as a leader in AI applications, demonstrating the transformative potential of cognitive computing.

The project was supported by partnerships with leading institutions like Memorial Sloan Kettering Cancer Center (MSKCC) to train Watson for Oncology.

The partnership aimed to imbue Watson with the expertise of MSKCC’s oncologists by feeding it clinical guidelines, peer-reviewed literature, and patient case histories. The AI would then analyze patient records and suggest ranked treatment options based on the latest evidence. It was envisioned as a tool that could augment oncologists' expertise, particularly in under-resourced settings.

IBM invested heavily in the project, pouring billions into Watson Health, which encompassed Watson for Oncology. The company acquired several firms specializing in healthcare data and analytics, including Truven Health Analytics and Merge Healthcare. These acquisitions were meant to enhance Watson’s capabilities by providing access to large datasets and advanced imaging tools.

Initial trials and pilots were conducted in countries like India and China, where disparities in healthcare resources presented an opportunity for Watson to make a meaningful impact. However, reports soon emerged that the AI’s recommendations were often inconsistent with local clinical practices. For example, Watson’s reliance on U.S.-centric guidelines made it difficult to implement in regions with differing treatment standards or drug availability.

By 2018, skepticism was growing. High-profile reports detailed instances where Watson provided inappropriate or even unsafe recommendations. These challenges, coupled with declining revenues for IBM Watson Health, culminated in the program’s discontinuation in 2023.

This case study examines how a project with such potential faltered, offering lessons for future ventures at the intersection of AI and healthcare.

Timeline of Events

2011–2012: Watson's Post-Jeopardy! Evolution

Following its Jeopardy! success, IBM began exploring commercial applications for Watson, identifying healthcare as a priority. In 2012, IBM partnered with Memorial Sloan Kettering to develop Watson for Oncology, marking the start of an ambitious initiative.

2013: Initial Development and Training

Watson’s training began with curated data from MSKCC, including clinical guidelines and research publications. Early feedback highlighted challenges in teaching the system to interpret ambiguous or contradictory medical information.

2014: Pilot Testing at Memorial Sloan Kettering

MSKCC oncologists started testing Watson on hypothetical patient cases. Early results revealed gaps in the system’s knowledge and its tendency to offer impractical or unsafe recommendations, raising concerns about its readiness.

2015: Launch and Early Adoption

IBM officially launched Watson for Oncology with aggressive marketing campaigns. Hospitals in countries like Thailand, India, and South Korea signed adoption agreements, drawn by the promise of bringing world-class cancer care to underserved regions.

2016: Growing Skepticism Among Oncologists

Reports emerged of dissatisfaction among oncologists using Watson. Many found the system’s recommendations simplistic, biased toward MSKCC practices, and poorly adapted to local guidelines.

2017: Critical Media Coverage

Investigative reports revealed that some of Watson’s recommendations were based on hypothetical scenarios rather than real-world data. These revelations damaged IBM’s credibility and raised ethical questions about its marketing claims.

2018: Customer Contracts Cancelled

Major clients, including MD Anderson Cancer Center, ended their contracts with IBM, citing high costs and underwhelming results. IBM began scaling back its marketing efforts for Watson for Oncology.

2019: Internal Restructuring at IBM Watson Health

Facing declining revenues, IBM restructured its Watson Health division. Resources were redirected to other AI projects, and development on Watson for Oncology slowed significantly.

2022: Watson Health Division Sold

IBM announced the sale of its Watson Health assets to private equity firm Francisco Partners, effectively marking the end of its ambitions in AI-driven cancer care.

2023: Retrospective Studies Highlighting System Flaws

Postmortem analyses identified systemic issues, including poor data quality, inadequate clinical validation, and unrealistic timelines, as key factors in the project’s failure.

What Went Wrong?

Overreliance on Limited Training Data
Watson’s knowledge base was heavily influenced by MSKCC’s practices, leading to recommendations that often failed to align with local guidelines or real-world cases. This lack of diversity in training data undermined the system’s global applicability.

Unrealistic Marketing Claims
IBM’s aggressive marketing exaggerated Watson’s capabilities, creating unrealistic expectations among customers. When the system failed to deliver, trust eroded quickly.

Inadequate Physician Involvement
Oncologists reported that Watson’s interface was not user-friendly and often disrupted their workflow. Limited engagement with end-users during development contributed to these usability issues.

Lack of Adaptability to Local Contexts
Watson struggled to accommodate variations in healthcare systems, resource availability, and cultural practices. This rigidity limited its usefulness in diverse settings.

Ethical and Transparency Concerns
IBM’s use of hypothetical cases and selective data to demonstrate Watson’s capabilities raised ethical red flags. Customers felt misled by the lack of transparency.

How IBM Could Have Done Things Differently?

Broader and More Diverse Training Data
IBM could have partnered with multiple institutions worldwide to train Watson on a broader dataset, ensuring recommendations were evidence-based and applicable in varied contexts.

Iterative Development with Physician Feedback
By involving more oncologists in the design and testing process, IBM could have identified and resolved usability issues early on, ensuring the system met clinical needs.

Transparent Communication of Capabilities
IBM should have been more transparent about Watson’s limitations, focusing on incremental benefits rather than overhyping its transformative potential.

Emphasis on Local Adaptability
Developing a system that could integrate local guidelines and resource constraints would have made Watson more practical for global deployment.

Strengthened Ethical Oversight
IBM could have established an independent advisory board to review marketing claims, data usage, and clinical validation processes, building trust with stakeholders.

Closing Thoughts

The failure of IBM Watson for Oncology offers valuable lessons for AI projects in healthcare and beyond. It highlights the importance of realistic expectations, rigorous validation, and end-user involvement in developing and deploying AI solutions. 

While IBM’s vision was ambitious, its execution fell short, underscoring the challenges of applying AI in complex, high-stakes domains. Moving forward, the healthcare industry must balance optimism about AI’s potential with a commitment to patient safety and ethical responsibility.

Sources

> IBM official press releases (2011–2021).

> Investigative reports from Stat News and The Wall Street Journal on Watson for Oncology’s challenges.

> Interviews with Memorial Sloan Kettering oncologists published in medical journals.

> Retrospective analyses in The Lancet Digital Health and JAMA Oncology.

> Public statements by IBM executives, including John Kelly III (SVP, IBM Research).

Tuesday, August 20, 2024

Case Study 19: The $20 Billion Boeing 737 Max Disaster That Shook Aviation

The Boeing 737 Max, once heralded as a triumph in aviation technology and efficiency, has since become synonymous with one of the most catastrophic failures in modern corporate history. 

This case study delves deep into the intricacies of the Boeing 737 Max program—a project that was initially designed to sustain Boeing's dominance in the narrow-body aircraft market but instead resulted in two fatal crashes, the loss of 346 lives, and an unprecedented global grounding of an entire fleet. 

Boeing's 737 series has long been a cornerstone of the company's commercial aircraft offerings. Since its inception in the late 1960s, the 737 has undergone numerous iterations, each improving upon its predecessor while maintaining the model's reputation for reliability and efficiency. 

By the 2000s, the 737 had become the best-selling commercial aircraft in history, with airlines around the world relying on its performance for short and medium-haul flights.

However, by the early 2010s, Boeing faced significant competition from Airbus, particularly with the introduction of the Airbus A320neo. The A320neo offered superior fuel efficiency and lower operating costs, thanks to its state-of-the-art engines and aerodynamic enhancements. 

In response, Boeing made the strategic decision to develop the 737 Max, an upgrade of the existing 737 platform that would incorporate similar fuel-efficient engines and other improvements to match the A320neo without necessitating extensive retraining of pilots.

Boeing's leadership was acutely aware that any requirement for significant additional training would increase costs for airlines and potentially drive them to choose Airbus instead.

The company selected the CFM International LEAP-1B engines for the 737 Max, which were larger and more fuel-efficient than those on previous 737 models. 

However, this choice introduced significant engineering challenges, particularly related to the aircraft's aerodynamics and balance.

The Maneuvering Characteristics Augmentation System (MCAS) was developed as a solution to these challenges. 

The system was designed to automatically adjust the aircraft's angle of attack in certain conditions to prevent stalling, thereby making the 737 Max handle similarly to older 737 models. This was intended to reassure airlines that their pilots could transition to the new model with minimal additional training. 

As Dennis Muilenburg, Boeing’s CEO at the time, stated, "Our goal with the 737 Max was to offer a seamless transition for our customers, ensuring they could benefit from improved efficiency without significant operational disruptions."

The MCAS would later become central to the 737 Max's tragic failures.

Timeline of Events

2011-2013: Project Inception and Initial Development

The 737 Max project was officially launched in 2011, with Boeing announcing that the aircraft would feature new engines, improved aerodynamics, and advanced avionics. The design and development process was marked by intense pressure to meet tight deadlines and to deliver a product that could quickly enter the market. By 2013, Boeing had completed the design phase, and the first test flights were scheduled for early 2016.

2016-2017: Certification and Commercial Launch

The first test flight of the 737 Max took place in January 2016, and the aircraft performed as expected under controlled conditions. The Federal Aviation Administration (FAA) granted the 737 Max its certification in March 2017, allowing it to enter commercial service. The aircraft was initially well-received by airlines, with thousands of orders placed within the first year of its launch.

October 29, 2018: Lion Air Flight JT610 Crash

Lion Air Flight JT610, a Boeing 737 Max en route from Jakarta to Pangkal Pinang in Indonesia, crashes shortly after takeoff, killing all 189 passengers and crew on board. Questions quickly emerge over previous control problems related to the aircraft’s MCAS. This marks the first major incident involving the 737 Max, and it raises significant concerns about the safety of the aircraft.

March 1, 2019: Boeing’s Share Price Peaks

Boeing’s share price reaches $446, an all-time record, after the company reports $100 billion in annual revenues for the first time. This reflects investor confidence in Boeing’s financial performance, despite the recent Lion Air crash.

March 10, 2019: Ethiopian Airlines Flight ET302 Crash

Ethiopian Airlines Flight ET302, another Boeing 737 Max, crashes shortly after takeoff from Addis Ababa, Ethiopia, killing all 157 people on board. The circumstances of this crash are eerily similar to the Lion Air disaster, with the MCAS system again suspected to be a contributing factor. The crash leads to global scrutiny of the 737 Max’s safety.

March 14, 2019: Global Grounding of the 737 Max

U.S. President Donald Trump announces the grounding of the entire 737 Max fleet in the United States, following the lead of regulators in several other countries. This grounding is unprecedented in its scope, affecting airlines worldwide and marking a significant turning point in the crisis surrounding the 737 Max.

October 29, 2019: Muilenburg Testifies Before Congress

Boeing CEO Dennis Muilenburg is accused of supplying “flying coffins” to airlines during angry questioning by U.S. senators. His testimony is widely criticized, and his handling of the crisis further erodes confidence in Boeing’s leadership.

December 23, 2019: Muilenburg Fired

Boeing fires Dennis Muilenburg, appointing Chairman Dave Calhoun as the new Chief Executive Officer. This leadership change is seen as an attempt to restore confidence in Boeing and address the mounting crisis.

March 6, 2020: U.S. Congressional Report

A U.S. congressional report blames Boeing and regulators for the “tragic and avoidable” 737 Max crashes. The report highlights numerous failures in the design, certification, and regulatory oversight processes, and it calls for significant reforms in the aviation industry.

March 11, 2020: Boeing Borrows $14 Billion

Boeing borrows $14 billion from U.S. banks to navigate the financial strain caused by the grounding of the 737 Max and the emerging COVID-19 pandemic. This loan is later supplemented by another $25 billion in debt, underscoring the financial challenges Boeing faces.

March 18, 2020: Boeing Shares Plummet

Boeing shares hit $89, the lowest since early 2013, reflecting investor concerns about the company’s future amid the 737 Max crisis and the impact of the COVID-19 pandemic on global air travel.

April 29, 2020: Job Cuts Announced

Boeing announces the first wave of job cuts, planning to reduce its workforce by 10% in response to the pandemic-induced drop in air travel. This move is part of broader efforts to cut costs and stabilize the company’s finances.

September 2020: Manufacturing Flaws in the 787 Dreamliner

Manufacturing flaws are discovered in Boeing’s 787 Dreamliner, leading to the grounding of some jets. This adds to Boeing’s mounting challenges and further complicates its efforts to recover from the 737 Max crisis.

November 18, 2020: U.S. Regulator Approves 737 Max for Flight

The U.S. Federal Aviation Administration clears the 737 Max to fly again after Boeing implements the required design and software changes. This marks a significant step in Boeing’s efforts to return the 737 Max to service.

January 8, 2021: Boeing Pays $2.5 Billion Settlement

Boeing agrees to pay $2.5 billion to resolve a criminal charge of misleading federal aviation regulators over the 737 Max. This settlement includes compensation for victims’ families, penalties, and payments to airlines affected by the grounding.

November 11, 2021: Boeing Admits Responsibility

Boeing admits full responsibility for the second Max crash in a legal agreement with victims’ families. This admission marks a significant acknowledgment of the company’s failures in the development and certification of the 737 Max.

What Went Wrong?

Flawed Engineering and Design Decisions

One of the most significant factors contributing to the failure of the 737 Max was the flawed design of the MCAS system. Boeing engineers decided to rely on a single angle-of-attack (AOA) sensor to provide input to the MCAS, despite the known risks of sensor failure. 

Traditionally, critical systems in aircraft design incorporate redundancy to ensure that a single point of failure does not lead to catastrophic consequences. 

Boeing's decision to omit this redundancy was driven by the desire to avoid triggering additional pilot training requirements, which would have undermined the 737 Max's cost advantage.

The placement of the new, larger engines also altered the aircraft's aerodynamic profile, making it more prone to nose-up tendencies during certain flight conditions. 

Instead of addressing this issue through structural changes to the aircraft, Boeing chose to implement the MCAS as a software solution. This decision, while expedient, introduced new risks that were not fully appreciated at the time. 

"We were under immense pressure to deliver the Max on time and under budget, and this led to some compromises that, in hindsight, were catastrophic," admitted a senior Boeing engineer involved in the project

Inadequate Regulatory Oversight

The FAA's role in the 737 Max disaster has been widely criticized. The agency allowed Boeing to conduct much of the certification process itself, including the evaluation of the MCAS system. This arrangement, known as Organization Designation Authorization (ODA), was intended to streamline the certification process, but it also created a conflict of interest. 

Boeing's engineers were under pressure to downplay the significance of the MCAS in order to avoid additional scrutiny from regulators. 

"The relationship between the FAA and Boeing became too cozy, and this eroded the regulatory oversight that is supposed to keep the public safe," said Peter DeFazio, Chairman of the House Transportation and Infrastructure Committee

Corporate Culture and Leadership Failures

At the heart of the 737 Max crisis was a corporate culture that prioritized profitability and market share over safety and transparency. 

Under the leadership of Dennis Muilenburg, Boeing was focused on delivering shareholder value, often at the expense of other considerations. This led to a culture where concerns about safety were dismissed or ignored, and where employees felt pressured to meet unrealistic deadlines. 

Muilenburg's public statements after the crashes, where he repeatedly defended the safety of the 737 Max despite mounting evidence to the contrary, only further eroded trust in Boeing. 

"There was a disconnect between the engineers on the ground and the executives in the boardroom, and this disconnect had tragic consequences," said John Hamilton, Boeing's former chief engineer for commercial airplanes

Communication Failures

Boeing's failure to adequately communicate the existence and functionality of the MCAS system to airlines and pilots was a critical factor in the two crashes. Pilots were not informed about the system or its potential impact on flight dynamics, which left them unprepared to handle a malfunction. 

After the Lion Air crash, Boeing issued a bulletin to airlines outlining procedures for dealing with erroneous MCAS activation, but this was seen as too little, too late. 

"It’s pretty asinine for them to put a system on an airplane and not tell the pilots who are operating it," said Captain Dennis Tajer of the Allied Pilots Association.

Supply Chain and Production Pressures

The aggressive production schedule for the 737 Max also contributed to the project's failure. Boeing's management was determined to deliver the aircraft to customers as quickly as possible to fend off competition from Airbus. 

This led to a "go, go, go" mentality, where deadlines were prioritized over safety considerations. Engineers were pushed to their limits, with some reporting that they were working at double the normal pace to meet production targets. This rush to market meant that there was less time for thorough testing and validation of the MCAS system and other critical components

Moreover, Boeing's decision to keep the 737 Max's design as similar as possible to previous 737 models was driven by the desire to reduce production costs and speed up certification. This decision, however, meant that the aircraft's design was pushed to its limits, resulting in an aircraft that was more prone to instability than previous models. 

"We were trying to do too much with too little, and in the end, it cost lives," said an unnamed Boeing engineer involved in the project

Cost-Cutting Measures

Boeing's relentless focus on cost-cutting also played a significant role in the 737 Max's failure. The company made several decisions that compromised safety in order to keep costs down, such as relying on a single AOA sensor and not including an MCAS indicator light in the cockpit. 

These decisions were made in the name of reducing the cost of the aircraft and avoiding additional pilot training, which would have increased costs for airlines. However, these cost-cutting measures ultimately made the aircraft less safe and contributed to the crashes of Lion Air Flight 610 and Ethiopian Airlines Flight 302.

Organizational Failures

Boeing's organizational structure also contributed to the 737 Max's failure. The company's decision to move its headquarters from Seattle to Chicago in 2001 created a physical and cultural distance between the company's leadership and its engineers. 

This move, coupled with the increasing focus on financial performance over engineering excellence, led to a breakdown in communication and decision-making within the company. Engineers felt that their concerns were not being heard by management, and decisions were made without a full understanding of the technical challenges involved. 

"There was a sense that the leadership was more focused on the stock price than on building safe airplanes," said a former Boeing engineer

How Could Boeing Have Done Things Differently?

Prioritizing Safety Over Speed

One of the most significant ways Boeing could have avoided the 737 Max disaster was by prioritizing safety over speed. The company was under intense pressure to deliver the aircraft quickly to compete with Airbus, but this focus on speed led to critical safety oversights. 

By taking more time to thoroughly test and validate the MCAS system and other components, Boeing could have identified and addressed the issues that ultimately led to the crashes. 

"In hindsight, we should have taken more time to get it right, rather than rushing to meet deadlines," said Greg Smith, Boeing's Chief Financial Officer at the time

Incorporating Redundancy in Critical Systems

Another key change Boeing could have made was to incorporate redundancy in critical systems like the MCAS. Aviation safety protocols typically require multiple layers of redundancy to ensure that a single point of failure does not lead to catastrophe. 

By relying on a single AOA sensor, Boeing violated this principle and left the aircraft vulnerable to sensor malfunctions. Including a second AOA sensor and ensuring that both sensors had to agree before the MCAS system activated could have prevented the erroneous activation of the system that caused the crashes. 

"Redundancy is a fundamental principle of aviation safety, and it's one that we should have adhered to in the design of the 737 Max," said John Hamilton, Boeing's former chief engineer for commercial airplanes

Improving Communication and Transparency

Boeing could have also improved its communication and transparency with both regulators and airlines. The company's decision to downplay the significance of the MCAS system and not include it in the aircraft's flight manuals left pilots unprepared to deal with its activation. 

By fully disclosing the system's capabilities and risks to the FAA and airlines, Boeing could have ensured that pilots were adequately trained to handle the system in the event of a malfunction. 

"Transparency is key to building trust, and we failed in that regard with the 737 Max," said Dennis Muilenburg, Boeing's CEO at the time

Strengthening Regulatory Oversight

The FAA's delegation of much of the certification process to Boeing created a conflict of interest that contributed to the 737 Max's failure. By strengthening regulatory oversight and ensuring that the FAA maintained its independence in the certification process, the agency could have identified the risks associated with the MCAS system and required Boeing to address them before the aircraft entered service. 

This would have provided an additional layer of scrutiny and ensured that safety was prioritized over speed and cost. 

"The FAA's role is to be the independent watchdog of aviation safety, and we need to ensure that it has the resources and authority to fulfill that role effectively," said Peter DeFazio, Chairman of the House Transportation and Infrastructure Committee

Fostering a Safety-First Corporate Culture

Finally, Boeing could have fostered a corporate culture that prioritized safety over profitability. The company's increasing focus on financial performance and shareholder value led to a culture where safety concerns were often dismissed or ignored. 

By emphasizing the importance of safety in its corporate values and decision-making processes, Boeing could have created an environment where engineers felt empowered to raise concerns and where those concerns were taken seriously by management. 

"Safety needs to be the top priority in everything we do, and we lost sight of that with the 737 Max," said David Calhoun, who succeeded Dennis Muilenburg as Boeing's CEO in 2020

Closing Thoughts

The Boeing 737 Max disaster is a stark reminder of the consequences of prioritizing speed and cost over safety in the aviation industry. The two crashes that claimed the lives of 346 people were not the result of a single failure but rather a series of systemic issues, including flawed engineering decisions, inadequate regulatory oversight, and a corporate culture that valued profitability over safety. 

These failures have had far-reaching consequences for Boeing, resulting in billions of dollars in losses, a damaged reputation, and a loss of trust among airlines, regulators, and the flying public.

Moving forward, it is crucial that both Boeing and the wider aviation industry learn from these mistakes. 

This means prioritizing safety above all else, ensuring that critical systems are designed with redundancy, and maintaining transparency and communication with regulators and customers. 

It also means fostering a corporate culture that values safety and empowers employees to speak up when they see potential risks.  

If I look at the "accidents" that happened to Boeing employees that have spoken up it seems to be the opposite...

Don’t let your project fail like this one!

Discover here how I can help you turn it into a success.

For a list of all my project failure case studies just click here.

Sources

1) Cannon-Patron, S., Gourdet, S., Haneen, F., Medina, C., & Thompson, S. (2021). A Case Study of Management Shortcomings: Lessons from the B737-Max Aviation Accidents.

2) Larcker, D. F., & Tayan, B. (2024). Boeing 737 MAX: Organizational Failures and Competitive Pressures. Stanford Graduate School of Business.

3) Boeing Co. (2019). Investigation Report: The Design and Certification of the Boeing 737 Max.

4) FAA. (2023). Examining Risk Management Failures: The Case of the Boeing 737 MAX Program.

5) Enders, T. (2024). Airbus Approach to Safety and Innovation: A Response to the Boeing 737 MAX.

6) Muilenburg, D. (2019). Boeing’s Commitment to Safety: A Public Statement.

7) Gates, D., & Baker, M. (2019). The Inside Story of MCAS: How Boeing’s 737 MAX System Gained Power and Lost Safeguards. The Seattle Times.

8) Tajer, D. (2019). Statement on MCAS and Pilot Awareness. Allied Pilots Association.

Tuesday, July 02, 2024

Case Study 18: How Excel Errors and Risk Oversights Cost JP Morgan $6 Billion

In the spring of 2012, JP Morgan Chase & Co. faced one of the most significant financial debacles in recent history, known as the "London Whale" incident. The debacle resulted in losses amounting to approximately $6 billion, fundamentally shaking the confidence in the bank's risk management practices. 

At the core of this catastrophe was the failure of the Synthetic Credit Portfolio Value at Risk (VaR) Model, a sophisticated financial tool intended to manage the risk associated with the bank's trading strategies. 

The failure of the VaR model not only had severe financial repercussions but also led to intense scrutiny from regulators and the public. It highlighted the vulnerabilities within JP Morgan's risk management framework and underscored the potential dangers of relying heavily on quantitative models without adequate oversight. 

This case study explores the intricacies of what went wrong and how such failures can be prevented in the future. By analyzing this incident, I seek to understand the systemic issues that contributed to the failure and to identify strategies that can mitigate similar risks in other financial institutions. The insights gleaned from this case are not just relevant to JP Morgan but to the broader financial industry, which increasingly depends on complex models to manage risk.

Background

The Synthetic Credit Portfolio (SCP) at JP Morgan was a part of the bank's Chief Investment Office (CIO), which managed the company's excess deposits through various investments, including credit derivatives. The SCP was specifically designed to hedge against credit risk by trading credit default swaps and other credit derivatives. The portfolio aimed to offset potential losses from the bank's other exposures, thereby stabilizing overall performance.

In 2011, JP Morgan developed the Synthetic Credit VaR Model to quantify and manage the risk associated with the SCP. The model was intended to provide a comprehensive measure of the potential losses the bank could face under various market conditions. This would enable the bank to make informed decisions about its trading strategies and risk exposures. The VaR model was implemented using a series of Excel spreadsheets, which were manually updated and managed.

Despite the sophistication of the model, its development was plagued by several critical issues. The model's architect lacked prior experience in developing VaR models, and the resources allocated to the project were inadequate. This led to a reliance on manual processes, increasing the risk of errors and inaccuracies. Furthermore, the model's implementation and monitoring were insufficiently rigorous, contributing to the eventual failure that led to massive financial losses.

The primary objective of JP Morgan's Synthetic Credit VaR Model was to provide an accurate and reliable measure of the risk associated with the bank's credit derivatives portfolio. This would enable the bank to manage its risk exposures effectively, ensuring that its trading strategies remained within acceptable limits. The model aimed to capture the potential losses under various market conditions, allowing the bank to make informed decisions about its investments.
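To make concrete what a VaR model computes: at a 95% confidence level, it reports the loss threshold that should be exceeded on only about one trading day in twenty. A minimal historical-simulation sketch is shown below; it is illustrative only, since the CIO's actual model was a far more elaborate, spreadsheet-based implementation, and the P&L figures are invented.

```python
# Minimal historical-simulation VaR sketch. Illustrative only: JP Morgan's
# actual model was far more elaborate, and the daily P&L figures are invented.

def historical_var(pnl_history: list[float], confidence: float = 0.95) -> float:
    """Loss threshold that past daily P&L exceeded on ~(1 - confidence) of days."""
    losses = sorted(-p for p in pnl_history)  # losses as positive numbers, ascending
    index = min(int(round(confidence * len(losses))), len(losses) - 1)
    return losses[index]

# Twenty days of toy daily P&L: mostly small moves, a few sizeable losses.
pnl = [1.2, -0.5, 0.8, -2.0, 0.3, -0.1, 1.0, -3.5, 0.6, -0.9,
       0.2, -0.4, 1.5, -1.1, 0.7, -0.2, 0.9, -0.6, 0.4, -1.8]
var_95 = historical_var(pnl)  # 95% VaR implied by this history
```

The reported figure is simply the loss level that, historically, only the worst few percent of days reached; everything downstream (limits, capital, escalation) depends on that number being computed correctly.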

In addition to the primary objective, the Synthetic Credit VaR Model was expected to provide a foundation for further advancements in the bank's risk management practices. By leveraging the insights gained from the model, JP Morgan hoped to develop more sophisticated tools and techniques for managing risk. This would enable the bank to stay ahead of emerging threats and maintain a competitive edge in the financial industry.

Don’t let your project fail like this one!

Discover here how I can help you turn it into a success.

For a list of all my project failure case studies just click here.

Timeline of Events

Early 2011: Development of the Synthetic Credit VaR Model begins. The project is led by an individual with limited experience in developing VaR models. The model is built using Excel spreadsheets, which are manually updated and managed.

September 2011: The Synthetic Credit VaR Model is completed and implemented within the CIO. The model is intended to provide a comprehensive measure of the potential losses the bank could face under various market conditions.

January 2012: Increased trading activity in the SCP causes the CIO to exceed its stress loss risk limits. This breach continues for seven weeks. The bank informs the OCC of the ongoing breach, but no additional details are provided, and the matter is dropped.

March 23, 2012: Ina Drew, head of the CIO, orders a halt to SCP trading due to mounting concerns about the portfolio's risk exposure.

April 6, 2012: Bloomberg and the Wall Street Journal publish reports on the London Whale, revealing massive positions in credit derivatives held by Bruno Iksil and his team.

April 9, 2012: Thomas Curry becomes the 30th Comptroller of the Currency. Instead of planning for the upcoming 150th anniversary of the Office of the Comptroller of the Currency (OCC), Mr. Curry is confronted with the outbreak of news reports about the London Whale incident.

April 16, 2012: JP Morgan provides regulators with a presentation on SCP. The presentation states that the objective of the "Core Credit Book" since its inception in 2007 was to protect against a significant downturn in credit. However, internal reports indicate growing losses in the SCP.

May 4, 2012: JP Morgan reports SCP losses of $1.6 billion for the second quarter. The losses continue to grow rapidly even though active trading has stopped.

December 31, 2012: Total SCP losses reach $6.2 billion, marking one of the most significant financial debacles in the bank's history.

January 2013: The OCC issues a Cease and Desist Order against JP Morgan, directing the bank to correct deficiencies in its derivatives trading activity. The Federal Reserve issues a related Cease and Desist Order against JP Morgan's holding company.

September - October 2013: JP Morgan settles with regulators, paying $1.020 billion in penalties. The OCC levies a $300 million fine for inadequate oversight and governance, insufficient risk management processes, and other deficiencies.

What Went Wrong?

Model Development and Implementation Failures

The development of JP Morgan's Synthetic Credit VaR Model was marred by several critical issues. The model was built using Excel spreadsheets, which involved manual data entry and copying and pasting of data. This approach introduced significant potential for errors and inaccuracies. As noted in JP Morgan's internal report, "the spreadsheets ‘had to be completed manually, by a process of copying and pasting data from one spreadsheet to another’". This manual process was inherently risky, as even a minor error in data entry or formula could lead to significant discrepancies in the model's output.

Furthermore, the individual responsible for developing the model lacked prior experience in creating VaR models. This lack of expertise, combined with inadequate resources, resulted in a model that was not robust enough to handle the complexities of the bank's trading strategies. The internal report highlighted this issue: "The individual who was responsible for the model’s development had not previously developed or implemented a VaR model, and was also not provided sufficient support". This lack of support and expertise significantly compromised the quality and reliability of the model.
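The internal report in fact documented a formula error of exactly this kind: a spreadsheet that divided a rate change by the sum of the old and new rates rather than by their average, roughly halving the volatility the model measured. The toy reproduction below (invented numbers, deliberately simplified formulas) shows how small the slip is and how large its effect:

```python
# Toy reproduction of the kind of spreadsheet formula slip the internal report
# described: dividing a rate change by the SUM of the old and new rates instead
# of their AVERAGE halves the measured change. Numbers are invented.

def rate_change_correct(old: float, new: float) -> float:
    return (new - old) / ((old + new) / 2)  # intended: divide by the average

def rate_change_buggy(old: float, new: float) -> float:
    return (new - old) / (old + new)        # the slip: divide by the sum

old_rate, new_rate = 100.0, 110.0
correct = rate_change_correct(old_rate, new_rate)  # ~0.0952
buggy = rate_change_buggy(old_rate, new_rate)      # ~0.0476, half the size
```

A one-character difference in a formula, propagated by copy-and-paste across sheets, is enough to systematically understate risk, which is precisely why manual spreadsheet pipelines are so dangerous for models of this importance.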

Insufficient Testing and Monitoring

The Model Review Group (MRG) did not conduct thorough testing of the new model. They relied on limited back-testing and did not compare results with the existing model. This lack of rigorous testing meant that potential issues and discrepancies were not identified and addressed before the model was implemented. The internal report criticized this approach: "The Model Review Group’s review of the new model was not as rigorous as it should have been". Without comprehensive testing, the model was not validated adequately, leading to unreliable risk assessments.
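One simple form of back-test the Model Review Group could have run is an exception count: compare the number of days on which realized losses exceeded the reported VaR against the number the confidence level implies. The sketch below uses invented data and names:

```python
# Sketch of a basic VaR exception-count back-test: count the days on which the
# realized loss exceeded that day's reported VaR, and compare with what the
# confidence level implies. Data and the flat VaR series are invented.

def backtest_exceptions(daily_pnl: list[float], daily_var: list[float]) -> int:
    """Number of days the realized loss exceeded that day's VaR estimate."""
    return sum(1 for pnl, var in zip(daily_pnl, daily_var) if -pnl > var)

def expected_exceptions(n_days: int, confidence: float = 0.95) -> float:
    """Exceptions a correctly calibrated model would produce on average."""
    return n_days * (1.0 - confidence)

pnl = [-1.2, 0.4, -3.0, 0.9, -0.2, -2.6, 1.1, -0.3, 0.5, -2.9]
var = [2.5] * 10                          # model reports a flat 2.5 VaR daily
breaches = backtest_exceptions(pnl, var)  # losses of 3.0, 2.6 and 2.9 breach it
```

Far more breaches than expected signals an understated VaR; far fewer signals an overstated one. Either way, the comparison flags a miscalibrated model before it is trusted in production.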

Moreover, the monitoring and oversight of the model's implementation were insufficient. The CIO risk management team played a passive role in the model's development, approval, implementation, and monitoring. They viewed themselves more as consumers of the model rather than as responsible for its development and operation. This passive approach resulted in inadequate quality control and frequent formula and code changes in the spreadsheets. The internal report noted, "Data were uploaded manually without sufficient quality control. Spreadsheet-based calculations were conducted with insufficient controls and frequent formula and code changes were made". This lack of oversight and quality control further compromised the reliability of the model.

Regulatory Oversight Failures

Regulatory oversight was inadequate throughout the development and implementation of the Synthetic Credit VaR Model. The OCC, JP Morgan's primary regulator, did not request critical performance data and failed to act on risk limit breaches. As highlighted in the Journal of Financial Crises, "JPM did not provide the OCC with required monthly reports... yet the OCC did not request the missing data". This lack of proactive oversight allowed significant issues to go unnoticed and unaddressed.

Additionally, the OCC was informed of risk limit breaches but did not investigate the causes or implications of these breaches. For instance, the OCC was contemporaneously notified in January 2012 that the CIO exceeded its Value at Risk (VaR) limit and the higher bank-wide VaR limit for four consecutive days. However, the OCC did not investigate why the breach happened or inquire why a new model would cause such a large reduction in VaR. This failure to follow up on critical risk indicators exemplified the shortcomings in regulatory oversight.

How Could JP Morgan Have Done Things Differently?

Improved Model Development Processes

One of the primary ways JP Morgan could have avoided the failure of the Synthetic Credit VaR Model was by improving the model development processes. Implementing automated systems for data management could have significantly reduced the risk of human error and improved accuracy. Manual data entry and copying and pasting of data in Excel spreadsheets were inherently risky practices. By automating these processes, the bank could have ensured more reliable and consistent data management.
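As a sketch of what "automated systems for data management" can mean in practice, the example below validates input rows before they reach a model, the kind of check a copy-and-paste workflow silently skips. Field names and limits are invented for illustration:

```python
# Sketch of automated input validation replacing manual copy-and-paste.
# Field names and limits are invented for illustration.

def validate_row(row: dict) -> list[str]:
    """Return a list of problems with one input row; an empty list means it passes."""
    problems = []
    # Required fields must be present and non-null.
    for field in ("trade_id", "notional", "spread_bps"):
        if field not in row or row[field] is None:
            problems.append(f"missing {field}")
    # Basic sanity limits catch fat-finger and paste errors before the model runs.
    if isinstance(row.get("notional"), (int, float)) and row["notional"] < 0:
        problems.append("negative notional")
    if isinstance(row.get("spread_bps"), (int, float)) and not (0 <= row["spread_bps"] <= 10_000):
        problems.append("spread out of range")
    return problems

good = {"trade_id": "T1", "notional": 5_000_000, "spread_bps": 120}
bad = {"trade_id": "T2", "notional": -1, "spread_bps": 99_999}
```

Checks like these run identically on every load, leave an audit trail, and refuse bad inputs instead of quietly propagating them into a risk number.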

Moreover, allocating experienced personnel and adequate resources for model development and testing would have ensured more robust results. The individual responsible for developing the model lacked prior experience in VaR models, and the resources allocated to the project were inadequate. By involving experts in the field and providing sufficient support, the bank could have developed a more sophisticated and reliable model. As highlighted in the internal report, "Inadequate resources were dedicated to the development of the model".

Conducting extensive back-testing and validation against existing models could have identified potential discrepancies and flaws. The Model Review Group did not conduct thorough testing of the new model, relying on limited back-testing. By implementing a more rigorous testing process, the bank could have validated the model's accuracy and reliability before its implementation.

Enhanced Oversight and Governance

Enhanced oversight and governance could have prevented the failure of the Synthetic Credit VaR Model. Ensuring regular, detailed reporting to regulators and internal oversight bodies would have maintained transparency and accountability. JP Morgan failed to provide the OCC with required monthly reports, and the OCC did not request the missing data. By establishing regular reporting protocols and ensuring compliance, the bank could have maintained better oversight of the model's performance.

Addressing risk limit breaches promptly and thoroughly would have mitigated escalating risks. The OCC was informed of risk limit breaches but did not investigate the causes or implications of these breaches. By taking immediate action to address and rectify risk limit breaches, the bank could have prevented further escalation of risks. Proactive risk management is crucial in identifying and mitigating potential issues before they lead to significant losses.

Implementing continuous monitoring and review processes for all models and strategies could have identified issues before they led to significant losses. The CIO risk management team played a passive role in the model's development, approval, implementation, and monitoring. By adopting a more proactive approach to monitoring and reviewing the model, the bank could have ensured that potential issues were identified and addressed promptly. Continuous monitoring and review processes are essential in maintaining the accuracy and reliability of risk management models.
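The breach-handling failures described above suggest a simple mechanical safeguard: escalate any limit breach that persists beyond a tolerated number of consecutive days, rather than letting it run for weeks. A minimal sketch, with illustrative thresholds:

```python
# Sketch of an escalation rule for risk-limit breaches: a breach persisting
# beyond a tolerated number of consecutive days is escalated rather than
# quietly carried. Thresholds and figures are illustrative.

def consecutive_breaches(daily_exposure: list[float], limit: float) -> int:
    """Length of the current run of days over the limit, ending today."""
    run = 0
    for exposure in reversed(daily_exposure):
        if exposure > limit:
            run += 1
        else:
            break
    return run

def should_escalate(daily_exposure: list[float], limit: float,
                    max_tolerated_days: int = 3) -> bool:
    return consecutive_breaches(daily_exposure, limit) > max_tolerated_days

exposures = [90, 95, 101, 104, 103, 107]  # the limit of 100 is breached on the last 4 days
```

A rule this mechanical would have forced the seven-week stress-limit breach, and the four consecutive days over the VaR limit, onto someone's desk long before the losses compounded.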

Comprehensive Risk Management Framework

Developing a comprehensive risk management framework could have further strengthened JP Morgan's ability to manage risks effectively. This framework should have included clear policies and procedures for model development, implementation, and monitoring. By establishing a robust risk management framework, the bank could have ensured that all aspects of the model's lifecycle were adequately managed.

Additionally, enhancing collaboration and communication between different teams involved in risk management could have improved the model's reliability. The CIO risk management team viewed themselves more as consumers of the model rather than as responsible for its development and operation. By fostering collaboration and communication between different teams, the bank could have ensured that all stakeholders were actively involved in the model's development and monitoring.

Closing Thoughts

The failure of JP Morgan's Synthetic Credit VaR Model underscores the critical importance of rigorous development, testing, and oversight in financial risk management. This incident serves as a cautionary tale for financial institutions relying on complex models and emphasizes the need for robust governance and proactive risk management strategies. By learning from this failure, financial institutions can develop more reliable and effective risk management frameworks.

The insights gleaned from this case study are not just relevant to JP Morgan but to the broader financial industry, which increasingly depends on complex models to manage risk. By addressing the systemic issues that contributed to the failure and implementing the strategies outlined in this case study, financial institutions can mitigate similar risks in the future.

In conclusion, the London Whale incident highlights the vulnerabilities within JP Morgan's risk management framework and underscores the potential dangers of relying heavily on quantitative models without adequate oversight. By enhancing model development processes, improving oversight and governance, and developing a comprehensive risk management framework, financial institutions can ensure more reliable and effective risk management practices.

Don’t let your project fail like this one!

Discover here how I can help you turn it into a success.

For a list of all my project failure case studies just click here.

Sources

1) Internal Report of JPMorgan Chase & Co. Management Task Force Regarding 2012 CIO Losses, January 16, 2013

2) A whale in shallow waters: JPMorgan Chase, the “London Whale” and the organisational catastrophe of 2012, François Valérian, November 2017

3) JPMorgan Chase London Whale E: Supervisory Oversight, Arwin G. Zeissler and Andrew Metrick, Journal of Financial Crises, 2019

4) JPMorgan Chase London Whale C: Risk Limits, Metrics, and Models, Arwin G. Zeissler and Andrew Metrick, Journal of Financial Crises, 2019

5) JPMorgan Chase Whale Trades: A Case History of Derivatives Risks and Abuses, Permanent Subcommittee on Investigations United States Senate, 2013

Sunday, March 19, 2023

Case Study 17: The Disastrous Launch of Healthcare.gov

Barack Obama was inaugurated on January 20, 2009, after defeating his opponent John McCain by 365 Electoral College votes to 173. One of Obama's primary campaign issues was fixing America's healthcare system by providing affordable options to the 43.8 million uninsured Americans. 

In 2010, the year Obama signed the Affordable Care Act (ACA), the United States spent 17.6% of its GDP on health care, nearly double the OECD average of 9.5%, with the next closest developed nation, the Netherlands, spending 12%.

The 44th president was successful in introducing the ACA; however, the launch of the website that would connect Americans to the marketplace, Healthcare.gov, was a failure. While the platform would eventually enroll an estimated 10 million uninsured Americans in 2014, the rollout was a complete disaster that exposed the challenges the United States government faces in implementing technology.

According to a 2008 report by the Government Accountability Office (GAO), 48% of federal IT projects had to be restructured because of cost overages or changes in project goals. In addition, over half had to be restarted two or more times.

On the first day Healthcare.gov was launched, four million unique users visited the portal, but only six successfully registered. Over the next few days, the site experienced eight million visitors, but according to estimates, around 1% enrolled in a new healthcare plan. Even the users that did sign up experienced errors, including duplicates in enrollment applications submitted to insurers.

The troubled launch of Healthcare.gov illustrates a recurring problem with US government tech projects. Standish Group International Chairman Jim Johnson is on record praising the rollout, given the government's history of software projects failing by default: "Anyone who has written a line of code or built a system from the ground up cannot be surprised or even mildly concerned that Healthcare.gov did not work out of the gate. The real news would have been if it actually did work. The very fact that most of it did work at all is a success in itself."

However, there's far more to the failed launch of the federally facilitated marketplace (FFM). The agency responsible for the project, the Centers for Medicare and Medicaid Services (CMS), didn't follow many regulations in place to ensure transparency, proper oversight, and accountability. So was the project destined to fail from the start due to overwhelming layers of bureaucracy, or were the vendors tasked with developing the online marketplace to blame?

In this case study, we'll examine why Healthcare.gov failed to meet expectations and point out what CMS could have done differently to deliver a functioning FFM.

Don’t let your project fail like this one!

Discover here how I can help you turn it into a success.

For a list of all my project failure case studies just click here.

Timeline of Events

2010

On March 23, 2010, President Barack Obama signed the ACA, also known as Obamacare, into law. The legislation was the most comprehensive reform of the US medical system in 50 years and is still in place today.

Under the ACA, US citizens were required to have health insurance or pay a monthly fee. The law also required the establishment of online marketplaces that would allow individuals to compare and select health insurance policies by January 1, 2014. States could set up their own marketplace or use the FFM.

Each marketplace created under the ACA was intended to provide a seamless, single point of access for individuals to enroll in qualified healthcare plans and access income-based financial subsidies created under the new law.

Users were required to visit Healthcare.gov, register, verify their identity, determine their eligibility for subsidies, and enroll in a plan. The process appears straightforward; President Obama even touted the marketplace weeks before its launch by saying, "Now, this is real simple. It's a website where you can compare and purchase affordable health insurance plans, side by side, the same way you shop for a plane ticket on Kayak… the same way you shop for a TV on Amazon."

However, building an identity verification platform on such a large scale is exceptionally challenging on its own. The marketplace also required integration with databases from other government agencies. Once the user was successfully verified as an American citizen, their income was determined, and they were filtered through state and federal government programs like Medicaid or the State Children's Health Insurance Program; only then would they be matched with private health insurance plans.

The process was not simple and was far more complex than online shopping because it required integration with identification verification software, other government databases, and health insurance providers.

From day one, the project was underestimated. In addition, the ACA's requirement that all citizens enroll by January 1, 2014 or pay a fine created a hard deadline with economic and political consequences.

March 2010 - September 2011

Over a year passed between the ACA becoming a law and the CMS signing contracts with vendors who would build the FFM. During this period, problems directly affecting the launch were already beginning.

While the CMS was in charge of oversight of the project and hiring contractors, leadership was fractured across multiple government agencies. The project was headed by the CMS's Deputy CIO, Henry Chao, but the committee also included:

Todd Park – White House CTO

Jeanne Lambrew – Executive Office of Health Reform

Kathleen Sebelius and Bryan Sivak – Department of Health and Human Services

Members of the committee outside of the CMS exercised a great deal of power and influence over the project; however, no one at the various agencies had visibility into the critical milestones each group needed to reach to complete the project successfully.

The CMS awarded 60 contracts to 33 different vendors. The largest was granted to Conseillers en Gestion et Informatique (CGI), a Montreal-based IT company that employed more than 70,000 people. CGI grew to be worth more than $11 billion by 2013 by acquiring other companies, some of which handled US government contracts, such as American Management Systems in 2004 and Stanley in 2010.

CGI’s contract consisted of developing the FFM and was valued at over $90 million. CGI was responsible for the most significant, user-facing aspect of the project but was not officially assigned a lead integrator role. CMS would later report that it perceived CGI to be the project's lead integrator but had no written documentation outlining the agreement.

Representatives from CGI stated they did not have the same understanding at that point in the project.

US federal agencies are required to follow the procurement framework outlined by the Federal Acquisition Regulation (FAR) when awarding government contracts to private companies. However, CMS failed to satisfy specific aspects of FAR, including:

> Preparing a written procurement strategy documenting the factors, assumptions, risks, and tradeoffs that would guide project decisions.

> Conducting thorough past performance reviews of potential contractors.

> Consulting the federal government's contractor database (PPIRS) when evaluating bids; CMS did so on only four of the six key contracts.

CMS leadership later claimed they were unaware that a procurement strategy was required. 

December 2011 – Summer 2013

Development of the Healthcare.gov project began in December 2011 and consisted of four major components:

> Front-end website for the FFM (the marketplace)

> Back-end data services hub

> Enterprise identity management sub-system

> Hosting infrastructure

CGI was responsible for the front-end website. The UI was developed with standard web tools, including Bootstrap, CSS, jQuery, Jekyll, Prose.io, and Ruby; however, it would later be revealed that common optimizations to aggregate and minify CSS and JS files were not applied.
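The missing optimization is easy to illustrate. The sketch below is a minimal, hypothetical example, not CGI's actual build tooling, of aggregating and minifying stylesheets so the browser makes one small request instead of dozens of large ones:

```python
import re

def minify_css(css: str) -> str:
    """Tiny illustrative CSS minifier: strips comments, collapses whitespace."""
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.S)   # remove /* comments */
    css = re.sub(r"\s+", " ", css)                     # collapse runs of whitespace
    css = re.sub(r"\s*([{};:,])\s*", r"\1", css)       # tighten around punctuation
    return css.strip()

def bundle(files: dict) -> str:
    """Concatenate several stylesheets into one minified asset."""
    return "".join(minify_css(src) for src in files.values())

styles = {
    "reset.css": "body { margin: 0;  padding: 0; }",
    "theme.css": "/* brand colors */ h1 { color: #046b99; }",
}
print(bundle(styles))  # one compact payload instead of two round trips
```

Production sites would reach for a proper build pipeline rather than hand-rolled regexes, but even this level of aggregation addresses the problem AppDynamics flagged: too many uncompressed files fetched per page.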

The back-end data services hub was developed by CGI and Quality Software Services, Inc (QSSI) using Java, JBoss, and the NoSQL MarkLogic database. The hub was responsible for orchestrating data and services from multiple external sources such as agent brokers, insurers, CMS, DHS, Experian, the IRS, state insurance exchanges, and the US Social Security Administration (SSA). While the integration was incredibly complex, utilizing data from multiple sources, the developers chose machine-generated middleware objects to save time.
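To make the orchestration problem concrete, here is a minimal sketch of the fan-out-and-merge pattern such a hub needs. The function names and return values are illustrative stand-ins, not the real agency interfaces:

```python
from concurrent.futures import ThreadPoolExecutor

# Stub lookups standing in for external sources the hub had to query.
# All names and payloads here are illustrative assumptions.
def check_ssa(applicant):      # citizenship / SSN match
    return {"ssn_valid": True}

def check_irs(applicant):      # reported income for subsidy eligibility
    return {"income": 42_000}

def check_dhs(applicant):      # immigration status
    return {"lawfully_present": True}

def verify_applicant(applicant):
    """Fan the queries out in parallel and merge the answers, so total
    latency is bounded by the slowest source, not the sum of all of them."""
    sources = [check_ssa, check_irs, check_dhs]
    merged = {}
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        for result in pool.map(lambda fetch: fetch(applicant), sources):
            merged.update(result)
    return merged

print(verify_applicant({"name": "Jane Doe"}))
```

Even this toy version hints at the real difficulty: every source has its own failure modes, latencies, and data quality, which is exactly where machine-generated middleware shortcuts tend to break down.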

Enterprise Identity Management (EIDM) was handled by QSSI but depended on the back-end to retrieve data from multiple sources. Before the launch, the EIDM was tested with an expected load of only 2,000 concurrent users.

The final major system component was the hardware infrastructure hosting the website, FFM, data services hub, and EIDM. Akamai's CDN hosted Healthcare.gov's UI. The original back-end (it would be replaced after the failed launch) consisted of 48 VMWare virtual machine nodes running Red Hat Enterprise Linux, hosted on twelve physical servers in a Terremark data center. Some of the servers ran vSphere v4.1, others v5.1; the network was also running at 1 Gb/sec, far below its 4 Gb/sec capacity.

Failures across all critical components of the project make clear that CMS didn't have the personnel or experience to handle an IT project of this magnitude.

Late 2013

With only a couple of months left before the scheduled launch, CMS raised concerns about CGI's performance but didn't take steps to hold the contractor accountable. In September 2013, CMS staff moved into CGI's offices to provide on-site support.

CMS delayed governance reviews that would have exposed the issues and did not receive the required approvals to move forward. Nevertheless, CMS decided to proceed with an on-schedule launch of all the platform's features, available to every American citizen needing affordable healthcare. The estimated project cost at this point was $500 million.

On October 1, 2013, the Healthcare.gov website went online. Most visitors experienced crashes, delays, errors, and slow performance throughout the week. By the weekend, the decision was made to take the site down because it was practically unusable.

Later in the month, HHS announced the following changes to the project:

Project management was centralized under Jeffrey Zients, a former OMB director with a reputation within the White House for solving tough problems and managing teams.

Todd Park, White House CTO, reorganized the technology leadership team, demoted some underperforming CMS employees and 3rd party contractors, and recruited top talent from Silicon Valley for a government sabbatical to save the site.

A Tiger team was formed with the narrow mandate of getting the FFM working properly.

The new team scrummed daily, triaged existential risks, and prioritized defects based on urgency. Over the next six weeks, the tiger team resolved 400 system defects, increased system concurrency to 25,000 users, and brought the site's response time down to around one second. The site went back online, and enrollment jumped from 26,000 in October to 975,000 in December.

By Christmas, most problems had been fixed, but the site was still not fully operational.

2014

CGI was replaced by Accenture as the lead contractor, which was awarded a contract valued at roughly $90 million to take over the FFM.

The individual mandate requirement was pushed back to March 31, 2014, giving uninsured Americans more time to sign up without being penalized.  

In July, the GAO released a detailed report outlining the critical failures of the project.

According to the GAO's findings, FFM obligations increased from $56 million to more than $209 million, and data hub obligations increased from $30 million to nearly $85 million, between September 2011 and February 2014. The report recommended that "CMS take immediate actions to assess increasing contract costs and ensure that acquisition strategies are completed, and oversight tools are used as required, among other actions." CMS concurred with four recommendations and partially concurred with one.

In August, the Office of Inspector General released a report finding that the total cost of the Healthcare.gov website had reached $1.7 billion. A month later, Bloomberg News reported the cost exceeded $2 billion.

In November, open enrollment for 2015 began on Healthcare.gov.

What Went Wrong?

A multitude of issues caused the failed launch of the Healthcare.gov website. While it is a clear example of the federal government's continuous struggle to implement functioning, secure software, the problems go beyond Washington's bureaucracy. Nevertheless, much can be learned about releasing digital solutions and managing large-scale projects with many moving parts in general.

Overconfidence

The project started with overconfidence and unrealistic expectations set by the White House. Obama's campaign staff had a reputation for being technologically savvy because they pioneered using social media and data mining in the 2008 presidential election. However, running a social media campaign and releasing a single point of contact that pulls from multiple government agencies and insurance companies aren't comparable.

Underestimated Scale of the Project

Due to overconfidence and unrealistic expectations, the project scale was drastically underestimated, resulting in mismanagement in organizational structure, leadership, accountability, and transparency.

As the deadline approached, the project scope grew, while CMS identified 45 critical and 324 severe code defects across FFM modules.

Politics

Launching a large-scale software project is extremely challenging, and a hostile political climate made the rollout even more difficult. Not only did CMS lack the personnel and experience to handle an FFM, but it also faced pressure and influence from outside the agency. One of the most significant examples came in August 2013, when the White House and the Executive Office of Health Reform insisted on requiring users to register on the site before shopping for insurance, so that concrete user numbers could be shown as proof of the system's success.

Members of the project committee who weren't from CMS exerted a great deal of influence over critical decision-making while not having access to accurate progress reports. In addition, the 2012 presidential election likely contributed to delays. Polarizing decisions, such as final rules on private insurance premiums, coverage availability, risk pools, and catastrophic plans, were put off until after the election cycle. These rules had to be translated into software and tested before a successful rollout was possible.

Lack of technology understanding and experience at CMS

The CMS was not prepared to handle a technology project at this scale. Other government agencies, such as the DOD and NASA, had decades of experience navigating the institutional challenges required to develop, deliver, and operate reliable IT systems.

Throughout procurement, development, and launch, CMS made it clear that its personnel didn't understand the requirements necessary to oversee a large-scale technology project, let alone one that had additional regulatory hurdles set by government agencies.

Poor Project Management

The project committee was spread across various government agencies, including the CMS, the White House, the Office of Health Reform, and the Department of Health and Human Services. As a result, there wasn't even the organizational structure that is standard on small-scale projects. Fractured leadership contributed to the lack of project management, but CMS was primarily responsible. While the problem could have been lessened had CMS followed the FAR guidelines, the agency's operators pleaded ignorance in the GAO report, strengthening the case that CMS didn't have personnel suitable for the project.

Failed to Postpone Launch

All the problems accumulated, resulting in a failed launch. Typically, the release is delayed when a website or software isn't ready. The engineers perform more testing, fix the problems, and launch at a later time. Healthcare.gov was a unique project with consequences transcending a poor user experience. The time constraints pressured CMS to go forward with the launch rather than being transparent and communicating that the infrastructure couldn't handle millions of users.

Leading up to the launch, CMS was handed more business rules and a broader scope. While the agency had no control over the pressure coming from the White House, it failed to be upfront; instead of delaying the launch or releasing in stages, it scrambled to save the project.

How CMS Could Have Done Things Differently

CMS was in charge of the project, but the blame falls on the Obama administration. Appointing CMS to lead the project was the first mistake made in a number of shortcomings that led to billions of wasted tax dollars and US citizens being delayed health care coverage.

While appointing a different agency, one that had experience delivering technology projects, was the first significant error, analyzing the oversights made by CMS leading up to the launch is the most practical way to assess what could have been done differently.  

Preparation

An understanding of the scale of Healthcare.gov would have dramatically influenced CMS to prepare better and implement project management best practices. One way the leadership at CMS could have prevented a launch failure was to realize they couldn't handle heading the project. Assigning lead manager and integrator roles to an outside firm or the lead vendor would have been more sensible.

Procurement

Many institutional challenges out of the hands of CMS contributed to the failed launch of Healthcare.gov. However, the agency had complete control over the procurement of government contractors. Simply following the Federal Acquisition Regulation (FAR) procurement framework could have prevented many organizational problems that led to the failure.

Had CMS adhered to standard government agency procurement guidelines, issues like the confusion around who was lead integrator wouldn't have existed. Furthermore, decisions like depending on data provided by Experian, a data source that neither the government nor the other contractors could do any data quality work on, would have likely been questioned if not denied when submitted for approval.

Adopt Iterative Software Development Framework

The project would have benefited from an iterative development philosophy and a lean software process such as Scrum or Kanban. Project managers could have organized sprints driven by the top priorities, including complete end-to-end testing of the technology solution. An iterative development framework would have increased visibility across multiple contractors and federal agencies, improved development quality, and reduced time to market.

Strong Leadership

Leadership was a fundamental problem that affected every aspect of the Healthcare.gov launch. Distribution of authority, management, and accountability created an environment where a functioning FFM was impossible to deliver on time. The project steering committee should have elected one project executive to make final decisions, hold contractors accountable, and communicate with other government agencies.

See "Consensus Is the Absence of Leadership" for more insights on this topic.

Set and Guard Technical Standards

A rushed, unqualified, and fractured committee led to the development of poor technology. As a result, CGI and other contractors experienced little to no oversight, allowing for an unfinished UI, a partially operating back-end, and unstable hosting.

The developers from CGI delivered an unpolished UI with excessive typos, a bloated directory, sloppy code, and even Lorem Ipsum placeholder text on the web pages. In addition, best practices around file compression were ignored, causing the website to take eight seconds to load and user account registration pages 71 seconds with client-side loading, according to a report by AppDynamics.

A basic small business website delivered at this standard would be unacceptable, let alone the focal point of one of the most transformative pieces of legislation in recent US history.

CGI should have been held to higher standards and required to deliver a polished, functioning UI.

While the front end was a disaster, fixing HTML, CSS, and JS and optimizing webpages can be done overnight. Healthcare.gov's back-end was a different story requiring systemic changes to function. More oversight was needed on the database and server-side development, but it could have only gone smoothly with drastic changes to leadership and organizational structure.

Improve Security

Healthcare.gov required an abundance of personal data to be submitted and collected. While security wasn't the project's primary failure, the servers were hacked in July 2014. Malware was uploaded into the system but failed to communicate with any external IP addresses. In addition, multiple security defects were found, including insecure transmission of personal data, unvalidated password resets, error stack traces revealing internal components, and violations of user data privacy.

Comprehensive security audits should be conducted before launch rather than on the fly after a site is live.

Implement Testing and Bug Fixing Protocols

Adequate testing could have prevented many of the FFM's problems; CMS was well aware of the issues but failed to communicate them to HHS and other government agencies. Regardless, testing protocols and a strategy to manage and fix bugs are necessary for an IT project of this scale.

CMS needed to coordinate unit and component testing by third parties much sooner than a couple of weeks before launch. Testing conducted by teams outside specific features of the project guards against bias and ensures the product works across various devices, browsers, and servers.

Another issue CMS encountered was communication with states implementing their own healthcare marketplaces. The deadline for states to declare their decision was pushed from November 2012 to December 2012, and some decisions weren't confirmed until as late as February 2013. Uncertainty about traffic volume should have led to overpreparation and expanded capacity testing of concurrent users. Instead, a load of only 2,000 simultaneous users was tested rather than the tens of thousands they should have expected.
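The gap between the tested load and launch-day demand can be shown with a toy model. The sketch below is an illustration, not the actual EIDM architecture: a backend with a fixed concurrent-session limit passes a test at the assumed 2,000 users cleanly, while launch-scale traffic overwhelms it.

```python
import threading

class CapacityModel:
    """Toy backend with a fixed concurrent-session limit;
    requests beyond the limit are rejected immediately."""
    def __init__(self, max_sessions):
        self.slots = threading.BoundedSemaphore(max_sessions)

    def handle(self):
        if not self.slots.acquire(blocking=False):
            return "error"  # saturated: the user sees a crash or timeout
        return "ok"         # slot is held, modeling a long-lived session

def load_test(server, users):
    results = [server.handle() for _ in range(users)]
    return results.count("ok"), results.count("error")

print(load_test(CapacityModel(2_000), users=2_000))    # (2000, 0) - test "passes"
print(load_test(CapacityModel(2_000), users=30_000))   # (2000, 28000) - launch day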

Phased or Staged Rollout

One way the project steering committee could have responded to the issues without pushing back the launch was to release the project in phases or stages. Below are multiple options the committee could have taken instead of the Big Bang launch approach:

> Release certain features that were ready, or could have been made ready when prioritized, before October 1, 2013.

> Roll out a beta version of the platform to a small number of applicants (by region, government employees, sample group, etc.) months before the deadline.

> Release the platform in strategic phases leading up to the deadline, e.g., encourage applicants to register early in August, request eligibility in September, and shop for plans in October.

> Limit the scope of the project months before the launch and focus on minimal requirements rather than continue expanding leading up to the hard deadline.
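Phased rollouts of this kind are usually driven by deterministic user bucketing. The snippet below is a generic sketch, not anything CMS actually built: hashing each applicant id into a stable bucket lets a team ramp from 1% to 100% while every user keeps the same answer across visits.

```python
import hashlib

def in_rollout(user_id: str, percent: int, feature: str = "ffm-beta") -> bool:
    """Deterministically map a user to a bucket in [0, 100); the same user
    always lands in the same bucket, so raising the percentage only ever
    adds users, never flips existing ones back out."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent

# Hypothetical ramp: 1% of applicants in August, 10% in September, all in October.
cohort = [f"applicant-{i}" for i in range(10_000)]
for percent in (1, 10, 100):
    enrolled = sum(in_rollout(user, percent) for user in cohort)
    print(f"{percent:>3}% rollout -> {enrolled} of {len(cohort)} applicants")
```

A staged ramp like this would have surfaced the identity-verification and capacity failures with thousands of affected users instead of millions.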

See "How Your Rollout in Waves Can End in a Tsunami" for more insights on this topic.

Face Reality

The problems facing the Healthcare.gov launch were apparent, and there's evidence in the GAO report suggesting CMS was well aware of the issues. In addition, a McKinsey report was released in April 2013, just a few months before the scheduled rollout. The report highlighted the initiative's complexity and identified more than a dozen critical risks spanning the marketplace technology and project governance.

The problems we've covered were clearly outlined in the report, as well as multiple definitions of success and concerns with a Big Bang launch approach. The McKinsey report also suggested several actionable methods to mitigate the risks. Still, the project steering committee did not act upon the McKinsey report's findings and recommendations before the system launch.

Whether CMS caved under the pressure of the deadline and the political consequences, or simply expected a positive outcome against all evidence, is unclear. All the warning signs pointed to an unsuccessful launch and should have been taken seriously, prompting the necessary adjustments.

See "It Is Time to Face Your Project's Reality" for more insights on this topic.

Closing Thoughts

The Healthcare.gov project exemplifies just about everything that can go wrong with software integration between the federal government and the private sector. The project's failures are incredibly complicated due to the influence of multiple government agencies and a hostile political climate. When one problem is identified, more come to the surface with increasingly challenging solutions.

But was the project doomed to fail just because of the layers of government bureaucracy? I don't believe so.

Had the committee exercised strong leadership with personnel who had experience working on institutional software, the project could have gone much differently. The remarkable aspect of Healthcare.gov is how fast it was turned around, given the state of the project after the launch. Once the White House was fully aware of the problems and competent leaders were put in place, the site was made functional in a few weeks and essentially operational by December 2013.

In a nutshell: Experienced leadership leads to realistic expectations, a healthy organizational structure, and strong communication. Never underestimate the importance of the person or group that is in charge.

Don’t let your project fail like this one!

Discover here how I can help you turn it into a success.

For a list of all my project failure case studies just click here.

Sources

> https://www.bloomberg.com/news/articles/2014-09-24/obamacare-website-costs-exceed-2-billion-study-finds

> https://oig.hhs.gov/oei/reports/oei-03-14-00231.asp

> https://hackernoon.com/small-is-beautiful-the-launch-failure-of-healthcare-gov-5e60f20eb967

> https://www.appdynamics.com/blog/product/technical-deep-dive-whats-impacting-healthcare-gov/

> https://www.gao.gov/products/gao-14-694

