Sunday, June 30, 2019

Your Projects Should Start Slow in Order to Run Fast Later

I am a big believer in short and fat projects, and I am very vocal about it. Because of this, I am often asked whether I advocate fast-tracking projects and cutting back on upfront work.

No, I do not. On the contrary, I think more time should be spent on upfront work.

Yes, to keep costs down and maximize benefits you should keep implementation phases short and delays small. This should not be seen as an excuse for fast-tracking projects; that is, rushing them through decision making for an early start.

For smaller projects, this might be something you can get away with, but for large technology projects all you do if you hit the ground running is fail hard. Front-end planning and validation need to be thorough before deciding to give the green light to a project, or to stop it.

You need to go slow at first (during project initiation) in order to run fast later (during project delivery).

Unfortunately, many times the situation is exactly the opposite: front-end planning and validation are rushed, bad projects are not stopped, important projects do not get the money, people, or management attention they need, implementation phases and delays are long, costs explode, and value diminishes.

In a nutshell: Stop the madness. Start slow, and run fast later.

You can buy my book The Art of Technology Project Portfolio Management on Amazon by clicking on the image.


Tuesday, June 25, 2019

Your Best Insurance for Multimillion Dollar Tech Projects - Independent Project Reviews

It was to be a great digital leap for Germany’s biggest discount grocer. Instead, after seven years and €500 million in sunk costs, the project tasked with creating Lidl’s new inventory management system with SAP was killed in July 2018.

In planning since 2011, the project quickly lost its shine when roughly a thousand staff and hundreds of consultants started the implementation. The costs quickly spiraled beyond the two groups’ estimations without bringing the project much closer to success.

***

The United Kingdom’s National Health Service launched the largest public-sector information technology (IT) program ever attempted, the National Programme for IT.  It was originally budgeted to cost approximately £6 billion over the lifetime of the major contracts.

These contracts were awarded to some of the biggest players in the IT industry, including Accenture, CSC, Atos Origin, Fujitsu, and BT. After significant delays, stakeholder opposition, and implementation issues, the program was dropped in 2011, almost 10 years after its inception and with costs estimated at over £10 billion.

***

The American car rental company Hertz hired Accenture in 2016 to completely overhaul its online presence. The new website was expected to launch in December 2017, was then delayed to April 2018, and then delayed indefinitely.

While Hertz weathered the delays, it found itself in a bigger nightmare: it was saddled with a product and design that didn't do half of what it was expected to do and that remains incomplete. “By that point, Hertz no longer had any confidence that Accenture was capable of completing the project, and Hertz terminated [the contract].” The car rental company launched a formal lawsuit against Accenture this past May (2019), suing for the $32 million USD it paid Accenture and millions more to cover the cost of fixing the mess. “Accenture never delivered a functional website or mobile app,” Hertz representatives claimed.

***

These three examples (and there are many more like them) have one thing in common: the plug was pulled far too late.

All too often, project teams, sponsors, and stakeholders lose sight of the larger vision and are unable to course-correct and make strategic decisions. There are many reasons for this: overconfidence, oversimplification, avoiding pain, binding contracts, lack of skills, lack of experience, egos, lack of information, and confirmation bias.

So how do you insure yourself against such catastrophic failures that can bring your organization to its knees?

Simple. Independent project reviews.

What Are Independent Project Reviews, You Ask?

When we talk about a project review, there are many names thrown around that, at face value, are all taken to mean the same thing: project review, project health check, project audit, project retrospective, and project post mortem. But are they really the same? The short answer is “no.”

A project audit bears on issues of compliance and deals with the present. An audit aims to show the extent to which your project conforms to the required organizational and project standards: its fidelity. So, if your organization uses PRINCE2, or its own project management methodology, an audit will look at how closely you follow the processes. An audit can take place either during the project or after it is completed.

A project retrospective, or post mortem, is about learning lessons so that your next project will run better or, at least, equally well. A project retrospective is performed after the project closes, so is of no use to the project itself.

A project review has to do with project success. A project review will give you a good understanding of the status of your project and whether it is on track to deliver against your definition of project success on the following 3 levels:

1) Project delivery success: will the project delivery be successful? Essentially, this assesses the classic triangle of scope, time, and budget.

2) Product or service success: whether the product or service itself is deemed successful once live (e.g. the system is used by all users in scope, up-time is 99.99%, customer satisfaction has increased by 25%, and operational costs have decreased by 15%).

3) Business success: this has to do with whether the product or service brings value to the overall organization, and how it contributes financially and/or strategically to the business’s success.

Note that “independent” means the person (or team) completing the review is not involved in the project and has no ties to any of the companies working on it. In short, reviews by vendors, implementation partners, or your own organization have no place here.

So How Does the Reviewing Party Get the Information They Need?

Below are the 12 building blocks of a typical project review. Their order is not carved in stone and can be adapted based on availability and priorities. It is worth noting that the results of one building block will often be an input to another.


1) Success: Understanding the project success criteria mentioned above

2) Stakeholders: Understanding the project stakeholders, i.e. their desired outcomes and expectations

3) Governance: Sponsors, steering committee, and controlling. How does it work in theory? How does it work in practice?

4) Engineering: Are there separate development, testing, and production environments? Is there continuous integration? Bug reports? How is quality so far?

5) Technology: Solution architecture, stable technologies, back-up, disaster recovery, and performance

6) Team: How is the project team working together? What are their capacity, collective skills, relationships, and project management methods?

7) Scope: Understanding when the project is “done.” Is it defined? At what level? Is it clear? Is there a change management process in place? What changes have taken place since the beginning?

8) Schedule: Is there a plan? Is it realistic? Are there contingencies? Have there been any significant changes to date?

9) Financials: Is there a clear overview of costs? Are these complete and correct? What about forecasts, budgeting, and controlling processes?

10) Impact: Who and what will be impacted when the project goes live? What changes need to take place to anticipate and respond to associated needs? How will the change be managed? How is it operationalized?

11) Risk: Assessment of (currently) identified risks, identification of new risks, and review mitigation actions in place

12) Contracts: Review existing contractual obligations for all parties involved

Closing Thoughts

Smart companies will organize periodic reviews of large, multi-year, strategic projects to verify that all components are on track; that the technical foundation is solid; and that the business case(s) remain valid. This can be performed once a year, or at certain project milestones.

When your company is unwilling to make this investment, the second-best approach is to organize a review the moment you think one of your key projects is in trouble.

An independent project review will give you:

> An outside 360-degree view of the current status of your project;

> The information you need in order to make good decisions;

> An outside opinion on the project’s likelihood of success (project delivery success, product/service success, and business success); and

> Suggestions for corrective actions on the discovered project issues and challenges

In a nutshell: An independent project review is your best insurance against losing touch with reality.

When you need some guidance on how to define and measure project success, just download the Project Success Model by clicking on the image.



Tuesday, June 18, 2019

Your (Lack of) Training Efforts Can Easily Ruin the Outcome of an Otherwise Well-Executed Project

Any system is only as good as how well it is used. Whether it is a CRM, an ERP, or any other system, when users don't know how to use it effectively, the benefits of the new system for your company will be small, or even negative.

So educating and training your employees is critical to the success of a project — you can never over-train employees on a new system.

Unfortunately, it is hardly ever done right. How many of the below statements sound familiar to you?

“The training was too fast and did not allow time for people to move up the learning curve. There was a very small time window between training and go-live.”

“The training was not supported by written procedures or reference materials — the project team thought some online ‘help’ files would suffice; they didn’t.”

“I think the training team thought they did a great job as their end-of-session evaluations showed good results, but the real measure was the subsequent level of demand on the ‘help desk’ and that showed the training failed to meet the needs of the business.”

“The training was system-operational based. It was too limited. We did not know the business context, the opportunities, why the changes were required, etc. We were just told, this is how you do it now. The business change was ignored in the training scenario, yet this was the most important bit.”

“There were no ‘sustain’ activities, so people quickly reverted to their old habit patterns; often working around the new system to create the old processes as closely as possible. Equally, the new employee’s onboarding training was ignored. We tried to give them the implementation training but found it was inadequate for people new to the firm and its processes.”

“The new systems introduced new disciplines. Correct account codes needed to be entered at source, purchase orders needed correct part numbers on them before they could be sent. These and many other ‘disciplines’ were introduced as part of the system but without any pre-emptive education or communications. They were therefore seen as examples of the new systems’ complexity and increased workload. The downstream benefits were neither known nor considered. As a result, the system got a bad name as ‘too cumbersome’.”

We are all aware of this, and yet we somehow refuse to spend sufficient funds, focus, and time on employee education and training.

In a nutshell: In order for your next project that introduces a new system to be a success, make sure that training is a priority.

When you need some guidance on how to define and measure project success, just download the Project Success Model by clicking on the image.



Thursday, June 13, 2019

A Powerful Story to Help You With Stakeholder Management

When dealing with one or multiple project stakeholders I often use the story below as the start of a planning workshop. Sometimes it’s at the initiation phase of a project, but more often during re-scoping of projects because of time and/or budget reasons.

A philosophy professor stood before his class and had some items in front of him. When the class began, wordlessly he picked up a large empty jar and proceeded to fill it with rocks about two inches in diameter. He then asked the students if the jar was full. They agreed that it was.

So the professor then picked up a box of pebbles and poured them into the jar. He shook the jar lightly. The pebbles, of course, rolled into the open areas between the rocks. He then asked the students again if the jar was full. They agreed it was.

The students laughed. The professor picked up a box of sand and poured it into the jar. Of course, the sand filled up everything else.

The professor then produced two cans of beer from under the table and proceeded to pour the entire contents into the jar, effectively filling the empty space between the grains of sand. The students laughed again.

“Now,” said the professor, “I want you to recognize that this is your life. The rocks are the important things – your family, your partner, your health, your children – things that if everything else was lost and only they remained, your life would still be full. The pebbles are the other things that matter, like your job, your house, your car. The sand is everything else. The small stuff. 

“If you put the sand into the jar first, there is no room for the pebbles or the rocks. The same goes for your life. If you spend all your time and energy on the small stuff, you will never have room for the things that are important to you. Pay attention to the things that are critical to your happiness. Play with your children. Take time to get medical checkups. Take your partner out dancing. There will always be time to go to work, clean the house, give a dinner party and change a light bulb.  

“Take care of the rocks first – the things that really matter. Set your priorities. The rest is just sand.”

One of the students raised her hand and inquired what the beer represented. The professor smiled. "I'm glad you asked. It just goes to show you that no matter how full your life may seem, there's always room for a couple of beers."

After telling the story I draw a big jar on a whiteboard and ask my stakeholders what the big rocks are for their project. What key elements drive the most benefits? If we could realize only ONE thing, what would it be? Why?

When you have multiple stakeholders (sometimes with conflicting interests) this exercise will help you make it clear to them that you cannot do everything for everybody. And you will have all the right people in the room to come to an agreement.

After we have defined and agreed on the big rocks, we check to see if they all fit in the jar. When they don’t, we start talking about a bigger jar (more time and/or budget), or fewer rocks (scope reduction). When selecting scope reduction, please be very aware of value creep.

Only when the big rocks are all in the jar do we start discussing the pebbles.

In a nutshell: Yes, having a beer with your stakeholders after discussing needs and priorities really helps with your stakeholder relationships.

When you need some guidance on how to define and measure project success, just download the Project Success Model by clicking on the image.



Wednesday, June 05, 2019

Case Study: The $440 Million Software Error at Knight Capital

Knight Capital Group was an American global financial services firm engaging in market making, electronic execution, and institutional sales and trading. In 2012 Knight was the largest trader in U.S. equities, with a market share of around 17 percent on the New York Stock Exchange (NYSE) as well as on the Nasdaq Stock Market. Knight’s Electronic Trading Group (ETG) managed an average daily trading volume of more than 3.3 billion shares, trading over $21 billion daily.

It took 17 years of dedicated work to build Knight Capital Group into one of the leading trading houses on Wall Street. And it all nearly ended in less than one hour.

What happened to Knight on the morning of August 1, 2012, is every CEO’s nightmare: A simple human error, easily spotted with hindsight but nearly impossible to predict in advance, threatened to end the firm.

At Knight, some new trading software contained a flaw that became apparent only after the software was activated when the New York Stock Exchange (NYSE) opened that day. The errant software sent Knight on a buying spree, snapping up 150 different stocks at a total cost of around $7 billion, all in the first hour of trading.

Under stock exchange rules, Knight would have been required to pay for those shares three days later. However, there was no way it could pay, since the trades were unintentional and had no source of funds behind them. The only alternatives were to try to have the trades canceled, or to sell the newly acquired shares the same day.

Knight tried to get the trades canceled. Securities and Exchange Commission (SEC) chairman Mary Schapiro refused to allow this for most of the stocks in question, and this seems to have been the right decision. Rules were established after the “flash crash” of May 2010 to govern when trades should be canceled. Knight’s buying binge did not drive up the price of the purchased stocks by more than 30 percent, the cancellation threshold, except for six stocks. Those transactions were reversed. In the other cases, the trades stood.

This was very bad news for Knight but was only fair to its trading partners, who sold their shares to Knight’s computers in good faith. Knight’s trades were not like those of the flash crash, when stocks of some of the world’s largest companies suddenly began trading for as little as a penny and no buyer could credibly claim the transaction price reflected the correct market value.

Once it was clear that the trades would stand, Knight had no choice but to sell off the stocks it had bought. Just as the morning’s buying rampage had driven up the price of those shares, a massive sale into the market would likely have forced down the price, possibly to a point so low that Knight would not have been able to cover the losses.

Goldman Sachs stepped in to buy Knight’s entire unwanted position at a price that cost Knight $440 million – a staggering blow, but one the firm might be able to absorb. And if Knight failed, the only injured party, apart from Knight’s shareholders (including Goldman), would have been Goldman itself.

Disposing of the accidentally purchased shares was only the first step in Knight CEO Thomas Joyce’s battle to save his company. The trades had sapped the firm’s capital, which would have forced it to greatly cut back its business, or maybe to stop operating altogether, without a cash infusion. And as word spread about the software debacle, customers were liable to abandon the company if they did not trust its financial and operational capacities.

A week later, Knight received a $400 million cash infusion from a group of investors, and by the next summer, it was acquired by a rival, Getco LLC. This case study will discuss the events leading up to this catastrophe, what went wrong, and how this could be prevented.

To download all my Project Failure Case Studies in a single eBook just click on the image.

Timeline of Events

Some of Knight’s biggest customers were the discount brokers and online brokerages such as TD Ameritrade, E*Trade, Scottrade, and Vanguard. Knight also competed for business with financial services giants like Citigroup, UBS, and Citadel. However, these larger competitors could internalize increasingly larger amounts of trading away from the public eye in their own exclusive markets or shared private markets, so-called dark pools. Since 2008, the portion of all stock trades in the U.S. taking place away from public markets has risen from 15 percent to more than 40 percent.

In October 2011, the NYSE proposed a dark pool of its own, called the Retail Liquidity Program (RLP). The RLP would create a private market of traders within the NYSE that could anonymously transact shares for fractions of pennies more or less than the displayed bid and offer prices, respectively. The RLP was controversial even within NYSE Euronext, the parent company of the NYSE; its CEO, Duncan Niederauer, had written a public letter in the Financial Times criticizing dark pools for shifting “more and more information… outside the public view and excluded from the price discovery process.”

During the months of debate, Joyce had not given the RLP much chance of approval, saying in one interview, “Frankly, I don’t see how the SEC can possibly OK it.” Yet in early June 2012, the NYSE received SEC approval of its RLP and quickly announced that the RLP would go live on August 1, 2012, giving market makers just over 30 days to prepare. The SEC’s decision benefited large institutional investors, who could now buy or sell large blocks of shares with relative anonymity and without moving the public markets; however, it came once again at the expense of market makers. Joyce insisted on participating in the RLP because giving up the order flow without a fight would have further dented profits in Knight’s best line of business.

What Went Wrong

With only a month between the RLP’s approval and its go-live, Knight’s software development team worked feverishly to make the necessary changes to its trade execution systems – including SMARS, its algorithmic, high-speed order router. SMARS stands for Smart Market Access Routing System.

SMARS was able to execute thousands of orders per second and could compare prices between dozens of different trading venues within fractions of a second.

A core feature of SMARS is to receive orders from other upstream components in Knight’s trading platform (“parent” orders) and then, as needed based on the available liquidity and price, send one or more representative (“child”) orders to downstream, external venues for execution.
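To make the parent/child mechanics concrete, here is a minimal, hypothetical sketch in Python (venue names and quantities are invented; this is not Knight's actual code) of a router that fans a parent order out into child orders and stops once the parent quantity is covered:

```python
def route_parent_order(parent_qty, venue_liquidity):
    """Split a parent order into (venue, qty) child orders until it is covered."""
    remaining = parent_qty
    children = []
    for venue, available in venue_liquidity:
        if remaining <= 0:
            break  # parent order filled: stop sending child orders
        qty = min(remaining, available)
        children.append((venue, qty))
        remaining -= qty
    return children

# A 1,000-share parent order sourced across three venues:
print(route_parent_order(1000, [("NYSE", 400), ("NASDAQ", 400), ("BATS", 500)]))
# -> [('NYSE', 400), ('NASDAQ', 400), ('BATS', 200)]
```

The crucial detail is the stopping condition: as we will see, it was exactly this kind of throttle that the defective server lacked.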

Power Peg

The new RLP code in SMARS replaced some unused code in the relevant portion of the order router; the old code had previously been used for an order algorithm called “Power Peg,” which Knight had stopped using in 2003. Power Peg was a test program that bought high and sold low; it was specifically designed to move stock prices higher and lower in order to verify the behavior of Knight’s other proprietary trading algorithms in a controlled environment. It was never meant to be used in the live, production environment.

There were grave problems with Power Peg in the current context. First, the Power Peg code remained present and executable at the time of the RLP deployment despite its lack of use. Keeping such “dead code” is bad practice, but common in large software systems maintained for years. Second, the new RLP code had repurposed a flag that was formerly used to activate the Power Peg code; the intent was that when the flag was set to “yes,” the new RLP component – not Power Peg –  would be activated. Such repurposing often creates confusion, had no substantial benefit, and was a major mistake, as we shall see shortly.

Code refactoring

There had been substantial code refactorings in SMARS over the years without thorough regression testing. In 2005, Knight changed the cumulative quantity function, which counted the number of shares of the parent order that had been executed and filled in order to decide whether to route another child order. The function was now invoked earlier in the SMARS workflow, which in theory was a good idea to prevent excess system activity. In practice, it was now disconnected from Power Peg (which used to call it directly) and could no longer throttle the algorithm when orders were filled, and Knight never retested Power Peg after this change.

Manual deployment

In the week before go-live, a Knight engineer manually deployed the new RLP code in SMARS to its eight servers. However, the engineer made a mistake and did not copy the new code to one of the servers. Knight did not have a second engineer review the deployment, and neither was there an automated system to alert anyone to the discrepancy. Knight also had no written procedures requiring a supervisory review, all facts we shall return to later.

The crash

On August 1, 8:01 a.m. EST, an internal system called BNET generated 97 email messages that referenced SMARS and identified an error described as “Power Peg disabled.” These obscure, internal messages were sent to Knight personnel, but their channel was not designated for high-priority alerts and the staff generally did not review them in real-time; however, they were the proverbial smoke of the smoldering code and deployment bits about to burn, and it was a missed opportunity to identify and fix the DevOps issue prior to market open.

At 9:30 a.m. EST, Knight began receiving RLP orders from broker-dealers, and SMARS distributed the incoming work to its servers. The seven servers that had the new RLP code processed the orders correctly. However, orders sent to the eighth server with the defective Power Peg code activated by the repurposed flag soon triggered the fault line of a financial tectonic plate. This server began to continuously send child orders for each incoming parent order without regard to the number of confirmed executions Knight had already received from other trading venues.

The results were immediately catastrophic. For the 212 incoming parent orders processed by the defective Power Peg code, SMARS sent thousands of child orders per second that would buy high and sell low, resulting in 4 million executions in 154 stocks for more than 397 million shares in approximately 45 minutes. For 75 of these stocks, Knight’s executions jostled prices more than 5% and comprised more than 20% of trading volume; for 37 stocks, prices lurched more than 10% and Knight’s executions constituted more than 50% of trading volume.

Following the flash crash of May 6, 2010, in which the Dow Jones Industrial Average (DJIA) lost over 1000 points in minutes, the SEC announced several new rules to regulate securities trading.

1) Circuit breakers were required to stop trading if the market experienced what was labeled as “significant price fluctuations” of more than 10 percent during a five-minute period.

2) The SEC required more specific conditions governing the cancellation of trades. For events involving between five and 20 stocks, trades could be cancelled if they were at least 10 percent away from the “reference price,” the last sale before pricing was disrupted; for events involving more than 20 stocks, trades could be cancelled if they deviated more than 30 percent from the reference price.

3) Securities Exchange Act Rule 15c3-5 (the Market Access Rule) went into effect, requiring exchanges and broker-dealers to implement risk management controls to ensure the integrity of their systems, as well as executive review and certification of those controls.

Since the flash crash rules were designed for price swings, not trading volume, they did not kick in as intended and stop trading because few of the stocks traded by Knight on that fateful day exceeded the 10 percent price change threshold.

By 9:34 a.m., NYSE computer analysts noticed that market volumes were double the normal level and traced the volume spike back to Knight. Niederauer tried calling Joyce, but Joyce was still at home recovering from knee surgery.

The NYSE then alerted Knight’s chief information officer, who gathered the firm’s top IT people; most trading shops would have flipped a kill switch in their algorithms or would have simply shut down systems. However, Knight had no documented procedures for incident response, again, another fact we shall return to later. So, it continued to fumble in the dark for another 20 minutes, deciding next that the problem was the new code.

Because the “old” version allegedly worked, Knight reverted to the old code still running on the eighth server and reinstalled it on the other seven. As it turned out, this was the worst possible decision, because now all eight servers had the defective Power Peg code activated by the repurposed RLP flag and executing without a throttle.

It was not until 9:58 a.m. that Knight engineers identified the root cause and shut down SMARS on all the servers; however, the damage had been done. Knight had executed over 4 million trades in 154 stocks totaling more than 397 million shares; it assumed a net long position in 80 stocks of approximately $3.5 billion as well as a net short position in 74 stocks of approximately $3.15 billion.

How Knight Capital Could Have Done Things Differently

This case study contains several lessons useful for project managers, IT professionals, and business leaders. Knight could have prevented the failure and minimized the damage with a variety of modern software development and operating practices (DevOps). Below, I describe eight of these measures and how they could have made a difference for Knight Capital.

Use of Version Control

Do not run dead code. Instead, always prune dead code and use version control systems to track the changes. You should not re-purpose configuration flags; rather, activate new features with new flags.

Version control is any kind of practice that tracks and provides control over changes to source code. Teams can use version control software to maintain documentation and configuration files as well as source code.

As teams design, develop, and deploy software, it is common for multiple versions of the same software to be deployed in different sites and for the software's developers to be working simultaneously on updates. Bugs or features of the software are often only present in certain versions (because of the fixing of some problems and the introduction of others as the program develops).

Therefore, for the purposes of locating and fixing bugs, it is vitally important to be able to retrieve and run different versions of the software to determine in which version(s) the problem occurs. It may also be necessary to develop two versions of the software concurrently: for instance, where one version has bugs fixed, but no new features (branch), while the other version is where new features are worked on (trunk).
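The advice about flags can be illustrated with a small, hypothetical Python sketch (flag names are invented; this is not Knight's code): the new feature gets its own flag, while the retired flag and its dead code stay on a path to deletion rather than being reused.

```python
# Hypothetical feature flags; a real system might load these from configuration.
FLAGS = {
    "power_peg_enabled": False,   # legacy flag: never repurpose; delete with its dead code
    "rlp_routing_enabled": True,  # new feature gets its own, brand-new flag
}

def route_order(order):
    """Pick a routing path based on explicit, single-purpose flags."""
    if FLAGS["rlp_routing_enabled"]:
        return "rlp"  # new Retail Liquidity Program path
    if FLAGS["power_peg_enabled"]:
        raise RuntimeError("Power Peg is retired and must not run in production")
    return "default"
```

Because each flag means exactly one thing, turning the new feature on can never silently resurrect old behavior.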

Writing Unit Tests

The purpose of unit testing is not to find bugs. A unit test is a specification of the expected behavior of the code under test, and the code under test is the implementation of that expected behavior. Unit tests and the code under test thus check the correctness of, and protect, each other. Later, when someone changes the code under test in a way that alters the behavior the original author expected, the test will fail. If your code is covered by a reasonable number of unit tests, you can maintain it without breaking existing features. That is why Michael Feathers, in his seminal book "Working Effectively with Legacy Code", defines legacy code as code without unit tests. Without unit tests, your development efforts will be a major risk every time you change legacy code.
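As an illustration, here is what a unit test as a specification might have looked like for the throttling behavior in this case study. The function names and numbers are invented for the example; the point is that the test pins down the expected behavior (stop routing once the parent order is filled), so any later refactoring that breaks it fails loudly.

```python
import unittest

def cumulative_quantity(fills):
    """Total shares of a parent order filled so far (illustrative helper)."""
    return sum(fills)

def should_send_child_order(parent_qty, fills):
    """Specification: route another child order only while the parent is unfilled."""
    return cumulative_quantity(fills) < parent_qty

class ThrottleSpec(unittest.TestCase):
    def test_routes_while_parent_unfilled(self):
        self.assertTrue(should_send_child_order(100, [40, 30]))

    def test_stops_once_parent_filled(self):
        self.assertFalse(should_send_child_order(100, [40, 30, 30]))

if __name__ == "__main__":
    unittest.main(exit=False)
```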

Code Reviews

Code review is a systematic examination (sometimes referred to as peer review) of source code. It is intended to find mistakes overlooked in software development, improving the overall quality of software. Reviews are done in various forms such as pair programming, informal walkthroughs, and formal inspections.

Automated Tests and Test Automation

In the world of testing in general, and continuous integration and delivery in particular, there are two types of automation:

1) Automated Tests
2) Test Automation

While it might just seem like two different ways to say the same thing, these terms actually have very different meanings.

Automated tests are tests that can be run automatically, usually written in a programming language. Here we are talking about individual test cases: unit tests, integration/service tests, performance tests, end-to-end tests, or acceptance tests. The latter are also known as specification by example.

Test automation is a broader concept that includes automated tests. From my perspective, it should mean the full automation of test cycles, from check-in up to deployment – also called continuous testing. Both automated testing and test automation are important to continuous delivery, but it is really the latter that makes high-quality continuous delivery possible.
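The distinction can be sketched as follows: the individual automated tests are the stages, and test automation is the pipeline that chains them from check-in to deployment. This is an illustrative Python sketch, not a real CI tool; in practice each stage would shell out to an actual test runner:

```python
# Hypothetical continuous-testing sketch: a check-in triggers the full
# chain of test stages, and deployment is only reached if every stage passes.

def run_pipeline(stages):
    """stages: list of (name, callable) pairs; each callable returns True/False."""
    for name, stage in stages:
        if not stage():
            print(f"{name} failed - pipeline aborted, nothing is deployed")
            return False
    print("all stages green - deployment can proceed")
    return True

# Illustrative stages; real ones would invoke unit, integration,
# performance, end-to-end, and acceptance test suites.
stages = [
    ("unit tests", lambda: True),
    ("integration tests", lambda: True),
    ("acceptance tests", lambda: True),
]
```

The design choice that matters is the early return: a single failing stage stops the cycle, so a broken build can never reach production by accident.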

Had Knight implemented automated tests and test automation for the new and existing SMARS functionalities, they would have caught the error before deploying it to production.

Automated Deployment Process

It is not enough to build great software and test it; you also have to ensure it is delivered to market correctly so that your customers get the value you are delivering (and so you don’t bankrupt your company). The engineer(s) who deployed SMARS are not solely to blame here – the process Knight had set up was not appropriate for the risk they were exposed to. Additionally, their process (or lack thereof) was inherently prone to error. Any time your deployment process relies on humans reading and following instructions you are exposing yourself to risk. Humans make mistakes. The mistakes could be in the instructions, in the interpretation of the instructions, or in the execution of the instructions.

Deployments need to be automated and repeatable and as free from potential human error as possible. Had Knight implemented an automated deployment system – complete with configuration, deployment, and test automation – the error that caused the nightmare would have been avoided.
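A deployment script in this spirit might look like the following sketch (all function names are hypothetical placeholders for real install, verify, and rollback steps). The essential properties are that the same scripted steps run on every server, each server is verified, and a single failure rolls the whole release back – no human has to remember to copy code to the eighth server:

```python
# Hypothetical sketch of a repeatable, automated deployment with
# verification and automatic rollback.

def deploy(servers, version, install, verify, rollback):
    """Install `version` on every server; roll back everything on any failure."""
    deployed = []
    for server in servers:
        install(server, version)
        deployed.append(server)
        if not verify(server, version):
            # One bad server taints the whole release: undo everything.
            for s in deployed:
                rollback(s)
            return False
    return True
```

Because the process is code, it is identical on every run and every server; the Knight scenario of one host silently left on the old version cannot arise.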

Step-by-Step Deployment Process Guide 

Anybody, even somebody who does not normally deploy, should be able to deploy to production with this guide at hand. Of course, the further you move toward automated deployment, the smaller this guide becomes, because the documentation of the process is encoded in your automation. The probability of doing something wrong with a step-by-step guide (or a checklist) is many times smaller than without one; this has been demonstrated repeatedly in medicine.

Timeline

The timeline was another reason Knight failed to deliver the RLP solution. Knight’s IT project managers and CIO should have pushed back on the hyper-aggressive delivery schedule and countered its business leaders with an alternative phased schedule instead of the Big Bang – pun intended. Thirty days to implement, test, and deploy major changes to an algorithmic trading system that makes markets worth billions of dollars every day is impulsive, naive, and reckless.

Risk Management

Risk management is a vital capability for a modern organization, especially for financial services companies. The SEC’s report (see References) concluded: “Although automated technology brings benefits to investors, including increased execution speed and some decreased costs, automated trading also amplifies certain risks. As market participants increasingly rely on computers to make order routing and execution decisions, it is essential that compliance and risk management functions at brokers or dealers keep pace… Good technology risk management practices include quality assurance, continuous improvement, controlled user acceptance testing, process measuring, management and control, regular and rigorous review for compliance with applicable rules and regulations, an independent audit process, technology governance that prevents software malfunctions, system errors and failures, service outages, and when such issues arise, a prompt, effective, and risk-mitigating response.”

While Knight had order controls in other systems, it did not compare orders exiting SMARS with those that entered it. Knight’s primary risk monitoring tool, known as “PMON,” is a post-execution position monitoring system. At the opening of the market, senior Knight personnel observed a large volume of positions in a special account called 33 that temporarily held multiple types of positions, including positions resulting from executions that Knight received back from markets that its systems could not match to the unfilled quantity of a parent order. There was a $2 million gross limit on the 33 account, but it was not linked to any automated controls over Knight’s overall financial exposure.

Furthermore, PMON relied entirely on human monitoring, did not generate automated alerts, and did not highlight the display of account exposures based on whether a limit had been exceeded. Moreover, Knight also had no circuit breakers, which is a standard pattern and practice for financial services companies.
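The missing controls can be illustrated with a sketch (the class and numbers below are hypothetical, loosely inspired by the $2 million gross limit on the 33 account): a pre-trade gate that rejects orders breaching a hard exposure limit and trips a circuit breaker, instead of relying on a human watching a PMON-style screen:

```python
# Hypothetical sketch of an automated pre-trade control with a
# circuit breaker, in place of purely human position monitoring.

class RiskGate:
    def __init__(self, gross_limit):
        self.gross_limit = gross_limit
        self.gross_exposure = 0.0
        self.tripped = False  # circuit-breaker state

    def submit(self, notional):
        if self.tripped:
            return "REJECTED: circuit breaker tripped"
        if self.gross_exposure + abs(notional) > self.gross_limit:
            # Breach of the hard limit halts all further trading
            # automatically - no human needs to notice first.
            self.tripped = True
            return "REJECTED: gross limit exceeded, trading halted"
        self.gross_exposure += abs(notional)
        return "ACCEPTED"
```

With such a gate in the order path, runaway order flow is capped at the configured limit within milliseconds rather than the 45 minutes it took humans to stop SMARS.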

Closing Thoughts

Although Knight was one of the most experienced companies in automated trading at the time (and the software that goes with it), it failed to implement many of the standard DevOps best practices that could have prevented this disaster at any number of intervals.

Knight was eventually acquired by another market-making rival, Virtu Financial, in July 2017 for $1.4 billion. The silver lining to the story was that Knight was not too big to fail, and the market handled the failure with a relatively organized rescue without the help of taxpayers. However, a dark cloud remains: market data suggests that 70 percent of U.S. equity trading is now executed by high-frequency trading firms, and one can only wonder when, not if, the next flash crash will occur.

Other Project Failure Case Studies

> For an overview of all case studies I have written please click here.

> To download all my Project Failure Case Studies in a single eBook just click on the image.

References

> SECURITIES EXCHANGE ACT OF 1934 Release No. 70694 / October 16, 2013, ADMINISTRATIVE PROCEEDING File No. 3-15570

Read more…