Would Your Team Work With the Chaos Monkey?

Advances in large-scale, distributed software systems are changing the game for software engineering. As an industry, we are quick to adopt practices that increase flexibility of development and velocity of deployment. An urgent question follows on the heels of these benefits: How much confidence can we have in the complex systems that we put into production?

Chaos Engineering

Even when all of the individual services in a distributed system are functioning properly, the interactions between those services can cause unpredictable outcomes. Unpredictable outcomes, compounded by rare but disruptive real-world events that affect production environments, make these distributed systems inherently chaotic.

We need to identify weaknesses before they manifest in system-wide, aberrant behaviors. Systemic weaknesses could take the form of: improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes; etc. We must address the most significant weaknesses proactively, before they affect our customers in production. We need a way to manage the chaos inherent in these systems, take advantage of increasing flexibility and velocity, and have confidence in our production deployments despite the complexity that they represent.
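Two of the weaknesses above, retry storms and missing fallbacks, can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the function name, the default attempt count, and the delay values are hypothetical and not taken from any Netflix tool. It retries a failing call with capped exponential backoff and random jitter, and returns a fallback value once the attempts are used up.

```python
import random
import time


def call_with_retries(operation, fallback, max_attempts=3, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep, rng=random.random):
    """Call operation(); on failure, retry with capped, jittered backoff.

    Returns fallback() if every attempt fails, so the caller gets a
    degraded answer instead of an error. All defaults are illustrative.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                break  # out of attempts, use the fallback below
            # Wait a random fraction of the capped exponential delay so that
            # many clients do not retry in lockstep against a struggling service.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)
    return fallback()
```

A caller might wrap a request to a recommendation service and fall back to a cached or default list. Whether those settings hold up under a real outage is exactly the kind of question a chaos experiment is meant to answer.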

An empirical, systems-based approach addresses the chaos in distributed systems at scale and builds confidence in the ability of those systems to withstand realistic conditions. We learn about the behavior of a distributed system by observing it during a controlled experiment. This is called Chaos Engineering.

Chaos Monkey

Chaos Engineering was the guiding philosophy when Netflix built Chaos Monkey, a tool that randomly disables Amazon Web Services (AWS) production instances to make sure you can survive this common type of failure without any customer impact. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables, all while you continue serving your customers without interruption.
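The core idea can be shown with a toy, in-memory simulation. This is not Chaos Monkey's code and it does not call AWS. The Instance class and both function names are hypothetical and exist only to show what "randomly disable one instance and keep serving" means.

```python
import random


class Instance:
    """Stand-in for a server instance (hypothetical, in-memory only)."""

    def __init__(self, instance_id):
        self.instance_id = instance_id
        self.running = True


def unleash_monkey(instances, rng=random):
    """Terminate one randomly chosen running instance; return it, or None."""
    running = [inst for inst in instances if inst.running]
    if not running:
        return None
    victim = rng.choice(running)
    victim.running = False
    return victim


def serve_request(instances):
    """Route to any healthy instance; fail only if none are left."""
    for inst in instances:
        if inst.running:
            return f"served by {inst.instance_id}"
    raise RuntimeError("no healthy instances")
```

With three instances in a group, one call to unleash_monkey leaves two running and serve_request still succeeds. A group of one would fail, which is the single point of failure the tool is meant to expose.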

By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, you can learn about the weaknesses of your system and build automatic recovery mechanisms to deal with them. So the next time an instance fails at 3 am on a Sunday, you won’t even notice.

Chaos Monkey has a configurable schedule that allows simulated failures to occur at times when they can be closely monitored. In this way, it’s possible to prepare for major unexpected errors rather than just waiting for catastrophe to strike and seeing how well you can manage.
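A schedule of this kind comes down to a check that runs before any failure is injected. The sketch below is an assumption-laden illustration rather than Chaos Monkey's actual configuration: the function name, the weekday-only rule, and the 09:00 to 15:00 window are hypothetical defaults chosen for the example.

```python
from datetime import datetime, time


def in_chaos_window(now, start=time(9, 0), end=time(15, 0),
                    weekdays=(0, 1, 2, 3, 4)):
    """Return True if `now` falls inside the monitored window.

    weekdays uses datetime.weekday() numbering (Monday is 0).
    The window is start <= t < end; the hours are illustrative only.
    """
    return now.weekday() in weekdays and start <= now.time() < end
```

A scheduler would call in_chaos_window(datetime.now()) and skip the run when it returns False, so no simulated failure lands at 3 am on a Sunday when nobody is watching.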

Chaos Monkey was the original member of Netflix’s Simian Army, a collection of software tools designed to test the AWS infrastructure. The software is open source and available on GitHub, so that other users of cloud services can adapt it for their own use. Other Simian Army members have been added to create failures and check for abnormal conditions, configurations and security issues.

An Agile and DevOps engineering culture doesn’t have a mechanism to force engineers to architect their code in any specific way. Instead, you can build strong alignment around resiliency by taking the pain of disappearing servers and bringing that pain forward.

Most people think this is a crazy idea, but you can’t depend on infrequent outages to change behavior. Knowing that failures will happen frequently creates strong alignment among engineers to build in the redundancy and automation needed to survive this type of incident without any impact on your customers.

Would your team be willing to implement their own Chaos Monkey?

Posted on Monday, August 07, 2017 by Henrico Dolfing