Thursday, November 30, 2017

DevOps Case Study: Netflix and the Chaos Monkey

After some discussion of Netflix and the Chaos Monkey on our DevOps blog, I thought I would offer some detail on how Chaos Monkey and the Simian Army work. It's a great case study, posted on April 30th by C. Aaron Cois of the Software Engineering Institute (SEI) at CMU. I did not think to discuss it until it was brought up. Maybe next semester, we'll start with it.

Anyway, Netflix's streaming service is a large distributed system hosted on Amazon Web Services (AWS). Since there are so many components that have to work together to provide reliable video streams to customers across a wide range of devices, Netflix engineers needed to focus heavily on the quality attributes of reliability and robustness for both server- and client-side components. In short, they concluded that the only way to be comfortable handling failure is to constantly practice failing. To achieve the desired level of confidence and quality, in true DevOps style, Netflix engineers set about automating failure.

You may have noticed that while the software is impressively reliable, the available streams of videos occasionally change. Sometimes the 'Recommended Picks' stream does not appear, for example. When this happens, it is because the service in AWS that serves the 'Recommended Picks' data is down. However, your Netflix application doesn't crash, it doesn't throw any errors, and it doesn't suffer from any degradation in performance. The Netflix software merely omits the stream, or displays an alternate stream, without hindering the user's experience, thus exhibiting ideal, elegant failure behavior.
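To make that failure behavior concrete, here is a minimal Python sketch of the idea. It is not Netflix's code: the function `build_home_screen`, the `fallbacks` parameter, and the stream names are all hypothetical, and each back-end service is stood in for by a plain callable.

```python
def build_home_screen(stream_sources, fallbacks=None):
    """Assemble the rows of a home screen from independent back-end services.

    stream_sources maps a stream name to a zero-argument callable that
    fetches its titles (a hypothetical stand-in for a remote service call).
    fallbacks optionally maps a stream name to an alternate list of titles.
    """
    fallbacks = fallbacks or {}
    rows = []
    for name, fetch in stream_sources.items():
        try:
            rows.append((name, fetch()))
        except Exception:
            # The service behind this stream is down. Show the alternate
            # stream if one exists; otherwise silently omit the row.
            if name in fallbacks:
                rows.append((name, fallbacks[name]))
    return rows


def recommended_picks_down():
    """Simulate an outage of the (hypothetical) recommendation service."""
    raise ConnectionError("recommendation service unavailable")
```

Calling `build_home_screen` with `recommended_picks_down` as one of the sources returns the remaining rows instead of raising, which is the "omit the stream, or display an alternate stream" behavior described above.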

To achieve this result, Netflix dramatically altered their engineering process by introducing a tool called Chaos Monkey, the first in a series of tools collectively known as the Netflix Simian Army. Chaos Monkey is basically a script that runs continually in all Netflix environments, causing chaos by randomly shutting down server instances. Thus, while writing code, Netflix developers are constantly operating in an environment of unreliable services and unexpected outages. This chaos not only gives developers a unique opportunity to test their software in unexpected failure conditions, but also gives them an incentive to build fault-tolerant systems, since doing so makes their day-to-day job less frustrating.
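The core of "randomly shutting down server instances" can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the real Chaos Monkey: instances are plain strings in a list, nothing is actually terminated in AWS, and the names `unleash_chaos_monkey` and `kill_probability` are made up for the example.

```python
import random


def unleash_chaos_monkey(instances, kill_probability, rng=None):
    """Randomly pick instances to shut down in one simulated pass.

    instances is a list of instance identifiers (hypothetical names).
    kill_probability is the chance, between 0 and 1, that any single
    instance is chosen for termination in this pass.
    Returns a (survivors, terminated) pair of lists.
    """
    rng = rng or random.Random()
    survivors, terminated = [], []
    for instance in instances:
        # rng.random() is uniform on [0, 1), so each instance is
        # terminated independently with the given probability.
        if rng.random() < kill_probability:
            terminated.append(instance)
        else:
            survivors.append(instance)
    return survivors, terminated


if __name__ == "__main__":
    fleet = ["api-1", "api-2", "recs-1", "recs-2", "playback-1"]
    alive, killed = unleash_chaos_monkey(fleet, kill_probability=0.2)
    print("terminated:", killed)
    print("still running:", alive)
```

Running a pass like this on a schedule against a development fleet is what forces every service to cope with its dependencies disappearing without warning.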

This is DevOps at its finest: altering the development process and using automation to set up a system where the behavioral economics favors producing a desirable level of software quality. Because they create software in this type of environment, Netflix developers design their systems to be modular, testable, and highly resilient against back-end service outages from the start.
