Five days ago, the internet had a conniption. In broad patches around the globe, YouTube sputtered. Shopify stores shut down. Snapchat blinked out. And millions of people couldn’t access their Gmail accounts. The disruptions all stemmed from Google Cloud, which suffered a prolonged outage that also prevented Google engineers from pushing a fix.
This article builds upon Vivek Rau’s chapter “Eliminating Toil” in Site Reliability Engineering: How Google Runs Production Systems. We begin by recapping Vivek’s definition of toil and Google’s approach to balancing operational work with engineering project work.

B. Beyer, C. Jones, J. Petoff, and N. Murphy, eds., Site Reliability Engineering (O’Reilly Media, 2016).