Example Error Budget Policy - Site Reliability Engineering
Status: Published
Author: Steven Thurgood
Date: 2018-02-19
Reviewers: David Ferguson
Approvers: Betsy Beyer
Approval Date: 2018-02-20
Revisit Date: 2019-02-01

Service Overview

The Example Game Service allows Android and iPhone users to play a game with each other. New releases of the backend code are pushed daily. New releases of clients are pushed weekly. This policy applies both to backend and client releases.

Goals

The goals of this policy are to:
- Protect customers from repeated SLO misses
- Provide an incentive to balance reliability with other features

Non-Goals

This policy is not intended to serve as a punishment for missing SLOs. Halting change is undesirable; this policy gives teams permission to focus exclusively on reliability when data indicates that reliability is more important than other product features.

SLO Miss Policy

If the service is performing at or above its SLO, then releases (including data changes) will proceed according to the release policy.

If the service has exceeded its error budget for the preceding four-week window, we will halt all changes and releases other than P0 [1] issues or security fixes until the service is back within its SLO.
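The two rules above amount to a simple gate on each proposed change. The following is a minimal sketch, not part of the policy itself: the function name `release_decision`, its parameters, and the `"proceed"`/`"halt"` return values are all hypothetical, and it assumes the budget and the errors for the preceding four-week window are already available as counts.

```python
def release_decision(errors_in_window, error_budget, is_p0_or_security_fix=False):
    """Decide whether a proposed change may go out (hypothetical helper).

    errors_in_window: errors counted over the preceding four-week window.
    error_budget: errors allowed by the SLO over the same window.
    is_p0_or_security_fix: True for the changes the policy exempts from a halt.
    """
    # At or above the SLO: the budget is not exceeded, so releases proceed.
    if errors_in_window <= error_budget:
        return "proceed"
    # Budget exceeded: only P0 issues and security fixes may still ship.
    return "proceed" if is_p0_or_security_fix else "halt"


print(release_decision(900, 1000))
print(release_decision(1200, 1000))
print(release_decision(1200, 1000, is_p0_or_security_fix=True))
```

Under these assumptions, the first call returns "proceed", the second "halt", and the third "proceed" because the change is exempt.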

Depending upon the cause of the SLO miss, the team may devote additional resources to working on reliability instead of feature work.

The team must work on reliability if:
- A code bug or procedural error caused the service itself to exceed the error budget.
- A postmortem reveals an opportunity to soften a hard dependency.
- Miscategorized errors fail to consume budget that would have caused the service to miss its SLO.

The team may continue to work on non-reliability features if:
- The outage was caused by a company-wide networking problem.
- The outage was caused by a service maintained by another team, who have themselves frozen releases to address their reliability issues.
- The error budget was consumed by users out of scope for the SLO (e.g., load tests or penetration testers).
- Miscategorized errors consume budget even though no users were impacted.

Outage Policy

If a single incident consumes more than 20% of error budget over four weeks, then the team must conduct a postmortem. The postmortem must contain at least one P0 action item to address the root cause.

If a single class of outage consumes more than 20% of error budget over a quarter, the team must have a P0 item on their quarterly planning document [2] to address the issues in the following quarter.

Escalation Policy

In the event of a disagreement between parties regarding the calculation of the error budget or the specific actions it defines, the issue should be escalated to the CTO to make a decision.

Background

Note: This section is boilerplate, intended to give a succinct overview of error budgets to those unfamiliar with them.

Error budgets are the tool SRE uses to balance service reliability with the pace of innovation. Changes are a major source of instability, representing roughly 70% of our outages, and development work for features competes with development work for stability. The error budget forms a control mechanism for diverting attention to stability as needed.

An error budget is 1 minus the SLO of the service. A 99.9% SLO service has a 0.1% error budget.

If our service receives 1,000,000 requests in four weeks, a 99.9% availability SLO gives us a budget of 1,000 errors over that period.

[1] P0 is the highest priority of bug: all hands on deck; drop everything else until this is fixed.
[2] At Google, quarterly planning is public, and teams are held accountable to their plans.
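
The budget arithmetic in the Background section and the 20% single-incident threshold in the Outage Policy can be checked with a few lines of Python. This is an illustrative sketch only: the helper names `error_budget` and `postmortem_required` are hypothetical, the SLO is assumed to be request-based, and the SLO is passed as a decimal string so that `fractions.Fraction` keeps the arithmetic exact (binary floats would give 1000.0000000000009 rather than 1000).

```python
from fractions import Fraction


def error_budget(slo, total_requests):
    """Errors allowed in the window: (1 - SLO) * requests.

    slo is a decimal string such as "0.999" so the result is exact.
    """
    return (1 - Fraction(slo)) * total_requests


def postmortem_required(incident_errors, budget, threshold=Fraction(1, 5)):
    """True if one incident consumed more than 20% of the four-week budget."""
    return incident_errors > threshold * budget


# The worked example from the text: 1,000,000 requests at a 99.9% SLO.
budget = error_budget("0.999", 1_000_000)
print(budget)                            # 1000
print(postmortem_required(250, budget))  # True: 250 is more than 20% of 1,000
print(postmortem_required(200, budget))  # False: exactly 20% is not "more than"
```

The strict comparison mirrors the policy wording "more than 20%"; a team that wants the boundary case to trigger a postmortem would need to change the policy text and not just the code.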