Example Error Budget Policy - Site Reliability Engineering


Status: Published
Author: Steven Thurgood
Date: 2018-02-19
Reviewers: David Ferguson
Approvers: Betsy Beyer
Approval Date: 2018-02-20
Revisit Date: 2019-02-01

Service Overview

The Example Game Service allows Android and iPhone users to play a game with each other. New releases of the backend code are pushed daily. New releases of clients are pushed weekly. This policy applies both to backend and client releases.

Goals

The goals of this policy are to:

- Protect customers from repeated SLO misses
- Provide an incentive to balance reliability with other features

Non-Goals

This policy is not intended to serve as a punishment for missing SLOs. Halting change is undesirable; this policy gives teams permission to focus exclusively on reliability when data indicates that reliability is more important than other product features.

SLO Miss Policy

If the service is performing at or above its SLO, then releases (including data changes) will proceed according to the release policy.

If the service has exceeded its error budget for the preceding four-week window, we will halt all changes and releases other than P0 [1] issues or security fixes until the service is back within its SLO.
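The release rule in the two paragraphs above can be sketched as a small gate function. This is a minimal illustration, not part of the policy itself: the `WindowStats` type, the function names, and the `is_p0` / `is_security_fix` flags are hypothetical, and it assumes an availability SLO measured as the fraction of successful requests over the preceding four-week window.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class WindowStats:
    """Request counts for the preceding four-week window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def budget_exceeded(stats: WindowStats, slo: float) -> bool:
    """Return True when failures in the window exceed the error budget."""
    # Decimal avoids float rounding noise in (1 - SLO) * requests.
    allowed_errors = (Decimal(1) - Decimal(str(slo))) * stats.total_requests
    return stats.failed_requests > allowed_errors


def release_allowed(stats: WindowStats, slo: float, *,
                    is_p0: bool = False,
                    is_security_fix: bool = False) -> bool:
    """Apply the SLO miss policy to a single proposed change or release."""
    if not budget_exceeded(stats, slo):
        # At or above SLO: the normal release policy applies.
        return True
    # Budget exhausted: only P0 issues and security fixes may ship.
    return is_p0 or is_security_fix
```

Under these assumptions, a service with a 99.9% SLO, 1,000,000 requests, and 1,001 failures in the window would block an ordinary release but still let a P0 fix through.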
Depending upon the cause of the SLO miss, the team may devote additional resources to working on reliability instead of feature work.

The team must work on reliability if:

- A code bug or procedural error caused the service itself to exceed the error budget.
- A postmortem reveals an opportunity to soften a hard dependency.
- Miscategorized errors fail to consume budget that would have caused the service to miss its SLO.

The team may continue to work on non-reliability features if:

- The outage was caused by a company-wide networking problem.
- The outage was caused by a service maintained by another team, who have themselves frozen releases to address their reliability issues.
- The error budget was consumed by users out of scope for the SLO (e.g., load tests or penetration testers).
- Miscategorized errors consume budget even though no users were impacted.

Outage Policy

If a single incident consumes more than 20% of error budget over four weeks, then the team must conduct a postmortem. The postmortem must contain at least one P0 action item to address the root cause.

If a single class of outage consumes more than 20% of error budget over a quarter, the team must have a P0 item on their quarterly planning document [2] to address the issues in the following quarter.

Escalation Policy

In the event of a disagreement between parties regarding the calculation of the error budget or the specific actions it defines, the issue should be escalated to the CTO to make a decision.

Background

Note: This section is boilerplate, intended to give a succinct overview of error budgets to those unfamiliar with them.

Error budgets are the tool SRE uses to balance service reliability with the pace of innovation. Changes are a major source of instability, representing roughly 70% of our outages, and development work for features competes with development work for stability. The error budget forms a control mechanism for diverting attention to stability as needed.

An error budget is 1 minus the SLO of the service. A 99.9% SLO service has a 0.1% error budget.

If our service receives 1,000,000 requests in four weeks, a 99.9% availability SLO gives us a budget of 1,000 errors over that period.

[1] P0 is the highest priority of bug: all hands on deck; drop everything else until this is fixed.
[2] At Google, quarterly planning is public, and teams are held accountable to their plans.
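The budget arithmetic from the Background section and the 20% threshold from the Outage Policy can be checked with a few lines of code. This is a rough sketch under stated assumptions: the function names `error_budget` and `needs_postmortem` are hypothetical, and it assumes the budget is counted in failed requests, as in the 1,000,000-request example.

```python
from decimal import Decimal


def error_budget(slo: float, total_requests: int) -> Decimal:
    """Allowed errors in the window: (1 - SLO) * total requests."""
    # Decimal keeps 1 - 0.999 exactly equal to 0.001.
    return (Decimal(1) - Decimal(str(slo))) * total_requests


def needs_postmortem(incident_errors: int, budget: Decimal,
                     threshold: Decimal = Decimal("0.20")) -> bool:
    """True when one incident consumed more than `threshold` of the budget."""
    return Decimal(incident_errors) > threshold * budget


# Figures from the Background section: 99.9% SLO, 1,000,000 requests.
budget = error_budget(0.999, 1_000_000)
postmortem_required = needs_postmortem(250, budget)
```

With those figures the budget is 1,000 errors, so a single incident that causes more than 200 errors in the four-week window crosses the 20% threshold and requires a postmortem.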


