Chapter 3 - Embracing Risk - Site Reliability Engineering
Written by Marc Alvidrez
Edited by Kavita Guliani
You might expect Google to try to build 100% reliable services, ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost: maximizing stability limits how fast new features can be developed and how quickly products can be delivered to users, and dramatically increases their cost, which in turn reduces the number of features a team can afford to offer. Further, users typically don’t notice the difference between high reliability and extreme reliability in a service, because the user experience is dominated by less reliable components like the cellular network or the device they are working with. Put simply, a user on a 99% reliable smartphone cannot tell the difference between 99.99% and 99.999% service reliability! With this in mind, rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness (with features, service, and performance) is optimized.

Managing Risk

Unreliable systems can quickly erode users’ confidence, so we want to reduce the chance of system failure. However, experience shows that as we build systems, cost does not increase linearly as reliability increments: an incremental improvement in reliability may cost 100x more than the previous increment. The costliness has two dimensions:

- The cost of redundant machine/compute resources: the cost associated with redundant equipment that, for example, allows us to take systems offline for routine or unforeseen maintenance, or provides space for us to store parity code blocks that provide a minimum data durability guarantee.
- The opportunity cost: the cost borne by an organization when it allocates engineering resources to build systems or features that diminish risk instead of features that are directly visible to or usable by end users. These engineers no longer work on new features and products for end users.
In SRE, we manage service reliability largely by managing risk. We conceptualize risk as a continuum. We give equal importance to figuring out how to engineer greater reliability into Google systems and identifying the appropriate level of tolerance for the services we run. Doing so allows us to perform a cost/benefit analysis to determine, for example, where on the (nonlinear) risk continuum we should place Search, Ads, Gmail, or Photos. Our goal is to explicitly align the risk taken by a given service with the risk the business is willing to bear. We strive to make a service reliable enough, but no more reliable than it needs to be. That is, when we set an availability target of 99.99%, we want to exceed it, but not by much: that would waste opportunities to add features to the system, clean up technical debt, or reduce its operational costs. In a sense, we view the availability target as both a minimum and a maximum. The key advantage of this framing is that it unlocks explicit, thoughtful risk taking.

Measuring Service Risk

As standard practice at Google, we are often best served by identifying an objective metric to represent the property of a system we want to optimize. By setting a target, we can assess our current performance and track improvements or degradations over time. For service risk, it is not immediately clear how to reduce all of the potential factors into a single metric. Service failures can have many potential effects, including user dissatisfaction, harm, or loss of trust; direct or indirect revenue loss; brand or reputational impact; and undesirable press coverage. Clearly, some of these factors are very hard to measure. To make this problem tractable and consistent across many types of systems we run, we focus on unplanned downtime.

For most services, the most straightforward way of representing risk tolerance is in terms of the acceptable level of unplanned downtime. Unplanned downtime is captured by the desired level of service availability, usually expressed in terms of the number of "nines" we would like to provide: 99.9%, 99.99%, or 99.999% availability. Each additional nine corresponds to an order of magnitude improvement toward 100% availability. For serving systems, this metric is traditionally calculated based on the proportion of system uptime (see Time-based availability).
Time-based availability:

$$\text{availability} = \frac{\text{uptime}}{\text{uptime} + \text{downtime}}$$

Using this formula over the period of a year, we can calculate the acceptable number of minutes of downtime to reach a given number of nines of availability. For example, a system with an availability target of 99.99% can be down for up to 52.56 minutes in a year and stay within its availability target; see the Availability Table in Appendix A for a table.

At Google, however, a time-based metric for availability is usually not meaningful because we are looking across globally distributed services. Our approach to fault isolation makes it very likely that we are serving at least a subset of traffic for a given service somewhere in the world at any given time (i.e., we are at least partially "up" at all times). Therefore, instead of using metrics around uptime, we define availability in terms of the request success rate. Aggregate availability shows how this yield-based metric is calculated over a rolling window (i.e., proportion of successful requests over a one-day window).

Aggregate availability:

$$\text{availability} = \frac{\text{successful requests}}{\text{total requests}}$$

For example, a system that serves 2.5M requests in a day with a daily availability target of 99.99% can serve up to 250 errors and still hit its target for that given day.

In a typical application, not all requests are equal: failing a new user sign-up request is different from failing a request polling for new email in the background. In many cases, however, availability calculated as the request success rate over all requests is a reasonable approximation of unplanned downtime, as viewed from the end-user perspective.

Quantifying unplanned downtime as a request success rate also makes this availability metric more amenable for use in systems that do not typically serve end users directly. Most nonserving systems (e.g., batch, pipeline, storage, and transactional systems) have a well-defined notion of successful and unsuccessful units of work. Indeed, while the systems discussed in this chapter are primarily consumer and infrastructure serving systems, many of the same principles also apply to nonserving systems with minimal modification.
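The two availability definitions above, and the downtime and error allowances they imply, can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the chapter: the function names are hypothetical, a year is assumed to have 365 days, and `Decimal` is used only so that the figures quoted in the text come out exactly.

```python
from decimal import Decimal

# Assumption: a non-leap year of 365 days (525,600 minutes).
MINUTES_PER_YEAR = Decimal(365 * 24 * 60)


def time_based_availability(uptime: Decimal, downtime: Decimal) -> Decimal:
    """Time-based availability: uptime / (uptime + downtime)."""
    return uptime / (uptime + downtime)


def aggregate_availability(successful: int, total: int) -> Decimal:
    """Aggregate availability: successful requests / total requests."""
    return Decimal(successful) / Decimal(total)


def allowed_downtime_minutes(
    target: Decimal, period_minutes: Decimal = MINUTES_PER_YEAR
) -> Decimal:
    """Minutes of downtime permitted in the period for a given target."""
    return period_minutes * (1 - target)


def allowed_errors(target: Decimal, total_requests: int) -> Decimal:
    """Failed requests permitted in the window for a given target."""
    return Decimal(total_requests) * (1 - target)


if __name__ == "__main__":
    target = Decimal("0.9999")  # "four nines"
    print("downtime minutes per year:", allowed_downtime_minutes(target))
    print("errors per 2.5M requests:", allowed_errors(target, 2_500_000))
```

The same `aggregate_availability` function applies to nonserving systems if "requests" is read as "units of work", for instance records processed by a batch job.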
As an example of a nonserving system, consider a batch process that extracts, transforms, and inserts the contents of one of our customer databases into a data warehouse to enable further analysis, and that may be set to run periodically. Using a request success rate defined in terms of records successfully and unsuccessfully processed, we can calculate a useful availability metric despite the fact that the batch system does not run constantly.

Most often, we set quarterly availability targets for a service and track our performance against those targets on a weekly, or even daily, basis. This strategy lets us manage the service to a high-level availability objective by looking for, tracking down, and fixing meaningful deviations as they inevitably arise. See Service Level Objectives for more details.

Risk Tolerance of Services

What does it mean to identify the risk tolerance of a service? In a formal environment or in the case of safety-critical systems, the risk tolerance of services is typically built directly into the basic product or service definition. At Google, services’ risk tolerance tends to be less clearly defined.

To identify the risk tolerance of a service, SREs must work with the product owners to turn a set of business goals into explicit objectives to which we can engineer. In this case, the business goals we’re concerned about have a direct impact on the performance and reliability of the service offered. In practice, this translation is easier said than done. While consumer services often have clear product owners, it is unusual for infrastructure services (e.g., storage systems or a general-purpose HTTP caching layer) to have a similar structure of product ownership. We’ll discuss the consumer and infrastructure cases in turn.

Identifying the Risk Tolerance of Consumer Services

Our consumer services often have a product team that acts as the business owner for an application. For example, Search, Google Maps, and Google Docs each have their own product managers. These product managers are charged with understanding the users and the business, and with shaping the product for success in the marketplace. When a product team exists, that team is usually the best resource to discuss the reliability requirements for a service. In the absence of a dedicated product team, the engineers building the system often play this role either knowingly or unknowingly.
There are many factors to consider when assessing the risk tolerance of services, such as the following:

- What level of availability is required?
- Do different types of failures have different effects on the service?
- How can we use the service cost to help locate a service on the risk continuum?
- What other service metrics are important to take into account?

Target level of availability

The target level of availability for a given Google service usually depends on the function it provides and how the service is positioned in the marketplace. The following list includes issues to consider:

- What level of service will the users expect?
- Does this service tie directly to revenue (either our revenue, or our customers’ revenue)?
- Is this a paid service, or is it free?
- If there are competitors in the marketplace, what level of service do those competitors provide?
- Is this service targeted at consumers, or at enterprises?

Consider the requirements of Google Apps for Work. The majority of its users are enterprise users, some large and some small. These enterprises depend on Google Apps for Work services (e.g., Gmail, Calendar, Drive, Docs) to provide tools that enable their employees to perform their daily work. Stated another way, an outage for a Google Apps for Work service is an outage not only for Google, but also for all the enterprises that critically depend on us. For a typical Google Apps for Work service, we might set an external quarterly availability target of 99.9%, and back this target with a stronger internal availability target and a contract that stipulates penalties if we fail to deliver to the external target.

YouTube provides a contrasting set of considerations. When Google acquired YouTube, we had to decide on the appropriate availability target for the website. In 2006, YouTube was focused on consumers and was in a very different phase of its business lifecycle than Google was at the time. While YouTube already had a great product, it was still changing and growing rapidly. We set a lower availability target for YouTube than for our enterprise products because rapid feature development was correspondingly more important.
Types of failures

The expected shape of failures for a given service is another important consideration. How resilient is our business to service downtime? Which is worse for the service: a constant low rate of failures, or an occasional full-site outage? Both types of failure may result in the same absolute number of errors, but may have vastly different impacts on the business.

An illustrative example of the difference between full and partial outages naturally arises in systems that serve private information. Consider a contact management application, and the difference between intermittent failures that cause profile pictures to fail to render, versus a failure case that results in a user’s private contacts being shown to another user. The first case is clearly a poor user experience, and SREs would work to remediate the problem quickly. In the second case, however, the risk of exposing private data could easily undermine basic user trust in a significant way. As a result, taking down the service entirely would be appropriate during the debugging and potential clean-up phase for the second case.

At the other end of services offered by Google, it is sometimes acceptable to have regular outages during maintenance windows. A number of years ago, the Ads Frontend used to be one such service. It is used by advertisers and website publishers to set up, configure, run, and monitor their advertising campaigns. Because most of this work takes place during normal business hours, we determined that occasional, regular, scheduled outages in the form of maintenance windows would be acceptable, and we counted these scheduled outages as planned downtime, not unplanned downtime.

Cost

Cost is often the key factor in determining the appropriate availability target for a service. Ads is in a particularly good position to make this trade-off because request successes and failures can be directly translated into revenue gained or lost. In determining the availability target for each service, we ask questions such as:

- If we were to build and operate these systems at one more nine of availability, what would our incremental increase in revenue be?
- Does this additional revenue offset the cost of reaching that level of reliability?
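The two questions above reduce to a simple comparison once reliability can be translated into revenue. Here is a minimal sketch under two stated assumptions: every request carries equal value, and revenue scales linearly with availability. The function names and the sample figures in the usage section are hypothetical, chosen only for illustration.

```python
from decimal import Decimal


def value_of_improved_availability(
    revenue: Decimal, current_target: Decimal, proposed_target: Decimal
) -> Decimal:
    """Incremental revenue from raising the availability target.

    Assumes equal-value requests, so revenue gained is proportional
    to the increase in the fraction of successful requests.
    """
    return revenue * (proposed_target - current_target)


def worth_investing(
    cost: Decimal,
    revenue: Decimal,
    current_target: Decimal,
    proposed_target: Decimal,
) -> bool:
    """True when the improvement costs less than the revenue it adds."""
    gain = value_of_improved_availability(revenue, current_target, proposed_target)
    return cost < gain


if __name__ == "__main__":
    # Hypothetical service: $2M revenue, target raised from 99.5% to 99.9%.
    revenue = Decimal("2000000")
    gain = value_of_improved_availability(revenue, Decimal("0.995"), Decimal("0.999"))
    print("value of improvement:", gain)
    print("worth it at $5,000?", worth_investing(Decimal("5000"), revenue,
                                                 Decimal("0.995"), Decimal("0.999")))
```

Services whose requests differ widely in value would need a weighted version of this calculation; the sketch deliberately ignores that.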
To make this trade-off equation more concrete, consider the following cost/benefit for an example service where each request has equal value:

- Proposed improvement in availability target: 99.9% → 99.99%
- Proposed increase in availability: 0.09%
- Service revenue: $1M
- Value of improved availability: $1M * 0.0009 = $900

In this case, if the cost of improving availability by one nine is less than $900, it is worth the investment. If the cost is greater than $900, the costs will exceed the projected increase in revenue.

It may be harder to set these targets when we do not have a simple translation function between reliability and revenue. One useful strategy may be to consider the background error rate of ISPs on the Internet. If failures are being measured from the end-user perspective and it is possible to drive the error rate for the service below the background error rate, those errors will fall within the noise for a given user’s Internet connection. While there are significant differences between ISPs and protocols (e.g., TCP versus UDP, IPv4 versus IPv6), we’ve measured the typical background error rate for ISPs as falling between 0.01% and 1%.

Other service metrics

Examining the risk tolerance of services in relation to metrics besides availability is often fruitful. Understanding which metrics are important and which metrics aren’t important provides us with degrees of freedom when attempting to take thoughtful risks.

Service latency for our Ads systems provides an illustrative example. When Google first launched Web Search, one of the service’s key distinguishing features was speed. When we introduced AdWords, which displays advertisements next to search results, a key requirement of the system was that the ads should not slow down the search experience. This requirement has driven the engineering goals in each generation of AdWords systems and is treated as an invariant.
AdSense, Google’s ads system that serves contextual ads in response to requests from JavaScript code that publishers insert into their websites, has a very different latency goal. The latency goal for AdSense is to avoid slowing down the rendering of the third-party page when inserting contextual ads. The specific latency target, then, is dependent on the speed at which a given publisher’s page renders. This means that AdSense ads can generally be served hundreds of milliseconds slower than AdWords ads.

This looser serving latency requirement has allowed us to make many smart trade-offs in provisioning (i.e., determining the quantity and locations of serving resources we use), which save us substantial cost over naive provisioning. In other words, given the relative insensitivity of the AdSense service to moderate changes in latency performance, we are able to consolidate serving into fewer geographical locations, reducing our operational overhead.

Identifying the Risk Tolerance of Infrastructure Services

The requirements for building and running infrastructure components differ from the requirements for consumer products in a number of ways. A fundamental difference is that, by definition, infrastructure components have multiple clients, often with varying needs.

Target level of availability

Consider Bigtable [Cha06], a massive-scale distributed storage system for structured data. Some consumer services serve data directly from Bigtable in the path of a user request. Such services need low latency and high reliability. Other teams use Bigtable as a repository for data that they use to perform offline analysis (e.g., MapReduce) on a regular basis. These teams tend to be more concerned about throughput than reliability. Risk tolerance for these two use cases is quite distinct.

One approach to meeting the needs of both use cases is to engineer all infrastructure services to be ultra-reliable. Given the fact that these infrastructure services also tend to aggregate huge amounts of resources, such an approach is usually far too expensive in practice. To understand the different needs of the different types of users, you can look at the desired state of the request queue for each type of Bigtable user.
Types of failures

The low-latency user wants Bigtable’s request queues to be (almost always) empty so that the system can process each outstanding request immediately upon arrival. (Indeed, inefficient queuing is often a cause of high tail latency.) The user concerned with offline analysis is more interested in system throughput, so that user wants request queues to never be empty. To optimize for throughput, the Bigtable system should never need to idle while waiting for its next request.

As you can see, success and failure are antithetical for these sets of users. Success for the low-latency user is failure for the user concerned with offline analysis.

Cost

One way to satisfy these competing constraints in a cost-effective manner is to partition the infrastructure and offer it at multiple independent levels of service. In the Bigtable example, we can build two types of clusters: low-latency clusters and throughput clusters. The low-latency clusters are designed to be operated and used by services that need low latency and high reliability. To ensure short queue lengths and satisfy more stringent client isolation requirements, the Bigtable system can be provisioned with a substantial amount of slack capacity for reduced contention and increased redundancy. The throughput clusters, on the other hand, can be provisioned to run very hot and with less redundancy, optimizing throughput over latency. In practice, we are able to satisfy these relaxed needs at a much lower cost, perhaps as little as 10–50% of the cost of a low-latency cluster. Given Bigtable’s massive scale, this cost savings becomes significant very quickly.
The key strategy with regard to infrastructure is to deliver services with explicitly delineated levels of service, thus enabling the clients to make the right risk and cost trade-offs when building their systems. With explicitly delineated levels of service, the infrastructure providers can effectively externalize the difference in the cost it takes to provide service at a given level to clients. Exposing cost in this way motivates the clients to choose the level of service with the lowest cost that still meets their needs. For example, Google+ can decide to put data critical to enforcing user privacy in a high-availability, globally consistent datastore (e.g., a globally replicated SQL-like system like Spanner [Cor12]), while putting optional data (data that isn’t critical, but that enhances the user experience) in a cheaper, less reliable, less fresh, and eventually consistent datastore (e.g., a NoSQL store with best-effort replication like Bigtable).

Note that we can run multiple classes of services using identical hardware and software. We can provide vastly different service guarantees by adjusting a variety of service characteristics, such as the quantities of resources, the degree of redundancy, the geographical provisioning constraints, and, critically, the infrastructure software configuration.

Example: Frontend infrastructure

To demonstrate that these risk-tolerance assessment principles do not just apply to storage infrastructure, let’s look at another large class of service: Google’s frontend infrastructure. The frontend infrastructure consists of reverse proxy and load balancing systems running close to the edge of our network. These are the systems that, among other things, serve as one endpoint of the connections from end users (e.g., terminate TCP from the user’s browser). Given their critical role, we engineer these systems to deliver an extremely high level of reliability. While consumer services can often limit the visibility of unreliability in backends, these infrastructure systems are not so lucky. If a request never makes it to the application service frontend server, it is lost.

We’ve explored the ways to identify the risk tolerance of both consumer and infrastructure services. Now, we’ll discuss using that tolerance level to manage unreliability via error budgets.
Motivation for Error Budgets [14]

Written by Mark Roth
Edited by Carmela Quinito

Other chapters in this book discuss how tensions can arise between product development teams and SRE teams, given that they are generally evaluated on different metrics. Product development performance is largely evaluated on product velocity, which creates an incentive to push new code as quickly as possible. Meanwhile, SRE performance is (unsurprisingly) evaluated based upon reliability of a service, which implies an incentive to push back against a high rate of change. Information asymmetry between the two teams further amplifies this inherent tension. The product developers have more visibility into the time and effort involved in writing and releasing their code, while the SREs have more visibility into the service’s reliability (and the state of production in general).

These tensions often reflect themselves in different opinions about the level of effort that should be put into engineering practices. The following list presents some typical tensions:

- Software fault tolerance: How hardened do we make the software to unexpected events? Too little, and we have a brittle, unusable product. Too much, and we have a product no one wants to use (but that runs very stably).
- Testing: Again, not enough testing and you have embarrassing outages, privacy data leaks, or a number of other press-worthy events. Too much testing, and you might lose your market.
- Push frequency: Every push is risky. How much should we work on reducing that risk, versus doing other work?
- Canary duration and size: It’s a best practice to test a new release on some small subset of a typical workload, a practice often called canarying. How long do we wait, and how big is the canary?

Usually, preexisting teams have worked out some kind of informal balance between them as to where the risk/effort boundary lies. Unfortunately, one can rarely prove that this balance is optimal, rather than just a function of the negotiating skills of the engineers involved. Nor should such decisions be driven by politics, fear, or hope. (Indeed, Google SRE’s unofficial motto is "Hope is not a strategy.") Instead, our goal is to define an objective metric, agreed upon by both sides, that can be used to guide the negotiations in a reproducible way. The more data-based the decision can be, the better.
Forming Your Error Budget

In order to base these decisions on objective data, the two teams jointly define a quarterly error budget based on the service’s service level objective, or SLO (see Service Level Objectives). The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow.

Our practice is then as follows:

- Product Management defines an SLO, which sets an expectation of how much uptime the service should have per quarter.
- The actual uptime is measured by a neutral third party: our monitoring system.
- The difference between these two numbers is the "budget" of how much "unreliability" is remaining for the quarter.
- As long as the uptime measured is above the SLO (in other words, as long as there is error budget remaining), new releases can be pushed.

For example, imagine that a service’s SLO is to successfully serve 99.999% of all queries per quarter. This means that the service’s error budget is a failure rate of 0.001% for a given quarter. If a problem causes us to fail 0.0002% of the expected queries for the quarter, the problem spends 20% of the service’s quarterly error budget.

Benefits

The main benefit of an error budget is that it provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability.

Many products use this control loop to manage release velocity: as long as the system’s SLOs are met, releases can continue. If SLO violations occur frequently enough to expend the error budget, releases are temporarily halted while additional resources are invested in system testing and development to make the system more resilient, improve its performance, and so on. More subtle and effective approaches are available than this simple on/off technique [15]: for instance, slowing down releases or rolling them back when the SLO-violation error budget is close to being used up.
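The error budget arithmetic and the release control loop described above can be sketched in a few lines. This is an illustrative sketch rather than Google's tooling: the function names are hypothetical, and the 80% threshold at which releases slow down is an assumption made for the example, since the text only says "close to being used up".

```python
from decimal import Decimal


def error_budget(slo: Decimal) -> Decimal:
    """Allowed failure rate for the period: 1 - SLO."""
    return 1 - slo


def budget_spent(slo: Decimal, failure_rate: Decimal) -> Decimal:
    """Fraction of the error budget consumed by the observed failure rate."""
    return failure_rate / error_budget(slo)


def release_decision(
    spent: Decimal, slow_down_at: Decimal = Decimal("0.8")
) -> str:
    """Graduated alternative to a simple on/off release gate.

    slow_down_at is an assumed threshold, not a value from the text.
    """
    if spent >= 1:
        return "halt"      # budget exhausted: stop releases
    if spent >= slow_down_at:
        return "slow"      # budget nearly used up: reduce release velocity
    return "proceed"       # budget remaining: releases can continue


if __name__ == "__main__":
    slo = Decimal("0.99999")            # 99.999% of queries served
    failures = Decimal("0.000002")      # 0.0002% of queries failed
    spent = budget_spent(slo, failures)
    print("budget spent:", spent)
    print("decision:", release_decision(spent))
```

Dropping the middle branch of `release_decision` recovers the simple on/off gate; the graduated version trades a little complexity for fewer abrupt freezes.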
Suppose product development wants to skimp on testing or increase push velocity and SRE is resistant: the error budget guides the decision. When the budget is large, the product developers can take more risks. When the budget is nearly drained, the product developers themselves will push for more testing or slower push velocity, as they don’t want to risk using up the budget and stalling their launch. In effect, the product development team becomes self-policing. They know the budget and can manage their own risk. (Of course, this outcome relies on an SRE team having the authority to actually stop launches if the SLO is broken.)

What happens if a network outage or datacenter failure reduces the measured SLO? Such events also eat into the error budget. As a result, the number of new pushes may be reduced for the remainder of the quarter. The entire team supports this reduction because everyone shares the responsibility for uptime.

The budget also helps to highlight some of the costs of overly high reliability targets, in terms of both inflexibility and slow innovation. If the team is having trouble launching new features, they may elect to loosen the SLO (thus increasing the error budget) in order to increase innovation.

Key Insights

- Managing service reliability is largely about managing risk, and managing risk can be costly.
- 100% is probably never the right reliability target: not only is it impossible to achieve, it’s typically more reliability than a service’s users want or notice. Match the profile of the service to the risk the business is willing to take.
- An error budget aligns incentives and emphasizes joint ownership between SRE and product development. Error budgets make it easier to decide the rate of releases and to effectively defuse discussions about outages with stakeholders, and allow multiple teams to reach the same conclusion about production risk without rancor.

[14] An early version of this section appeared as an article in ;login: (August 2015, vol. 40, no. 4).
[15] Known as "bang/bang" control; see https://en.wikipedia.org/wiki/Bang–bang_control.