Error Budgets Explained (And How to Make One for Your Team)

2024-09-21


Noor-ul-Anam Ruqayya | 6/2/2021

Wondering what error budgets (EBs) are and how they are useful? We explain what they are, how they are defined, and how they can help your team.

What is an Error Budget?

An error budget is the amount of acceptable unreliability a service can have before customer happiness is impacted. If a service is well within its budget, the developers can take more risks in their releases. If not, developers need to make safer choices.

In complex systems, failures are inevitable, and error budgets normalize failure as a part of the development process. They also bridge the gap between the development and operations teams by reducing organizational silos and sharing ownership of reliability targets.

Site Reliability Engineering (SRE)

To understand error budgets, you need to understand SRE. Site Reliability Engineering (SRE) is a set of practices employed by tech giants such as Google, Netflix, and LinkedIn to continuously improve the reliability of their services.
SRE practices include monitoring, alerting, incident response, postmortems, testing, capacity planning, and development to make a service reliable. The following illustration lists SRE practices that make a service reliable, from basic (at the bottom of the pyramid) to advanced:

[Illustration: pyramid of SRE practices]

SRE Team and The Error Budget

The SRE team comprises software engineers who build software to improve the reliability of their systems. These software engineers are dedicated to improving the reliability of software in production. Some examples of manual work that SRE teams do outside of building software are fixing bugs, responding to incidents, and working on-call.

SRE teams work with development teams to set EBs and EB policies. The error budget burndown provides solid data to the development and SRE teams on how to set release velocity. For example, suppose the EB is 2 hours of downtime in a 28-day period, and two incidents have already caused about 1.5 hours of downtime in the first 5 days. The remaining budget of roughly 30 minutes for the other 23 days indicates whether the development team should slow down and spend more time on testing and improving reliability.

What is The Purpose of An Error Budget?

All tech companies share the same goal: innovation. They aim to be better tomorrow than they are today, keep growing at a constant pace, and make the world a better place.

When you constantly change and improve the existing product, you will come across or even trigger system failures. With complex systems, pursuing perfection is fruitless. Failure is inevitable at some point, and the best you can do is be prepared for it.

To make mistakes is human, and the purpose of an error budget is to make mistakes without getting caught by your customers! Error budgets help companies make data-driven decisions on how to balance new features and reliability.

SLI, SLO, and SLA

The notion of SRE revolves around the idea that metrics should be tied to business objectives. Three primary tools are utilized in SRE planning and work: SLIs, SLOs, and SLAs. Without them, you cannot measure your system's reliability, availability, and usefulness.

SLO (Service Level Objective)

An SLO is an agreed-upon objective about how reliable a service should be. It's the minimum reliability you need to keep your customers happy. In SRE terms, the SLO is the numerical target value for a service level indicator, most commonly system availability.

SLA (Service Level Agreement)

An SLA is a formal agreement between customer and service provider that covers the repercussions of failure. It specifies what you will do (a partial refund or discount) if your service is not as reliable as it claims to be.

Every service provider needs an SLA when a financial penalty is involved. Devising an SLA requires a good understanding of business and legal terms in order to decide appropriate penalties and consequences for an agreement breach. So SLAs are typically set by legal and finance teams, not SRE teams.

Since the SLO is an internal objective, it is more stringent than the SLA. For example, an SLA of 99.9% over a month could require an internal SLO of 99.95%. By using a tighter internal SLO in lieu of an SLA to measure reliability, companies get a chance to react and take proactive measures to avoid breaking the agreement.

SLI (Service Level Indicator)

An SLI is a metric that defines the health of a service over time and is used to determine whether the SLOs are met. Selecting the right SLI is about understanding what the user expects from the service. You don't want to use every metric you can track in the monitoring system. In fact, choosing too many SLIs can make it difficult to pay attention to the right metrics.

An SLI reflects a snapshot of the current service reliability. If the SLI drops below a certain point, the service provider needs to take appropriate action in order to increase availability.

An SLI is generally the ratio of two numbers: Total Good Events and Total Events.

SLI = (Total Good Events / Total Events) x 100

For example:

Number of successful HTTP requests / total HTTP requests
Number of HTTP requests that completed successfully within 200 ms / total HTTP requests

An SRE tries to solve the reliability issue in three ways:

Defining availability
Finding the appropriate level of availability that the service needs
Creating a plan to deal with failures of availability

The three metrics must be communicated across every level of the organization. That includes everyone from developers to SREs and VPs. Only by having a shared goal can you make the product better than ever.

Now, let's take a look at how the three acronyms (SLO, SLA, and SLI) work together and in conjunction with the error budget.
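
As a rough sketch of that interplay, the snippet below computes an SLI from the good-events ratio defined above, checks it against an SLO, and reports how much of the error budget is left. The function names and the request counts are hypothetical, chosen only for illustration.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """SLI = (Total Good Events / Total Events) x 100."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events * 100


def budget_remaining_percent(good_events: int, total_events: int,
                             slo_percent: float) -> float:
    """Unspent share of the error budget, in percent (negative if overspent)."""
    if not 0 <= slo_percent < 100:
        raise ValueError("slo_percent must be in [0, 100)")
    # Error budget expressed as a number of events that are allowed to fail.
    allowed_bad = total_events * (100 - slo_percent) / 100
    actual_bad = total_events - good_events
    return (1 - actual_bad / allowed_bad) * 100


# Hypothetical traffic: 1,000,000 requests so far, 995,000 of them good.
sli = sli_percent(995_000, 1_000_000)
print(f"SLI: {sli:.2f}%")
print(f"SLO of 99% met: {sli >= 99.0}")
print(f"Budget left: {budget_remaining_percent(995_000, 1_000_000, 99.0):.0f}%")
```

With these made-up numbers the SLI is 99.5%, which meets a 99% SLO while leaving about half of the error budget unspent.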

Suppose a payment service has an SLA of 98%; then the SLO must be higher. With an SLO of 99% availability, the error budget would be 1%. That 1% in a 28-day window is 0.28 days, or about 6.7 hours, of downtime. Now, after 15 days, if the SLI is 99.5%, then you're meeting your SLO and within your EB. If the SLI dips below 99%, then you've used up all of your EB and are no longer meeting the SLO.

How is an Error Budget Determined and by Whom?

The error budget sets the appropriate level of reliability that the service's customers should expect. Within the budget, the users are likely to be happy (as long as they're satisfied with the service). If you burn all your error budget, customers are likely to start complaining and be unhappy with the service.

The number of nines reflects the availability of the service. When someone mentions four nines (99.99%) of availability, it means that it is acceptable for the service to be down for only 52 minutes and 35 seconds a year, which is the EB for the year. A common misconception is that the EB allocated will be consumed in one contiguous chunk with a single incident, causing executives' concerns about customers' experience. While this scenario may happen, more frequently, EBs are consumed in small portions throughout the month or year.

More nines mean higher uptime and a smaller EB. The same idea applies to SLOs that are not about availability. For example, if you define a rule in your SLO specifying that the system will respond in under 600 ms 99 times out of 100, then every request slower than 600 ms counts against the budget, and the system is treated as down once more than 1 request in 100 is that slow.
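
To make the arithmetic above concrete, here is a small sketch that converts an availability SLO into allowed downtime for a given window. The helper name `allowed_downtime` is made up for this example, and a year is assumed to be 365.25 days, which is the assumption that yields the 52 minutes 35 seconds figure for four nines.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Error budget expressed as time: (1 - SLO) x window length."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return window * ((100 - slo_percent) / 100)


year = timedelta(days=365.25)   # assumption: average year length
four_weeks = timedelta(days=28)

for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime(slo, four_weeks)} per 28 days, "
          f"{allowed_downtime(slo, year)} per year")
```

For a 99% SLO this works out to roughly 6 hours 43 minutes of budget in a 28-day window and about 3.65 days over a year.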

The following availability SLO table lists the availability vs. downtime per year and month:

[Table: availability SLO vs. downtime per year and per month]

Now, an error budget is a tool used by SREs to balance the reliability of a service with the pace of innovation. Innovation means change, and the main reason behind instability is change. The development toil for new features is always competing with the development toil required for stability. Change is inevitable, and therefore the error budget works as a control mechanism to divert attention to stability as needed.

According to the appendix of Google's SRE book:

Error Budget = 1 - Availability SLO

For example, if the SLO is 99.9%, then to calculate the error budget:

Error Budget = 1 - 99.9% = 0.1%

Therefore, if your service receives 100,000 requests in four weeks, then with a 99.9% availability SLO, the error budget stands at 100 errors in four weeks (0.1% of 100,000).

Making an error budget and SLO is not just an engineering decision. It's rather a business decision that requires input from various stakeholders from all parts of the organization.

The key stakeholders involved in creating the error budget are:

Product owners, including product managers, business analysts, and product leads, who try to represent the customer to the development team. They anticipate the customer needs, articulate the user journey, and communicate them to the engineering teams.
The SRE and operations team, which includes DevOps, ITSM and problem management, and infrastructure engineers. Their role is to use software to manage a service, solve problems, and automate operations tasks.
Engineers, who are part of the main development team that works on the product.
Customers, who are both internal and external users and stakeholders. The SLOs are non-legally binding promises that the service provider makes to them.

Source: Blameless

The error budget is usually tracked by the SRE team. However, the SRE team doesn't make decisions regarding how it should be spent. They work with development teams to build policies for accelerating development or implementing freezes based on the remaining EB.

How to Determine Uptime and Downtime in SRE?

Your service's uptime is one of the most important metrics that can be used to measure its performance. It shows the time or percentage of when the service is up and running.
The opposite of uptime is downtime, which is the time or percentage of when the service was down. To calculate uptime, first, we need to calculate downtime.

Here's how you calculate downtime:

Downtime = (Total Time the Website was Down / Total Time the Website was Monitored) x 100

To calculate uptime:

Uptime = 100% - Downtime Percentage

Note that uptime and downtime specifically refer to the availability SLIs and EBs. Other SLI types include latency, data freshness, and throughput, and would not be referred to as uptime or downtime.

How can Developers “Spend” their Error Budget?

An error budget is just like your household budget. It's the allowed expenses (unreliability) that your system can afford without making the customer unhappy. Just like your household budget, you're allowed to spend your EB within a given period as long as you don't overspend.

Developers can spend the EB any way they see fit. Teams new to SLOs often release new features as frequently as they want, only to suddenly realize they've spent all of their EB and it's time to stop shipping new features. With better alerting, teams learn to slow down development by the time they spend a significant percentage of EB. As teams advance in SRE maturity and gain better control over how to spend their EB, they begin to strategically spend their EB by taking calculated risks with shipping innovative or experimental features. EB prevents companies from going after too much reliability at the expense of these innovation opportunities that don't impact customer happiness. This is how EB can eventually speed up innovation and velocity.

In fact, increasing the development velocity will give your product an advantage over the others. By outpacing your competitor, you're urging companies to buy your product first. Since you're the first one in the market, there's less competition and more chances of success. By the time your competitor's product even hits the market, you're already hitting your business goals!

What Actions Should a Team Take if their Error Budget is Spent or Close to Spent?

The SRE team works with the development team to implement alerts and policies that minimize customer impact as different amounts of the error budget are burned (50%, 75%, 100%, for example). A team may choose to alert higher levels of management as the EB burndown gets closer to 100%, and the manager would determine the best course of action accordingly. This alerting/EB policy is what makes EBs and SLOs actionable. In fact, Twitter did not successfully implement SLOs until they instituted EB policies as well.

If a team has burned their entire error budget, previously agreed-upon policies can come into effect to prevent further customer impact. For example, the manager may go into code red and freeze all new releases until they've brought the number of errors down to a reasonable point. If there are way too many errors, then the SRE team may have to do a system rollback. That gives developers enough time to deal with the errors gradually and release the changes over time.

Here are some ways that the development team can focus on improving reliability instead of shipping new features when the error budget is spent or nearly spent:

Fixing bugs in the program code or resolving procedural errors.
Softening hard dependencies that were identified in previous incident retrospectives. Removing dependencies will make the code less complex and easier to manage.
Recategorizing errors. If the EB was consumed by miscategorized errors that would have caused the service to miss its SLO, the errors must be categorized appropriately to avoid further confusion.

What Actions can the Development Team take if they are well Above the Target Uptime?

If the development team is well above the target uptime, then they have an advantage. It allows them to increase their push velocity and take risks without putting the product at risk.

Here are a few things that the development team can do if they're well above the target uptime:

Introduce bigger changes
Increase release velocity
Take risks without troubling the SRE team

Error Budget and Maintenance Window

Every system requires some level of maintenance from time to time. In SRE, the maintenance window is a pre-allotted time frame designated by the technical staff. It's dedicated to preventive maintenance that requires disrupting the system's normal operation.
Technologies like virtualization, multi-threaded processors, and containerization have reduced or eliminated the need for a maintenance window. Everyone tries to minimize downtime, but sometimes it's simply unavoidable.

In such a case, should the maintenance window affect the error budget? You can treat maintenance as downtime by burning through the error budget associated with service availability. However, it's not exactly a good practice, and the decision should be made only if you're considering the downtime as part of your reliability work and plan to reduce it. Ideally, there's an option to exempt maintenance window errors from being counted towards the EB.

Let's take a look at two scenarios where the maintenance window is mandatory, and how to choose a suitable maintenance window.

Business Hours

When we're dealing with a service where operations run from 9 to 5, the service can be down outside of business hours. That gives the service provider a maintenance window of about 15 hours, where they can keep the service down without affecting their EB.

Traffic Analysis

Here, you schedule the maintenance window by analyzing traffic patterns and choosing a time when traffic is low. In this scenario, you're still receiving requests but are minimizing the impact on customers.

How can Blameless Help You Track and Implement Error Budgets?

Creating an error budget is a long journey, but the benefits are worth the investment. Working with SRE and creating an EB should be an indispensable part of your journey towards developing more reliable software.
Blameless provides the industry's first end-to-end SRE platform that empowers you to optimize your service for reliability without sacrificing innovation. Our SLO product enables customers to create user journeys and SLOs, set EB policy and notifications, and automatically kick off an incident when EB is depleted. Request a demo today, or sign up for our newsletter below to learn more about the Blameless culture, and how we can help with your SRE journey.

About Noor-ul-Anam Ruqayya

Noor is a software engineer who contributes educational articles on SRE and DevOps fundamentals to our blog.