What is an error budget—and why does it matter? - Atlassian

2024-11-13

Every development, operations, and IT team knows that sometimes incidents happen.

Even the biggest companies with the brightest talent and a reputation for nearly 100% uptime sometimes watch in frustration as their systems go down. Just look at Apple, Delta, or Facebook: all have lost tens of millions to incidents in the past five years.

This reality means Service Level Agreements (SLAs) should never promise 100% uptime, because that's a promise no company can keep.
It also means that if your company is very good at avoiding or resolving incidents, you might consistently knock your uptime goals out of the park. Perhaps you promise 99% uptime and actually come closer to 99.5%. Perhaps you promise 99.5% uptime and actually reach 99.99% in a typical month. When that happens, industry experts recommend that instead of setting user expectations too high by constantly overshooting your promises, you treat that extra margin as an error budget: time that your team can use to take risks.

What is an error budget?

An error budget is the maximum amount of time that a technical system can fail without contractual consequences.

For example, if your Service Level Agreement (SLA) specifies that systems will function 99.99% of the time before the business has to compensate customers for the outage, that means your error budget (or the time your systems can go down without consequences) is 52 minutes and 35 seconds per year. If your SLA promises 99.95% uptime, your error budget is four hours, 22 minutes, and 48 seconds. And with an SLA promise of 99.9% uptime, your error budget is eight hours, 45 minutes, and 57 seconds.

Why do tech teams need error budgets?

At first glance, error budgets don't seem that important. They're just another metric IT and DevOps need to track to make sure everything's running smoothly, right? The answer, fortunately, is no. Error budgets aren't just a convenient way to make sure you're meeting contractual promises. They're also an opportunity for development teams to innovate and take risks.
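The yearly figures quoted above come from simple arithmetic: multiply the length of the period by the fraction of time the system is allowed to be down. Here is a minimal Python sketch of that conversion. The function names `allowed_downtime` and `describe` are illustrative, and the year length of 365.2425 days is an assumption; a 365-day year or a different rounding rule shifts the results by a few seconds.

```python
from datetime import timedelta

# Assumption: an average Gregorian year of 365.2425 days.
# A plain 365-day year gives slightly smaller budgets.
DAYS_PER_YEAR = 365.2425


def allowed_downtime(uptime_percent: float, period_days: float = DAYS_PER_YEAR) -> timedelta:
    """Return the downtime permitted over a period for a given uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    failure_fraction = (100 - uptime_percent) / 100
    return timedelta(days=period_days * failure_fraction)


def describe(budget: timedelta) -> str:
    """Format a timedelta as hours, minutes, and seconds (rounded to whole seconds)."""
    total_seconds = round(budget.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"


if __name__ == "__main__":
    for target in (99.99, 99.95, 99.9, 99.5, 99.0):
        yearly = allowed_downtime(target)
        monthly = allowed_downtime(target, DAYS_PER_YEAR / 12)
        print(f"{target}% uptime: {describe(yearly)} per year, {describe(monthly)} per month")
```

Under these assumptions a 99.9% target works out to roughly 8 hours, 45 minutes, and 57 seconds per year.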
As we explain in our SRE article, “The development team can ‘spend’ this error budget in any way they like. If the product is currently running flawlessly, with few or no errors, they can launch whatever they want, whenever they want. Conversely, if they have met or exceeded the error budget and are operating at or below the defined SLA, all launches are frozen until they reduce the number of errors to a level that allows the launch to proceed.”

The benefit of this approach is that it encourages teams to minimize real incidents and maximize innovation by taking risks within acceptable limits. It also bridges the gap between development teams, whose goals are innovation and agility, and operations, who are concerned with stability and security. As long as downtime remains low, developers can remain agile and push changes without friction from operations.

How to use an error budget

First, you'll need to consult your SLAs and SLOs. What objectives have you already set for uptime or successful system requests? What promises has your company made to clients? Those will dictate your error budget.

Error budgets based on uptime

Most teams monitor uptime on a monthly basis. If availability is above the number promised by the SLA/SLO, the team can release new features and take risks. If it's below the target, releases halt until the target numbers are back on track.
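The monthly check described above can be sketched as a small release gate. This is only an illustration under stated assumptions: the function names are hypothetical, the period is a 30-day month (43,200 minutes), and downtime is assumed to be already measured in minutes by whatever monitoring the team uses.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # assumption: a 30-day reporting period


def measured_availability(downtime_minutes: float, period_minutes: float) -> float:
    """Return availability over the period as a percentage."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    return 100 * (1 - downtime_minutes / period_minutes)


def remaining_budget_minutes(downtime_minutes: float, period_minutes: float,
                             slo_percent: float) -> float:
    """Return the unspent error budget in minutes (negative once overspent)."""
    budget = period_minutes * (100 - slo_percent) / 100
    return budget - downtime_minutes


def releases_allowed(downtime_minutes: float, period_minutes: float,
                     slo_percent: float) -> bool:
    """Allow releases only while availability is strictly above the target."""
    return measured_availability(downtime_minutes, period_minutes) > slo_percent


if __name__ == "__main__":
    # 20 minutes of downtime so far this month against a 99.9% target.
    print(releases_allowed(20, MINUTES_PER_30_DAY_MONTH, 99.9))
    print(remaining_budget_minutes(20, MINUTES_PER_30_DAY_MONTH, 99.9))
```

The strict comparison mirrors the quoted policy, in which launches freeze once the team is operating at or below the defined target.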
To use this method effectively, you'll need to translate your SLO target (usually a percentage) into real figures your developers can work within. This means calculating how many hours and minutes your 1% or 0.5% or 0.1% of allowed downtime actually translates to. Common targets include:

99.99% uptime
Yearly allowed downtime: 52 minutes, 35 seconds
Monthly allowed downtime: 4 minutes, 23 seconds

99.95% uptime
Yearly allowed downtime: 4 hours, 22 minutes, 48 seconds
Monthly allowed downtime: 21 minutes, 54 seconds

99.9% uptime
Yearly allowed downtime: 8 hours, 45 minutes, 57 seconds
Monthly allowed downtime: 43 minutes, 50 seconds

99.5% uptime
Yearly allowed downtime: 43 hours, 49 minutes, 45 seconds
Monthly allowed downtime: 3 hours, 39 minutes

99% uptime
Yearly allowed downtime: 87 hours, 39 minutes
Monthly allowed downtime: 7 hours, 18 minutes

Error budgets based on successful requests

SLOs get less hate than SLAs, but they can create just as many problems if they're vague, overly complicated, or impossible to measure. The key to SLOs that don't make your engineers want to tear their hair out is simplicity and clarity. Only the most important metrics should qualify for SLO status, the objectives should be spelled out in plain language, and, as with SLAs, they should always account for issues such as client-side delays.
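The article does not give a formula for request-based budgets. One common reading, assumed here, is that the budget is the number of requests allowed to fail in a measurement window under a success-rate SLO. The sketch below follows that reading. The function names are hypothetical, and `Fraction` is used so that targets such as 99.9 do not pick up binary floating-point rounding errors.

```python
import math
from fractions import Fraction


def request_error_budget(total_requests: int, slo_percent: float) -> int:
    """Return how many requests may fail in the window without breaching the SLO."""
    # Going through str() keeps 99.9 as exactly 999/10 instead of a binary float.
    target = Fraction(str(slo_percent))
    if not 0 <= target <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    allowed_failure_fraction = (100 - target) / 100
    return math.floor(total_requests * allowed_failure_fraction)


def budget_remaining(total_requests: int, failed_requests: int, slo_percent: float) -> int:
    """Return the unspent request budget (negative once the budget is overspent)."""
    return request_error_budget(total_requests, slo_percent) - failed_requests


if __name__ == "__main__":
    # Hypothetical window: one million requests, 250 failures, 99.9% success target.
    print(request_error_budget(1_000_000, 99.9))
    print(budget_remaining(1_000_000, 250, 99.9))
```

A negative result from `budget_remaining` would correspond to the release freeze described for uptime-based budgets.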