What is an error budget—and why does it matter? - Atlassian
Every development, operations, and IT team knows that sometimes incidents happen.

Even the biggest companies with the brightest talent and a reputation for nearly 100% uptime sometimes watch in frustration as their systems go down. Just look at Apple, Delta, or Facebook, all of which have lost tens of millions to incidents in the past five years.

This reality means Service Level Agreements (SLAs) should never promise 100% uptime, because that’s a promise no company can keep.

It also means that if your company is very good at avoiding or resolving incidents, you might consistently knock your uptime goals out of the park. Perhaps you promise 99% uptime and actually come closer to 99.5%. Perhaps you promise 99.5% uptime and actually reach 99.99% in a typical month. When that happens, industry experts recommend that instead of setting user expectations too high by constantly overshooting your promises, you treat the gap between what you promised and what you delivered as an error budget: time that your team can use to take risks.

What is an error budget?

An error budget is the maximum amount of time that a technical system can fail without contractual consequences.

For example, if your Service Level Agreement (SLA) specifies that systems will function 99.99% of the time before the business has to compensate customers for the outage, your error budget (the time your systems can go down without consequences) is 52 minutes and 35 seconds per year. If your SLA promises 99.95% uptime, your error budget is 4 hours, 22 minutes, and 48 seconds. And with an SLA promise of 99.9% uptime, your error budget is 8 hours, 45 minutes, and 57 seconds.

Why do tech teams need error budgets?

At first glance, error budgets don’t seem that important. They’re just another metric IT and DevOps need to track to make sure everything’s running smoothly, right?

The answer, fortunately, is no. Error budgets aren’t just a convenient way to make sure you’re meeting contractual promises. They’re also an opportunity for development teams to innovate and take risks.
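
The yearly figures quoted in the definition above come from simple arithmetic: multiply the length of the period by the fraction of time the SLA lets you be down. Here is a minimal Python sketch of that calculation. The function names are hypothetical, and the year length (an average Gregorian year of 365.2425 days) is an assumption; with a different year length or rounding rule, the results can differ from the quoted figures by a few seconds.

```python
# Assumption: an average Gregorian year of 365.2425 days.
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60


def allowed_downtime_seconds(uptime_percent: float,
                             period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime allowance, in seconds, for an uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_seconds * (100 - uptime_percent) / 100


def format_duration(seconds: float) -> str:
    """Format a duration as hours, minutes, and whole seconds (truncated)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    # Print the yearly error budget for a few common SLA targets.
    for target in (99.99, 99.95, 99.9):
        print(f"{target}% uptime:", format_duration(allowed_downtime_seconds(target)))
```

Passing a different `period_seconds` value (for example, the number of seconds in a month) gives the budget for a shorter window.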

As we explain in our SRE article, “The development team can ‘spend’ this error budget in any way they like. If the product is currently running flawlessly, with few or no errors, they can launch whatever they want, whenever they want. Conversely, if they have met or exceeded the error budget and are operating at or below the defined SLA, all launches are frozen until they reduce the number of errors to a level that allows the launch to proceed.”

The benefit of this approach is that it encourages teams to minimize real incidents and maximize innovation by taking risks within acceptable limits. It also bridges the gap between development teams, whose goals are innovation and agility, and operations, who are concerned with stability and security. As long as downtime remains low, developers can remain agile and push changes without friction from operations.

How to use an error budget

First, you’ll need to consult your SLAs and SLOs. What objectives have you already set for uptime or successful system requests? What promises has your company made to clients? Those will dictate your error budget.

Error budgets based on uptime

Most teams monitor uptime on a monthly basis. If availability is above the number promised by the SLA/SLO, the team can release new features and take risks. If it’s below the target, releases halt until the target numbers are back on track.
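
The monthly check described above can be sketched as a small release gate. This is an illustration only, not a description of any particular tool: the function names are hypothetical, the month length is assumed to be one twelfth of a 365.2425-day year, and the downtime total is assumed to come from your own monitoring.

```python
# Assumption: an average month, one twelfth of a 365.2425-day year.
MINUTES_PER_MONTH = 365.2425 / 12 * 24 * 60


def monthly_budget_minutes(slo_percent: float) -> float:
    """Return the downtime allowance, in minutes per month, for an uptime target."""
    return MINUTES_PER_MONTH * (100 - slo_percent) / 100


def releases_allowed(slo_percent: float, downtime_minutes_so_far: float) -> bool:
    """Return True while this month's downtime is still under the error budget.

    Once the budget is met or exceeded, releases are frozen.
    """
    return downtime_minutes_so_far < monthly_budget_minutes(slo_percent)


if __name__ == "__main__":
    slo = 99.9
    print(f"Monthly budget at {slo}%: {monthly_budget_minutes(slo):.1f} minutes")
    print("10 minutes down, release?", releases_allowed(slo, 10))
    print("50 minutes down, release?", releases_allowed(slo, 50))
```

The comparison is strict on purpose: a team that has exactly used up its budget is treated the same as one that has overspent it.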

To use this method effectively, you’ll need to translate your SLO target (usually a percentage) into real figures your developers can work within. This means calculating how many hours and minutes your 1% or 0.5% or 0.1% of allowed downtime actually translates to.

Common targets include:

Uptime target    Yearly allowed downtime              Monthly allowed downtime
99.99%           52 minutes, 35 seconds               4 minutes, 23 seconds
99.95%           4 hours, 22 minutes, 48 seconds      21 minutes, 54 seconds
99.9%            8 hours, 45 minutes, 57 seconds      43 minutes, 50 seconds
99.5%            43 hours, 49 minutes, 45 seconds     3 hours, 39 minutes
99%              87 hours, 39 minutes                 7 hours, 18 minutes

Error budgets based on successful requests

SLOs get less hate than SLAs, but they can create just as many problems if they’re vague, overly complicated, or impossible to measure. The key to SLOs that don’t make your engineers want to tear their hair out is simplicity and clarity. Only the most important metrics should qualify for SLO status, the objectives should be spelled out in plain language, and, as with SLAs, they should always account for issues such as client-side delays.
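
The article does not spell out the arithmetic for a request-based budget, so the following is only a sketch under a common assumption: the SLO is a percentage of requests that must succeed within a measurement window, and the budget is the number of requests allowed to fail. All names here are hypothetical.

```python
def allowed_failures(total_requests: int, slo_percent: float) -> float:
    """Return how many requests may fail in the window without breaching the SLO."""
    return total_requests * (100 - slo_percent) / 100


def budget_remaining(total_requests: int, failed_requests: int,
                     slo_percent: float) -> float:
    """Return the failures still available; a negative value means the budget is spent."""
    return allowed_failures(total_requests, slo_percent) - failed_requests


if __name__ == "__main__":
    total, failed, slo = 1_000_000, 250, 99.9
    print(f"Allowed failures: {allowed_failures(total, slo):.0f}")
    print(f"Remaining budget: {budget_remaining(total, failed, slo):.0f}")
```

Keeping the objective to a single, plainly stated ratio like this is one way to follow the advice above about simplicity and clarity.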