Engineering Error Budgets | GitLab
As part of our strategy to reinforce GitLab SaaS as an enterprise-grade platform ready for business-critical workloads, GitLab.com has specific Availability and Performance targets. These targets give our users an indication of the platform's reliability.

Additionally, GitLab.com Service Level Availability is also a part of our contractual agreement with platform customers. The contract might define a specific target number, and not honouring that agreement may result in financial and reputational burdens.

What are error budgets?

The Google SRE book is generally a recommended read and under the "Motivation for Error Budgets" section, it states:

"The error budget provides a clear, objective metric that determines how unreliable the service is allowed to be within a single quarter. This metric removes the politics from negotiations between the SREs and the product developers when deciding how much risk to allow."

This is the goal we are striving for too, while also acknowledging that in order to arrive at the same level of sophistication, we need to take into account our specific situation, maturity and additional requirements. Our initial approach will directly tie the Error Budget SLO to our existing approach to availability.

Future iterations of our error budgets will seek to further develop the importance of the Product Manager in balancing risk tolerance with feature velocity. The above-mentioned clarity between developers and SRE is achieved by establishing the appropriate measures and targets for each service or area of product. Ultimately this balances the importance of new feature work with the ongoing service expectations of users.

What are the components of error budgets?

Error Budgets first depend on establishing an SLO (Service Level Objective). SLOs are made up of an objective, an SLI (Service Level Indicator), and a timeframe.

- Objective: the desired level of success, noted as a percentage
- SLI: an evaluation used to distinguish the number of failed events
- Timeframe: enforcing a recency bias to the SLI

Here is an example of these elements:

- Objective: 99.95%
- SLI: 95th percentile latency of API requests over 5 mins is < 100ms
- Timeframe: previous 28 days

Taken all together, the above example SLO would be: 99.95% of the 95th percentile latency of API requests over 5 mins is < 100ms over the previous 28 days.

The Error Budget is then 1 - Objective of the SLO, in this case (1 - .9995 = .0005). Using our 28 day timeframe, the "budget" for errors is 20.16 minutes (.0005 * (28 * 24 * 60)).

While the above example shows the SLI as a latency measurement, it is important to note that other measurements (such as % errors) are also good elements to use for SLIs.

GitLab's current implementation of Error Budgets is only using some of the above sophistication of SLOs and Error Budgets, but we expect to increase the sophistication in the future. It is expected that the practices of SLOs and Error Budgets evolve to have both the objective and the SLI vary (appropriately) based on the criticality of the service as well as the resiliency of other services and components which depend on it.

Which types of errors are included?

Web requests that result in a 500 status code error are counted. In Sidekiq, jobs that fail due to an unhandled exception are counted.

If a group has custom SLIs, or there's an SLI with a fixed feature category configured in our metrics catalog, then those errors will also be counted.

Engineers can use Gitlab::ErrorTracking.track_exception, or other logging, freely without affecting the error budget.

Why are we using error budgets?

GitLab is a complex system that needs to be delivered as a highly available SaaS platform. Over the years, several processes have been introduced to address some of the challenges of maintaining feature delivery velocity while ensuring that the SaaS reliability continues to increase.

The Infradev Process was created to prioritize resolving an issue after an incident or degradation has happened. While the process has proven to be successful, it is event-focused and event-driven.

The Engineering Allocation Process was created to address long term team efficiency, performance and security items.

The initial iteration of error budgets at GitLab aims to introduce objective data and establish a system that will create greater insight into how individual features are performing over an extended period of time. This can be used by the organization to correctly allocate focus, ensure that the risk is well balanced and that the system as a whole remains healthier for extended periods of time.

Assigning error budgets down to the feature category sets a baseline for specific features, which in turn should ensure alignment on prioritizing what's important for GitLab SaaS.

How do we determine the highest priority improvements?

Each group has a Budget spend attribution section in their Budget detail dashboard that allows them to discover where their budget is being spent.

Both the Budget failures panel and each link in the Failure log links panel are ordered by the number of errors. Prioritising fixing the top offenders in these tables will have the biggest impact on the budget spent.

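The budget arithmetic from the SLO example earlier on this page (1 - Objective, scaled to the 28 day timeframe) can be sketched in a few lines of Python. This is only an illustration: the function name `error_budget_minutes` and its parameters are hypothetical and are not part of any GitLab tooling.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(objective: float, window_days: int = 28) -> float:
    """Return the minutes of allowed unreliability for an SLO objective.

    `objective` is a fraction (0.9995 for 99.95%); `window_days` is the
    SLO timeframe. Both names are illustrative assumptions.
    """
    if not 0.0 <= objective <= 1.0:
        raise ValueError("objective must be a fraction between 0 and 1")
    # Error budget = 1 - objective, expressed as minutes of the window.
    return (1.0 - objective) * window_days * MINUTES_PER_DAY


if __name__ == "__main__":
    budget = error_budget_minutes(0.9995)
    print(f"{budget:.2f} minutes")  # 20.16 minutes
```

With an objective of 99.95% over 28 days this reproduces the 20.16 minutes from the example, which the policy rounds to 20 minutes.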
The Error Budget Policy for GitLab.com

The error budgets process has a few distinct items:

- Budget stakeholders
- Budget allocation
- Budget spend and accounting
- Communication between the stakeholders

Budget stakeholders

The stakeholders in the Error Budget process are:

- Stage teams (Product department and the supporting Engineering teams represented on the product categories page)
- Infrastructure teams (Teams represented on the infrastructure team page)
- VP of Infrastructure and Infrastructure Leadership
- VP of Development and VP of Product

Budget allocation

Error budget is calculated based on the availability targets. With the current target of 99.95% availability, the allowed unavailability window is 20 minutes per 28 day period. We elected to use the 28 day period to match Product reporting methods.

The budget is set on the SaaS platform and is shared between stage and infrastructure teams. Service Level Availability calculation methodology is covered in detail at the GitLab.com SLA page.

This includes all Rails Controllers, API Endpoints, Sidekiq workers, and other SLIs defined in the service catalog. This is attributed to groups by defining a feature category. Documentation about feature categorization is available in the developer guide.

The number or complexity of features owned by a team, existing product priorities, or the team size does not influence the budget.

Budget spend announcements

On the 4th of each month, the following announcements are made:

- Budget Spend by Service
- Budget Spend by Stage Group

The announcements appear in #product, #eng-managers, #f_error_budgets and #development.

Feature categories with monthly spend above the allocated budget for three consecutive months are reported as part of the Engineering Allocation meeting.

Budget spend (by service)

The current budget spend can be found on the general SLA dashboard.

Spent budget is the time (in minutes) during which user facing services have experienced a percentage of errors below the specified threshold and latency is above the specified objectives for the service. The details on how SLA is calculated can be found at the GitLab.com SLA page.

The budget spend is currently aggregated at the primary service level.

Details on what contributed to the budget spend can be further found by examining the raised incidents, and exploring the specific service dashboard (and its resources).

Budget spend (by stage group)

There is an example available with a more detailed look at how this is built.

The current 28 day budget spend can be found on each stage group dashboard. Feature categories for that stage group are rolled up to a single value.

Stage groups can use their dashboards to explore the cause of their budget spend. The process to investigate the budget spend is described in the developer documentation.

The formula for calculating availability:

(the number of operations with a satisfactory apdex + the number of operations without errors) / (the total number of apdex measurements + the total number of operations)

This gives us the percentage of operations that completed successfully and is converted to minutes:

(1 - stage group availability) * (28 * 24 * 60)

Apdex and Error Rates are explained in more detail on the handbook page.

Error Budget Spend information is available on the Error Budgets Overview Dashboard in Sisense.

System-wide incidents

System-wide incidents affecting shared services (such as the database or Redis) may have an impact on a team's Error Budget spend. Since we look at spend over a 28-day period, the impact of these short-lived events should be mostly negligible.

If the impact is significant, we can discuss on the Monthly Report if this incident should warrant a manual adjustment to spend.

At this time we are not looking further into automatically discounting system-wide events from group-level error budgets. The team is focused on building a strong foundation for error budgets, with sufficient tuning capability to be relevant for each group.

How to change error budget attribution

Error budget events are attributed to stage groups via feature categorization. Updates to this mapping can be managed via merge requests if ownership of a part of the platform moves from one feature category to another.

Updates to feature categories only change how future events are mapped to stage groups. Previously reported events will not be retroactively updated and will continue to count against stage group error budgets.

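The stage group availability formula and its conversion to minutes, described under "Budget spend (by stage group)", can be sketched as follows. The function and parameter names are hypothetical, and the counts in the usage example are made-up numbers chosen only to show the arithmetic; they are not real GitLab.com measurements.

```python
def stage_group_availability(
    apdex_satisfactory: int,
    apdex_total: int,
    operations_without_errors: int,
    operations_total: int,
) -> float:
    """Combine the apdex and error components into one availability ratio."""
    denominator = apdex_total + operations_total
    if denominator == 0:
        raise ValueError("no measurements in the window")
    return (apdex_satisfactory + operations_without_errors) / denominator


def budget_spend_minutes(availability: float, window_days: int = 28) -> float:
    """Convert an availability ratio into minutes of budget spent."""
    return (1.0 - availability) * (window_days * 24 * 60)


if __name__ == "__main__":
    # Illustrative counts only (assumed, not taken from any dashboard).
    availability = stage_group_availability(
        apdex_satisfactory=999_000,
        apdex_total=1_000_000,
        operations_without_errors=999_800,
        operations_total=1_000_000,
    )
    spend = budget_spend_minutes(availability)
    print(f"availability={availability:.4%} spend={spend:.2f} minutes")
```

In this made-up case availability is 99.94%, which converts to roughly 24.19 minutes of spend, slightly over a 20 minute budget.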
Contract

All feature categories are expected to perform within their Error Budget regardless of traffic share. This ensures a consistent approach to prioritization of reliability concerns.

Error Budgets should be reviewed monthly as part of the Product Development Timeline.

The balance between feature development and reliability development for a feature category should be as follows:

Monthly Spend (28 days) | Action
<= 20 minutes | Understand your spend - no further action required.
> 20 minutes | Commitment to reliability/availability improvements, feature development is secondary.

Feature categories with monthly spend above the allocated budget for three consecutive months may have additional feature development restrictions put in place.

This is subject to change as Error Budget spend across feature categories decreases.

Stage Groups with different error budgets

Our current contract is 99.95% availability and a 20 minute monthly error budget. However, the following groups have a temporarily adjusted budget based on business needs:

Stage Group | Monthly Spend (28 days) | Business Reason | Review Date
All fulfillment stages | <= 3.65 hours/month, about 5 mins/day (99.5%) | We are prioritizing adding more endpoints to Prometheus so the error budgets have more data points. | 2022-07-31

Exceptions

Temporary exceptions are granted as a means to allow different stakeholders to fulfill higher priority business needs, if it is estimated that the granted exception is not creating additional risk to GitLab.com reliability. Note that exceptions are different from Custom Targets, which set properties on endpoints defining acceptable performance.

Valid reasons for an exception are:

- Work for improving the error budget is scoped out and fully planned, and funded. Completing the work will take more than a single release month, and while the work is being completed it is expected that the Error Budget will be regularly spent.
- Work for improving the error budget is scoped out and fully planned, but the work is not currently funded. The stakeholders are in the process of securing the funding, and the Error Budget will be regularly spent until the additional funds are secured.
- Temporarily, the highest priority is to achieve a significant business goal, and the reliability of GitLab.com is not directly affected. The group is likely to regularly spend the Error Budget while they are focused on this other priority.

Instructions for Requesting an Exception

To request an exception, open an MR and add the stage group to the table above. In the description, supply the following details:

- Clear description of the problem that is the cause of the budget spend
- Relevant resources showing that the work is scoped out
- Target date when the exception must be revisited

Additional Guidance

Document the work to be done using Epics and Issues:

- Add detailed descriptions to the Epics and Issues to ensure the work is clearly scoped out.
- Add a start date and due date to the Epic so that it is transparent how long it will take to complete the work.
- Add milestones to the issues so it is transparent when the work will be planned.

Provide answers to the following questions:

- What portion of your team's budget is due to this exception? If you were to remove the offending endpoints covered by this exception, would your error budget become green?
- What is the main contributor to your team's error budget spend? Is that the response time?
- What does success look like at the closure of the referenced epic?

Follow the guidance and instructions above to expedite the approval process.

Assign the MR for approval to:

- Director of Product or above (of the affected stage group). They are responsible for ensuring that the business need is met, and will need to communicate the change up and down the chain of reporting.
- Director of Infrastructure or above. They are responsible for ensuring that GitLab.com will not be negatively impacted, and will need to communicate the exception up and down the chain of reporting.

Error Budget Improvements

Work relating to error budget improvements should be detailed in an issue. Please label these issues with Error Budget Improvement and the group:: label so they can be tracked in reports.

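The decision rule in the Contract section (20 minutes of spend per 28 days, plus the three-consecutive-months condition) can be expressed as a small sketch. The function names are hypothetical, and checking only the three most recent months is an assumption made for illustration; the policy text does not say how the consecutive months are evaluated.

```python
from typing import Sequence

BUDGET_MINUTES = 20.0  # 99.95% availability over 28 days, per the contract


def contract_action(spend_minutes: float, budget_minutes: float = BUDGET_MINUTES) -> str:
    """Map a 28 day spend to the action listed in the contract table."""
    if spend_minutes < 0:
        raise ValueError("spend cannot be negative")
    if spend_minutes <= budget_minutes:
        return "Understand your spend - no further action required."
    return (
        "Commitment to reliability/availability improvements, "
        "feature development is secondary."
    )


def over_budget_three_months(
    monthly_spend: Sequence[float], budget_minutes: float = BUDGET_MINUTES
) -> bool:
    """True when the three most recent monthly spends all exceed the budget.

    Assumption: `monthly_spend` is ordered oldest to newest.
    """
    recent = list(monthly_spend)[-3:]
    return len(recent) == 3 and all(spend > budget_minutes for spend in recent)


if __name__ == "__main__":
    history = [12.0, 25.5, 31.0, 22.4]  # made-up spends in minutes
    print(contract_action(history[-1]))
    print(over_budget_three_months(history))
```

A group with an approved exception would pass its adjusted budget as `budget_minutes` instead of the default.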
Error Budget DRIs

Role | K/PI | Target | Current Tracking Status
Product Management | Maintaining the Spend of the Error Budget | 20 minutes over 28 days (equivalent to 99.95% availability) | Complete - In Sisense
Infrastructure | Setting the Error Budget Minutes and Availability Target | 99.95% (20 minutes over 28 days Error Budget) | Complete - In Grafana

For groups with engineering allocations, the responsibility to maintain the spend of error budget is with the development team instead of the product management team.

Current State and Future Intent

Current State

- Error budgets exist for each feature category and incorporate a standard apdex threshold and error rate.
- Error budgets are published for stage groups and stages through Grafana and Sisense Dashboards.
- Contributing factors are explorable through links available on the Grafana Dashboards.
- Error budgets are included in the Product Prioritization process.

Roadmap

The changes below aim to increase the maturity of the Error Budgets.

1. Increase precision of Error Budget calculations (apdex portion)

Improvements

- Cancelled: The SLO targets originally used for Error Budgets are coupled with the alerting used for Infrastructure monitoring. We proposed to use Sisense to be able to set targets by stage group, but this approach was not favoured. We found a method to use separate targets for the Infrastructure monitoring and the Error Budgets, but the decision was taken to keep the targets the same and adjust the default latency threshold for the apdex portion of the Error Budgets (see next item).
- Completed: SLI calculations used a request duration threshold which was not appropriate for all endpoints. The threshold was increased to 5s on the 21st of Sept and it will take 28 days for the full effect to be shown in the Error Budgets.
- Completed: Stage groups will next be enabled to set their own SLI per endpoint by expanding on the configurability of the SLI request duration threshold (epic).
- Endpoints that are currently not_owned will be attributed to the correct feature category. This will be addressed by:
  - Completed: using caller information for Sidekiq, and
  - having graphQL query-to-feature correlation.
- The impact of system-wide outages on Error Budgets should be more clear.
- Provide guidance for PMs who report on both Error Budgets and Service Availability (such as Runner and Pages).

Product Development Activities

Product Development teams are encouraged to:

- Continue working on Rapid Action, Infradev, Corrective Actions, Security, and Engineering Allocation issues per our Prioritization guidelines
- Propose SLOs for their endpoints
- Opt-in to using the new apdex calculation methods that use the custom target durations
- Provide further feedback for future improvements to Error Budgets

2. Increase visibility into Error Budgets (error portion)

Stage groups are provided with error count information. This can be supplemented with further detail by making error information explorable with Sentry.

3. Tune the scope of Error Budgets

Consider incorporating P1/S1 incidents into the Error Budget Calculation.

More information

- Error Budget AMA
- Understanding Stage Level Error Budget Dashboards
- Setting up recurring Slack updates