How and why to create an SRE error budget - TechTarget

An error budget encourages developers to take risks in a way that won't significantly compromise service availability.

By Adam Bertram
Published: 03 Dec 2020

Organizations that deliver IT services face an interesting conundrum: They need to make their systems as technologically advanced as possible, while maintaining impeccable uptime. But not even the best service providers can achieve 100% availability.

That doesn't mean, however, that teams can stray far from that percentage without consequences. Any service provider is accountable to its users -- especially when those users are not internal. Top-notch reliability and functionality are expected.

Occasionally, users and service providers contractually determine a compromise between uptime and feature development. These terms help parties define what's known as an error budget. The site reliability engineer (SRE) plays a key role in this process.

Let's discuss what error budgets are, why they're beneficial, and some potential drawbacks.

The basics

To understand error budgets, you must first understand service level agreements (SLAs) and service level objectives (SLOs). When vendors provide a service, they do so with a baseline understanding of that service's performance under a variety of circumstances.

Accordingly, users expect performance to fall in line on their end. An SLA is a contract dictating that a service will perform on par with the users' estimates. This includes metrics like uptime, throughput and latency. Technical support and general communication practices are also critical components of an SLA. The SLA is binding -- failure to provide quality service results in penalties, which are often financial, for the service provider.

SLOs are more granular. These are assurances based around users' specific KPIs. The SLA is an all-encompassing contract, and the SLO forms a subset of agreements within that contract. SLOs enable DevOps, IT and SRE teams to set service delivery goals.

These performance goals must align with predetermined error budgets -- or the amount of acceptable downtime a service can endure, contractually. No service is perfect, but it's critical to keep errors below a certain threshold.
Error budgets are relevant across the entirety of the service ecosystem. Whether teams assess uptime, downtime, request errors or latency, it's possible to create KPI-based allowances.

Determine an error budget

There are many ways to measure service engagement, including real-time user traffic, consumed bandwidth and API requests. That last metric is perfect to determine an error budget -- a calculation which, at its core, is relatively simple.

API requests are HTTP-based and thus made via a network connection. As anyone who's visited a website or tried to use online services knows, attempts aren't always successful. Webpages hang, requests time out and access is sometimes denied due to authentication issues or bad gateways. Users receive feedback when this happens. SRE teams receive similar information.

Any API requests made within a monitored service are logged; this enables retrospective analysis after problems arise. Teams can see exactly how many errors occur during a period of time. They can then take this count, weigh it against successful requests and determine an error percentage.

Here's where an SLO comes in. According to Google's SRE appendix, an error budget is 1 minus the SLO of a given service. For example, let's say a service's API requests must succeed 99.8% of the time, per an SLO agreement. The math would be as follows:

1.0 (100%) baseline - 99.8% SLO = 0.2% error budget

Let's assume that a company makes 100,000 monthly API requests. If only 0.2% of those requests can result in errors, then the monthly, numerical error budget is a mere 200 requests.

Where does the SRE team come in?

SREs actively work to stay within predetermined error budgets. To accomplish this, the reliability engineer tackles a variety of DevOps and IT administration tasks. For example, an SRE must: maintain service availability; mitigate latency; boost service performance and efficiency; monitor service(s); manage changes; respond to outages and emergencies; and perform resource and capacity planning.

SREs must ensure that compute, memory and networking resources can adequately support a scaling user base. Servers must be able to direct traffic and load balance effectively.
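To make the "1 minus the SLO" arithmetic from the error budget calculation concrete, here is a minimal Python sketch. The function names `error_budget` and `allowed_failures` are hypothetical, chosen for illustration only. The SLO is passed as a string and handled with `Decimal` so that values such as 99.8 are not distorted by binary floating-point rounding.

```python
from decimal import Decimal


def error_budget(slo_percent: str) -> Decimal:
    """Return the error budget as a percentage: 100% baseline minus the SLO."""
    return Decimal("100") - Decimal(slo_percent)


def allowed_failures(slo_percent: str, total_requests: int) -> int:
    """Return how many requests may fail in the period without breaching the SLO.

    The result is truncated to a whole number of requests, which is the
    conservative choice when the budget does not divide evenly.
    """
    budget = error_budget(slo_percent)
    return int(budget * total_requests / 100)


# The article's example: a 99.8% SLO and 100,000 monthly API requests.
budget = error_budget("99.8")
failures = allowed_failures("99.8", 100_000)
print(f"Error budget: {budget}%")
print(f"Allowed failed requests per month: {failures}")
```

With the article's figures, this yields a 0.2% budget and 200 allowed failed requests per month.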
SREs must also know which metrics are most crucial to troubleshooting and error reduction. They can then use this intel to make recommendations. The SRE's goal is to remove the burden of service availability from other teams' shoulders. They achieve this through task automation and the creation of self-service tooling. A core step in this process is consultation with other teams and those integral to the services in question.

Through automated logging, remediation and metrics-gathering, SREs make it easier for their colleagues to improve existing services more quickly than before. This increases service quality -- thus reducing errors and ensuring teams don't exceed their error budgets.

Error budget benefits and drawbacks

Error budgets rally diverse technical teams around a central goal. The assignment of a hard number incentivizes high-quality software development. It encourages developers to take risks and experiment with new functionality -- within reason. Software feature changes, adjustments to adjacent services, and even hotfixes can introduce errors; an error budget reduces the blast radii of any issues.

Determining an error budget is crucial for strategic remediation. If services exceed the budget, DevOps and SRE professionals must determine which fixes will be the most effective. This is where the SRE's tooling and automation come in.

Unfortunately, there are two sides to that coin. A variety of conditions might determine whether teams reactively prioritize feature development or stability. Keeping track of these factors -- code bugs, procedural errors, outage origins and user scope -- can be tricky. The process might not be black and white.

SLO-based error budgets can be complicated and difficult to measure. SREs don't always receive clear guidelines pertaining to SLOs. This makes it difficult to determine clear error budgets and stick to them.

Furthermore, teams who set error budgets for the first time -- or set budgets that are exceedingly ambitious -- can overpromise and underdeliver. Pushing for extreme reliability is an admirable goal, but SREs must be realistic about service limits. Otherwise, a failure to deliver comes at a cost.
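As a rough illustration of checking whether a service has exceeded its budget, the sketch below tallies errors from logged HTTP status codes and compares them with the number of failures the SLO allows. Everything here is an assumption made for illustration: the function names, the sample log, and above all the rule that only 5xx responses count as errors. A real team would define what counts as a failure in its own SLO, and might include some 4xx responses such as authentication failures.

```python
from decimal import Decimal


def count_errors(status_codes, is_error=lambda code: code >= 500):
    """Count logged requests that the (assumed) error rule flags as failures."""
    return sum(1 for code in status_codes if is_error(code))


def budget_status(status_codes, slo_percent: str):
    """Compare observed errors against the failures allowed by the SLO."""
    total = len(status_codes)
    errors = count_errors(status_codes)
    # Allowed failures = total requests x (100% - SLO), kept exact with Decimal.
    allowed = Decimal(total) * (Decimal("100") - Decimal(slo_percent)) / 100
    return {
        "total": total,
        "errors": errors,
        "allowed": allowed,
        "exceeded": errors > allowed,
    }


# Hypothetical sample: 10,000 logged requests, six of them 5xx responses.
logged = [200] * 9_990 + [401] * 4 + [502] * 5 + [504] * 1
status = budget_status(logged, "99.8")
print(status)
```

A result with `exceeded` set to `True` would be the signal to shift effort from feature work to stability. Which fixes to prioritize remains a judgment call that the numbers alone don't settle.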