Service Level Objectives (SLO) and Error budgets - Servian


Published in Servian
By Karen Bajador Valencia, Senior Consultant @ Servian

Site Reliability Engineering

Introduction to Site Reliability Engineering

Photo by Ammar ElAmir on Unsplash

It was early 2020, while I was studying for the GCP Architect exam, that I first encountered the term Site Reliability Engineering, or SRE. The buzzword was fairly new to me at the time, but the concept has existed for much longer than I thought. It has been practiced internally at Google since 2004, and it is the secret ingredient that has enabled the tech giant to deliver reliable services to users all around the world.

Liz Fong-Jones and Seth Vargo discuss a good number of insightful SRE concepts in a YouTube playlist, which I have enjoyed watching numerous times. To learn more about how reliability is done at Google, the SRE Book and SRE Workbook are available to read online for free.

The GIST

SRE is not just a new term that is synonymous with DevOps. While DevOps focuses more on establishing culture and philosophies, SRE comprises a prescriptive set of practices that implement DevOps.

Let me share with you my top 5 learnings about Site Reliability Engineering:

1. FAILURE is NORMAL.
2. 100% is NOT the correct reliability target.
3. Define an acceptable reliability target on the service level that is JUST ENOUGH to MAKE CUSTOMERS HAPPY.
4. It is worth AUTOMATING a year's job away.
5. MEASURE everything!

I have embarked on a journey to learn more about how SRE is done and to share my learnings through this blog-post series. In the series, I will be exploring and discussing the following topics:

- PART 1: Service Level Objectives (SLOs) and Error budgets (this article)
- PART 2: Creating SLOs in Google Cloud
- PART 3: To Alert or not to Alert, that is the question
- PART 4: Implementing SLO-based alerts in Google Cloud Service Monitoring (coming soon)

In this article, I'll walk you through how to set reliability targets using Service Level Objectives (SLOs) and error budgets. But before we get to that, have you ever wondered why 100% availability is not what you should be aiming for?

Why is 100% reliability the wrong target?

- Implementing 100% availability is costly and introduces more technical complexity, especially in a distributed environment.
- A chain of systems stands between your service and the end user, and any of them can affect the reliability of your service. Example: internet speed and availability.
- The number one cause of outages is the release of new features.

It is therefore fair to conclude that no matter how much you spend or how much effort you exert to keep your systems perfectly available, users won't experience 100% availability anyway. Making your service too reliable may mean slowing down, or even ceasing altogether, the release of new features that could increase users' satisfaction and loyalty to your service. If the cost of maintaining that level of reliability doesn't lead to business value, it is just not worth the cost and effort.

So if not 100%, what is the correct reliability target?

There may be no right or wrong answer to this. It is ultimately up to product owners, in collaboration with the SRE team, to define how much unreliability they can tolerate while still keeping their customers satisfied. It could be 99%, 99.9%, 99.99%, and so on. But what does 99.99% reliability even mean? This will gradually make sense once you learn a couple more terms.

SLO and SLI

An SLO, or Service Level Objective, is an agreed-upon objective for how reliable a service is expected to be. Before one can fully understand SLOs, one has to know what an SLI is.

An SLI, or Service Level Indicator, is a metric measured over a period of time that describes the health of a service and is used to determine whether SLOs are met. Below are the two most common SLI categories:

- AVAILABILITY: How many requests have returned successfully?
- LATENCY: How long does the application take to respond?

Generally, it is recommended that an SLI be the ratio of two numbers: total good events / total events. See the examples below:

- Number of successful HTTP requests / total HTTP requests
- Number of HTTP requests that completed successfully within 200 ms / total HTTP requests

How do you think these metrics translate to user satisfaction? Let's hear Jacky's story.

Jacky uses a remittance platform to send an allowance to her family from overseas every month. One day, the app had an unexpected outage, and she had to wait until the next day to use the service. She was disappointed at not being able to send the money, as it was intended to pay a bill that was already due. But she continued to use the platform anyway, as she had been a long-time customer and had been satisfied with its service. However, the abrupt outages became more frequent, to the point that she experienced downtime about once a month at random times. This was no longer acceptable to her, which prompted her to consider switching to another platform.

Services don't have to be 100% available, but they still need to be at a level that keeps customers engaged. An SLO is all about figuring out that SWEET SPOT: the bare minimum reliability percentage that is GOOD ENOUGH TO KEEP CUSTOMERS HAPPY. The remaining window of unavailability can be used for releasing new features and system upgrades that may cause some downtime but would result in a better user experience and better performance. This brings us to the next concept, called "error budgets".

Error Budgets

An error budget, in its simplest description, is 1 minus the SLO of the service. A service with a 99.9% SLO has a 0.1% error budget. That window is technically the ROOM FOR ERROR. The rule of thumb is not to exhaust the error budget within an agreed period of time; otherwise, users will end up unhappy.

Calculating SLO using SLI

Let's go through a simple example taken from the SRE Workbook to make sense of it all. Over four weeks, the API metrics show:

- Total requests: 3,663,253
- Total successful requests: 3,557,865 (97.123%)
- 90th percentile latency: 432 ms
- 99th percentile latency: 891 ms

See the proposed SLO below:

[Table: proposed SLO]

Based on this proposed SLO, we can calculate our error budget over those four weeks. Given that the target availability is 97%, the error budget is 3% of 3,663,253 (total requests), which equals 109,897 bad requests.

An SLO of 97% availability allows a total of 109,897 bad requests in a span of 4 weeks. If a prolonged or more frequent outage occurs, the total failures may exceed the error budget, and that is when users start noticing a problem and become dissatisfied with the product.

Both the Dev and SRE teams must ensure that the error budget does not become exhausted. If it does run out, releases have to stop for the time being until the error budget resets, and the team has to reprioritise to focus on reliability and get the service back to an acceptable state.

SUMMARY

- Service-level indicator (SLI): a measurement of performance.
- Service-level objective (SLO): a statement of desired reliability.
- Error budget: balances reliability with feature development or other engineering work, and influences prioritisation.

FURTHER READING

In Part 2, I will walk you through how to create SLOs and track error budgets in Google Cloud.

In Part 3, I will discuss the appropriate method for alerting on SLOs, based on the SRE Book's recommendations.

REFERENCES

- Google - Site Reliability Engineering. "Service level objectives (SLOs) specify a target level for the reliability of your service. Because SLOs are key to…" landing.google.com
- Google - Site Reliability Engineering. landing.google.com
- https://landing.google.com/sre/sre-book/toc/
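
A WORKED SKETCH

As a rough illustration of the arithmetic in the "Calculating SLO using SLI" example, the Python sketch below computes an availability SLI as good events over total events, derives the error budget as 1 minus the SLO, and checks how much of the budget is left. The function names (sli, error_budget, allowed_bad_events) are made up for this illustration and do not come from any Google Cloud or SRE tooling. The request counts are the four-week figures quoted in the example, and the observed number of bad requests is simply the difference between the two counts.

```python
from fractions import Fraction
from math import floor


def sli(good_events: int, total_events: int) -> Fraction:
    """Return an SLI as the exact ratio of good events to total events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return Fraction(good_events, total_events)


def error_budget(slo_percent) -> Fraction:
    """Return the error budget as a fraction: 1 minus the SLO."""
    # Going through str() keeps values such as 99.9 exact (999/1000).
    slo = Fraction(str(slo_percent)) / 100
    if not 0 < slo <= 1:
        raise ValueError("slo_percent must be in the range (0, 100]")
    return 1 - slo


def allowed_bad_events(slo_percent, total_events: int) -> int:
    """Return the whole number of bad events the SLO tolerates."""
    return floor(error_budget(slo_percent) * total_events)


# Four-week figures quoted in the example.
total_requests = 3_663_253
successful_requests = 3_557_865

availability = sli(successful_requests, total_requests)
budget = allowed_bad_events(97, total_requests)

# Derived here, not quoted in the article: total minus successful requests.
bad_requests = total_requests - successful_requests
remaining = budget - bad_requests

print(f"Availability SLI: {float(availability):.3%}")
print(f"Error budget at a 97% SLO: {budget:,} bad requests")
print(f"Bad requests observed: {bad_requests:,}")
print(f"Error budget remaining: {remaining:,}")
```

With these figures, the service stays within its 97% availability SLO, though with only a small part of the budget left. The sketch uses exact fractions rather than floats so that an SLO such as 99.9% yields an error budget of exactly 0.1%.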


