Chapter 8 - On-Call - Google - Site Reliability Engineering


By Ollie Cook, Sara Smollett, Andrea Spadaccini, Cara Donnelly, Jian Ma, and Garrett Plasky (Evernote) with Stephen Thorne and Jessie Yang

Being on-call means being available during a set period of time, and being ready to respond to production incidents during that time with appropriate urgency. Site Reliability Engineers (SREs) are often required to take part in on-call rotations. During on-call shifts, SREs diagnose, mitigate, fix, or escalate incidents as needed. In addition, SREs are regularly responsible for nonurgent production duties.

At Google, being on-call is one of the defining characteristics of SRE. SRE teams mitigate incidents, repair production problems, and automate operational tasks. Since most of our SRE teams have not yet fully automated all their operational tasks, escalations need human points of contact: on-call engineers. Depending on how critical the supported systems are, or the state of development the systems are in, not all SRE teams may need to be on-call. In our experience, most SRE teams staff on-call shifts.
On-call is a large and complex topic, saddled with many constraints and a limited margin for trial and error. Chapter 11 of our first book (Site Reliability Engineering), “Being On-Call,” already explored this topic. This chapter addresses specific feedback and questions we received about that chapter. These include the following:

- “We are not Google; we’re much smaller. We don’t have as many people in the rotation, and we don’t have sites in different time zones. What you described in your first book is irrelevant to me.”
- “We have a mixture of developers and DevOps for on-call rotation. What’s the best way to organize them? Split the responsibilities?”
- “Our on-call engineer gets paged about a hundred times in a typical 24-hour shift. A lot of pages get ignored, while the real problems are buried under the pile. Where should we start?”
- “We have a high turnover rate for on-call rotations. How do you address the knowledge gap within the team?”
- “We want to reorg our DevOps team into SRE.[1] What’s the difference between SRE on-call, DevOps on-call, and developers on-call? Please be specific, because the DevOps team is very concerned about this.”

We offer practical advice for these situations. Google is a large company with a mature SRE organization, but much of what we’ve learned over the years can be applied to any company or organization, regardless of size or maturity. Google has hundreds of on-call rotations across services of all sizes, and various on-call setups from simple to complicated. On-call is not exclusively an SRE function: many developer teams are directly on-call for their service. Each on-call setup meets the need of a particular service.

This chapter describes on-call setups both within Google and outside of Google. While your setup and situation will likely differ from our specific examples, the essential concepts we cover are widely applicable.

We then delve into the anatomy of pager load, explaining what causes pager load. We suggest strategies to optimize on-call setup and minimize that load.
Finally, we share two examples of practices inside Google: on-call flexibility and on-call team dynamics. These practices show that no matter how mathematically sound an on-call setup is, you cannot solely rely on logistics of the on-call setup. Incentives and human nature play an important role, and should also be taken into account.

Recap of “Being On-Call” Chapter of First SRE Book

Site Reliability Engineering, in “Being On-Call”, explains the principles behind on-call rotations at Google. This section discusses the main points of that chapter.

At Google, the overall goal of being on-call is to provide coverage for critical services, while making sure that we never achieve reliability at the expense of an on-call engineer’s health. As a result, SRE teams strive for balance. SRE work should be a healthy mix of duties: on-call and project work. Specifying that SREs spend at least 50% of their time on project work means that teams have time to tackle the projects required to strategically address any problems found in production. Team staffing must be adequate to ensure time for project work.

We target a maximum of two incidents per on-call shift,[2] to ensure adequate time for follow-up. If the pager load gets too high, corrective action is warranted. (We explore pager load later in this chapter.)

Psychological safety[3] is vital for effective on-call rotations. Since being on-call can be daunting and highly stressful, on-call engineers should be fully supported by a series of procedures and escalation paths to make their lives easier.

On-call usually implies some amount of out-of-hours work. We believe this work should be compensated. While different companies may choose to handle this in different ways, Google offers time-off-in-lieu or cash compensation, capped at some proportion of the overall salary. The compensation scheme provides an incentive for being part of on-call, and ensures that engineers do not take on too many on-call shifts for economic reasons.
Example On-Call Setups Within Google and Outside Google

This section describes real-world examples of on-call setups at Google and Evernote, a California company that develops a cross-platform app that helps individuals and teams create, assemble, and share information. For each company, we explore the reasoning behind on-call setups, general on-call philosophy, and on-call practices.

Google: Forming a New Team

Initial scenario

A few years ago, Sara, an SRE at Google Mountain View, started a new SRE team that needed to be on-call within three months. To put this in perspective, most SRE teams at Google do not expect new hires to be ready for on-call before three to nine months. The new Mountain View SRE team would support three Google Apps services that were previously supported by an SRE team in Kirkland, Washington (a two-hour flight from Mountain View). The Kirkland team had a sister SRE team in London, which would continue to support these services alongside the new Mountain View SRE team, and distributed product development teams.[4]

The new Mountain View SRE team came together quickly, assembling seven people:

- Sara, an SRE tech lead
- Mike, an experienced SRE from another SRE team
- A transfer from a product development team who was new to SRE
- Four new hires (“Nooglers”)

Even when a team is mature, going on-call for new services is always challenging, and the new Mountain View SRE team was a relatively junior team. Nonetheless, the new team was able to onboard the services without sacrificing service quality or project velocity. They made immediate improvements to the services, including lowering machine costs by 40%, and fully automating release rollouts with canarying and other safety checks. The new team also continued to deliver reliable services, targeting 99.98% availability, or roughly 26 minutes of downtime per quarter.

How did the new SRE team bootstrap themselves to accomplish so much? Through starter projects, mentoring, and training.
Training roadmap

Although the new SRE team didn’t know much about their services, Sara and Mike were familiar with Google’s production environment and SRE. As the four Nooglers completed company orientation, Sara and Mike compiled a checklist of two dozen focus areas for people to practice before going on-call, such as:

- Administering production jobs
- Understanding debugging info
- “Draining” traffic away from a cluster
- Rolling back a bad software push
- Blocking or rate-limiting unwanted traffic
- Bringing up additional serving capacity
- Using the monitoring systems (for alerting and dashboards)
- Describing the architecture, various components, and dependencies of the services

The Nooglers found some of this information on their own by researching existing documentation and codelabs (guided, hands-on coding tutorials) and gained understanding on relevant topics by working on their starter projects. When a team member learned about specific topics relevant to the Nooglers’ starter projects, that person led a short, impromptu session to share that information with the rest of the team. Sara and Mike covered the remaining topics. The team also held lab sessions to perform common debugging and mitigation tasks to help everyone build muscle memory and gain confidence in their abilities.

In addition to the checklist, the new SRE team ran a series of “deep dives” to dig into their services. The team browsed monitoring consoles, identified running jobs, and tried debugging recent pages. Sara and Mike explained that an engineer didn’t need years of expertise with each of the services to become reasonably proficient. They coached the team to explore a service from first principles, and encouraged Nooglers to become familiar with the services. They were open about the limits of their knowledge, and taught others when to ask for help.
Throughout the ramp-up, the new SRE team wasn’t alone. Sara and Mike traveled to meet the other SRE teams and product developers and learn from them. The new SRE team met with the Kirkland and London teams by holding video conferences, exchanging email, and chatting over IRC. In addition, the team attended weekly production meetings, read daily on-call handoffs and postmortems, and browsed existing service documentation. A Kirkland SRE visited to give talks and answer questions. A London SRE put together a thorough set of disaster scenarios and ran them during Google’s disaster recovery training week (see the section “Preparedness and Disaster Testing” in Site Reliability Engineering, Chapter 33).

The team also practiced being on-call through “Wheel of Misfortune” training exercises (see the section “Disaster Role Playing” in Site Reliability Engineering, Chapter 28), where they role-played recent incidents to practice debugging production problems. During these sessions, all SREs were encouraged to offer suggestions on how to resolve mock production failures. After everyone ramped up, the team still held these sessions, rotating through each team member as the session leader. The team recorded these for future reference.

Before going on-call, the team reviewed precise guidelines about the responsibilities of on-call engineers. For example:

- At the start of each shift, the on-call engineer reads the handoff from the previous shift.
- The on-call engineer minimizes user impact first, then makes sure the issues are fully addressed.
- At the end of the shift, the on-call engineer sends a handoff email to the next engineer on-call.

The guidelines also specified when to escalate to others, and how to write postmortems for large incidents.

Finally, the team read and updated on-call playbooks. Playbooks contain high-level instructions on how to respond to automated alerts. They explain the severity and impact of the alert, and include debugging suggestions and possible actions to take to mitigate impact and fully resolve the alert. In SRE, whenever an alert is created, a corresponding playbook entry is usually created. These guides reduce stress, the mean time to repair (MTTR), and the risk of human error.
Maintaining Playbooks

Details in playbooks go out of date at the same rate as the production environment changes. For daily releases, playbooks might need an update on any given day. Writing good documentation, like any form of communication, is hard. So how do you maintain playbooks?

Some SREs at Google advocate keeping playbook entries general so they change slowly. For example, they may have just one entry for all “RPC Errors High” alerts, for a trained on-call engineer to read, in conjunction with an architecture diagram for the currently alerting service. Other SREs advocate for step-by-step playbooks to reduce human variability and drive down MTTR. If your team has conflicting views about playbook content, the playbooks might get pulled in many directions.

This is a contentious topic. If you agree on nothing else, at least decide with your team what minimal, structured details your playbooks must have, and try to notice when your playbooks have accumulated a lot of information beyond these structured details. Pencil in a project to turn new, hard-won production knowledge into automation or monitoring consoles. If your playbooks are a deterministic list of commands that the on-call engineer runs every time a particular alert fires, we recommend implementing automation.

After two months, Sara, Mike, and the SRE transfer shadowed the on-call shifts of the outgoing Kirkland SRE team. At three months, they became the primary on-call, with the Kirkland SREs as backup. That way, they could easily escalate to the Kirkland SREs if needed. Next, the Nooglers shadowed the more experienced, local SREs and joined the rotation.

Good documentation and the various strategies discussed earlier all helped the team form a solid foundation and ramp up quickly. Although on-call can be stressful, the team’s confidence grew enough for its members to take action without second-guessing themselves. There was psychological safety in knowing that their responses were based on the team’s collective knowledge, and that even when they escalated, the on-call engineers were still regarded as competent engineers.
Afterword

While the Mountain View SREs were ramping up, they learned that their experienced, sister SRE team in London would be moving on to a new project, and a new team was being formed in Zürich to support the services previously supported by the London SRE team. For this second transition, the same basic approach the Mountain View SREs used proved successful. The previous investment by Mountain View SREs in developing onboarding and training materials helped the new Zürich SRE team ramp up.

While the approach used by the Mountain View SREs made sense when a cohort of SREs were becoming a team, they needed a more lightweight approach when only one person joined the team at a given time. In anticipation of future turnover, the SREs created service architecture diagrams and formalized the basic training checklist into a series of exercises that could be completed semi-independently with minimal involvement from a mentor. These exercises included describing the storage layer, performing capacity increases, and reviewing how HTTP requests are routed.

Evernote: Finding Our Feet in the Cloud

Moving our on-prem infrastructure to the cloud

We didn’t set out to reengineer our on-call process, but as with many things in life, necessity is the mother of invention. Prior to December 2016, Evernote ran only on on-prem datacenters, built to support our monolithic application. Our network and servers were designed with a specific architecture and data flow in mind. This, combined with a host of other constraints, meant we lacked the flexibility needed to support a horizontal architecture. Google Cloud Platform (GCP) provided a concrete solution to our problem. However, we still had one major hurdle to surmount: migrating all our production and supporting infrastructure to GCP. Fast-forward 70 days. Through a Herculean effort and many remarkable feats (for example, moving thousands of servers and 3.5 PB of data), we were happily settled in our new home. At this point, though, our job still wasn’t done: how were we going to monitor, alert, and (most importantly) respond to issues in our new environment?
Adjusting our on-call policies and processes

The move to the cloud unleashed the potential for our infrastructure to grow rapidly, but our on-call policies and processes were not yet set up to handle such growth. Once the migration wrapped up, we set out to remedy the problem. In our previous physical datacenter, we built redundancy into nearly every component. This meant that while component failure was common given our size, generally no individual component was capable of negatively impacting users. The infrastructure was very stable because we controlled it, so any small bump would inevitably be due to a failure somewhere in the system. Our alerting policies were structured with that in mind: a few dropped packets, resulting in a JDBC (Java Database Connectivity) connection exception, invariably meant that a VM (virtual machine) host was on the verge of failing, or that the control plane on one of our switches was on the fritz. Even before our first day in the cloud, we realized that this type of alert/response system was not tenable going forward. In a world of live migrations and network latency, we needed to take a much more holistic approach to monitoring.

Reframing paging events in terms of first principles, and writing these principles down as our explicit SLOs (service level objectives), helped give the team clarity regarding what was important to alert on and allowed us to trim the fat from our monitoring infrastructure. Our focus on higher-level indicators such as API responsiveness, rather than lower-level infrastructure such as InnoDB row lock waits in MySQL, meant we could focus more time on the real pain our users experience during an outage. For our team, this meant less time spent chasing transient problems. This translated into more sleep, effectiveness, and ultimately, job satisfaction.
Restructuring our monitoring and metrics

Our primary on-call rotation is staffed by a small but scrappy team of engineers who are responsible for our production infrastructure and a handful of other business systems (for example, staging and build pipeline infrastructure). We have a weekly, 24/7 schedule with a well-oiled handoff procedure, alongside a morning review of incidents at a daily stand-up. Our small team size and comparatively large scope of responsibility necessitate that we make every effort to keep the process burden light, and focus on closing the alert/triage/remediation/analysis loop as quickly as possible. One of the ways we achieve this is to keep our signal-to-noise ratio high by maintaining simple but effective alerting SLAs (service level agreements). We classify any event generated by our metrics or monitoring infrastructure into three categories:

P1: Deal with immediately
- Should be immediately actionable
- Pages the on-call
- Leads to event triage
- Is SLO-impacting

P2: Deal with the next business day
- Generally is not customer-facing, or is very limited in scope
- Sends an email to team and notifies event stream channel

P3: Event is informational only
- Information is gathered in dashboards, passive email, and the like
- Includes information related to capacity planning

Any P1 or P2 event has an incident ticket attached to it. The ticket is used for obvious tasks like event triage and tracking remediation actions, as well as for SLO impact, number of occurrences, and postmortem doc links, where applicable.

When an event pages (category P1), the on-call is tasked with assessing the impact to users. Incidents are triaged into severities from 1 to 3. For severity 1 (Sev 1) incidents, we maintain a finite set of criteria to make the escalation decision as straightforward as possible for the responder. Once the event is escalated, we assemble an incident team and begin our incident management process. The incident manager is paged, a scribe and a communications lead are elected, and our communication channels open. After the incident is resolved, we conduct an automatic postmortem and share the results far and wide within the company. For events rated Sev 2 or Sev 3, the on-call responder handles the incident lifecycle, including an abbreviated postmortem for incident review.
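The three event categories above can be read as a small routing rule. The following is a minimal sketch, not Evernote's actual tooling: the `Event` fields, the `classify` rules, and the `needs_ticket` helper are hypothetical names chosen for illustration, and a real classifier would likely weigh more signals than two booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    P1 = "page the on-call immediately"
    P2 = "email the team; handle the next business day"
    P3 = "informational only; dashboards and passive email"


@dataclass
class Event:
    name: str
    slo_impacting: bool  # hypothetical field: does the event threaten an SLO?
    actionable: bool     # hypothetical field: can a human do something about it?


def classify(event: Event) -> Priority:
    """Map an event to P1/P2/P3 using assumed, simplified rules."""
    if event.slo_impacting and event.actionable:
        return Priority.P1
    if event.actionable:
        return Priority.P2
    return Priority.P3


def needs_ticket(priority: Priority) -> bool:
    """Per the text, any P1 or P2 event gets an incident ticket."""
    return priority in (Priority.P1, Priority.P2)


if __name__ == "__main__":
    for ev in (
        Event("API latency above SLO", slo_impacting=True, actionable=True),
        Event("staging build host disk full", slo_impacting=False, actionable=True),
        Event("storage growth trend", slo_impacting=False, actionable=False),
    ):
        p = classify(ev)
        print(f"{ev.name}: {p.name} ({p.value}), ticket={needs_ticket(p)}")
```

Keeping the rule this small is deliberate: if classifying an event needs a long decision tree, the responder probably cannot apply it quickly at page time either.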
One of the benefits of keeping our process lightweight is that we can explicitly free the on-call from any expectations of project work. This empowers and encourages the on-call to take immediate follow-up action, and also to identify any major gaps in tooling or process after completing the post-incident review. In this way, we achieve a constant cycle of improvement and flexibility during every on-call shift, keeping pace with the rapid rate of change in our environment. The goal is to make every on-call shift better than the last.

Tracking our performance over time

With the introduction of SLOs, we wanted to track performance over time, and share that information with stakeholders within the company. We implemented a monthly service review meeting, open to anyone who’s interested, to review and discuss the previous month of the service. We have also used this forum to review our on-call burden as a barometer of team health, and discuss remediation actions when we exceed our pager budget. This forum has the dual purpose of spreading the importance of SLOs within the company and keeping the technical organization accountable for maintaining the health and wellness of our service and team.

Engaging with CRE

Expressing our objectives in terms of SLOs provides a basis for engaging with Google’s Customer Reliability Engineering (CRE) team. After we discussed our SLOs with CRE to see if they were realistic and measurable, both teams decided CRE would be paged alongside our own engineers for SLO-impacting events. It can be difficult to pinpoint root causes that are hidden behind layers of cloud abstraction, so having a Googler at our side take the guesswork out of black-box event triaging was helpful. More importantly, this exercise further reduced our MTTR, which is ultimately what our users care about.
Sustaining a self-perpetuating cycle

Rather than spending all our time in the triage/root-cause analysis/postmortem cycle, we now have more time as a team to think about how we move the business forward. Specifically, this translates into projects such as improving our microservices platform and establishing production readiness criteria for our product development teams. The latter includes many of the principles we followed in restructuring our on-call, which is particularly helpful for teams in their first “carry the pager” rodeo. Thus, we perpetuate the cycle of improving on-call for everyone.

Practical Implementation Details

So far, we’ve discussed details about on-call setups, both within Google and outside of Google. But what about specific considerations of being on-call? The following sections discuss these implementation details in more depth:

- Pager load: what it is, how it works, and how to manage it
- How to factor flexibility into on-call scheduling to create a healthier work/life balance for SREs
- Strategies for improving team dynamics, both within a given SRE team, and with partner teams

Anatomy of Pager Load

Your pager is noisy and it’s making your team unhappy. You’ve read through Chapter 31 in Site Reliability Engineering, and run regular production meetings, both with your team and the developer teams you support. Now everyone knows that your on-call engineers are unhappy. What next?

Pager load is the number of paging incidents that an on-call engineer receives over a typical shift length (such as per day or per week). An incident may involve more than one page. Here, we’ll walk through the impact of various factors on pager load, and suggest techniques for minimizing future pager load.

Appropriate Response Times

Engineers shouldn’t have to be at a computer and working on a problem within minutes of receiving a page unless there is a very good reason to do so. While a complete outage of a customer-facing, revenue-generating service typically requires an immediate response, you can deal with less severe issues (for example, failing backups) within a few hours.
We recommend checking your current paging setup to see if you actually should be paged for everything that currently triggers a page. You may be paging for issues that would be better served by automated repair (as it’s generally better for a computer to fix a problem than requiring a human to fix it) or a ticket (if it’s not actually high priority). Table 8-1 shows some sample events and appropriate responses.

Table 8-1. Examples of realistic response times

| Incident description | Response time | SRE impact |
|---|---|---|
| Revenue-impacting network outage | 5 minutes | SRE needs to be within arm’s reach of a charged and authenticated laptop with network access at all times; cannot travel; must heavily coordinate with secondary at all times |
| Customer order batch processing system stuck | 30 minutes | SRE can leave their home for a quick errand or short commute; secondary does not need to provide coverage during this time |
| Backups of a database for a pre-launch service are failing | Ticket (response during work hours) | None |

Scenario: A team in overload

The (hypothetical) Connection SRE Team, responsible for frontend load balancing and terminating end-user connections, found itself in a position of high pager load. They had an established pager budget of two paging incidents per shift, but for the past year they had regularly been receiving five paging incidents per shift. Analysis revealed that fully one-third of shifts were exceeding their pager budget. Members of the team heroically responded to the daily onslaught of pages but couldn’t keep up; there simply was not enough time in the day to find the root cause and properly fix the incoming issues. Some engineers left the team to join less operationally burdened teams. High-quality incident follow-up was rare, since on-call engineers only had time to mitigate immediate problems.

The team’s horizon wasn’t entirely bleak: they had a mature monitoring infrastructure that followed SRE best practices. Alerting thresholds were set to align with their SLO, and paging alerts were symptom-based in nature, meaning they fired only when customers were impacted. When senior management were approached with all of this information, they agreed that the team was in operational overload and reviewed the project plan to bring the team back to a healthy state.
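The kind of analysis the Connection team ran (how many shifts exceeded the pager budget) is easy to reproduce from a page log. Here is a minimal sketch, assuming each page is recorded as a hypothetical `(shift_id, incident_id)` pair; the function names and the sample data are illustrative, not taken from any Google tool. Pages are collapsed into distinct incidents first, since one incident may involve more than one page.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

Page = Tuple[str, str]  # (shift_id, incident_id); both identifiers are hypothetical


def incidents_per_shift(pages: Iterable[Page]) -> Counter:
    """Count distinct incidents per shift, collapsing repeated pages."""
    distinct = set(pages)
    return Counter(shift for shift, _incident in distinct)


def share_of_shifts_over_budget(
    pages: Iterable[Page], all_shifts: Sequence[str], budget: int = 2
) -> float:
    """Fraction of shifts whose incident count exceeds the pager budget."""
    counts = incidents_per_shift(pages)
    over = sum(1 for shift in all_shifts if counts[shift] > budget)
    return over / len(all_shifts)


if __name__ == "__main__":
    pages = [
        ("mon", "i1"), ("mon", "i1"),  # two pages, one incident
        ("mon", "i2"), ("mon", "i3"),
        ("tue", "i4"),
    ]
    shifts = ["mon", "tue", "wed"]  # include quiet shifts in the denominator
    print(incidents_per_shift(pages))
    print(share_of_shifts_over_budget(pages, shifts, budget=2))
```

Passing the full list of shifts separately matters: shifts with zero pages never appear in the page log, and leaving them out would overstate the overload.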
In less positive news, over time the Connection team had taken ownership of software components from more than 10 developer teams and had hard dependencies on Google’s customer-facing edge and backbone networks. The large number of intergroup relationships was complex and had quietly grown difficult to manage.

Despite the team following best practices in structuring their monitoring, many of the pages that they faced were outside their direct control. For example, a black-box probe may have failed due to congestion in the network, causing packet loss. The only action the team could take to mitigate congestion in the backbone was to escalate to the team directly responsible for that network.

On top of their operational burden, the team needed to deliver new features to the frontend systems, which would be used by all Google services. To make matters worse, their infrastructure was being migrated from a 10-year-old legacy framework and cluster management system to a better-supported replacement. The team’s services were subject to an unprecedented rate of change, and the changes themselves caused a significant portion of the on-call load.

The team clearly needed to combat this excessive pager load using a variety of techniques. The technical program manager and the people manager of the team approached senior management with a project proposal, which senior management reviewed and approved. The team turned their full attention to reducing their pager load, and learned some valuable lessons along the way.

Pager load inputs

The first step in tackling high pager load is to determine what is causing it. Pager load is influenced by three main factors: bugs[5] in production, alerting, and human processes. Each of these factors has several inputs, some of which we discuss in more detail in this section.
For production:

- The number of existing bugs in production
- The introduction of new bugs into production
- The speed with which newly introduced bugs are identified
- The speed with which bugs are mitigated and removed from production

For alerting:

- The alerting thresholds that trigger a paging alert
- The introduction of new paging alerts
- The alignment of a service’s SLO with the SLOs of the services upon which it depends

For human processes:

- The rigor of fixes and follow-up on bugs
- The quality of data collected about paging alerts
- The attention paid to pager load trends
- Human-actuated changes to production

Preexisting bugs

No system is perfect. There will always be bugs in production: in your own code, the software and libraries that you build upon, or the interfaces between them. The bugs may not be causing paging alerts right now, but they are definitely present. You can use a few techniques to identify or prevent bugs that haven’t yet caused paging alerts:

- Ensure systems are as complicated as they need to be, and no more (see Chapter 7, “Simplicity”).
- Regularly update the software or libraries that your system is built upon to take advantage of bug fixes (however, see the next section about new bugs).
- Perform regular destructive testing or fuzzing (for example, using Netflix’s Chaos Monkey).
- Perform regular load testing in addition to integration and unit testing.

New bugs

Ideally, the SRE team and its partner developer teams should detect new bugs before they even make it into production. In reality, automated testing misses many bugs, which are then launched to production.

Software testing is a large topic well covered elsewhere (e.g., Martin Fowler on Testing). However, software testing techniques are particularly useful in reducing the number of bugs that reach production, and the amount of time they remain in production:

- Improve testing over time. In particular, for each bug you discover in production, ask “How could we have detected this bug preproduction?” Make sure the necessary engineering follow-up occurs (see “Rigor of follow-up”).
- Don’t ignore load testing, which is often treated as lower priority than functional testing. Many bugs manifest only under particular load conditions or with a particular mix of requests.
- Run staging (testing with production-like but synthetic traffic) in a production-like environment. We briefly discuss generating synthetic traffic in Chapter 5, “Alerting on SLOs,” of this book.
- Perform canarying (see Chapter 16, “Canarying Releases”) in a production environment.
- Have a low tolerance for new bugs. Follow a “detect, roll back, fix, and roll forward” strategy rather than a “detect, continue to roll forward despite identifying the bug, fix, and roll forward again” strategy. (See “Mitigation delay” for more details.)

This kind of rollback strategy requires predictable and frequent releases so that the cost of rolling back any one release is small. We discuss this and related topics in Site Reliability Engineering, in “Release Engineering”.

Some bugs may manifest only as the result of changing client behavior. For example:

- Bugs that manifest only under specific levels of load: for example, September back-to-school traffic, Black Friday, Cyber Monday, or that week of the year when Daylight Saving Time means Europe and North America are one hour closer, meaning more of your users are awake and online simultaneously.
- Bugs that manifest only with a particular mix of requests: for example, servers closer to Asia experiencing a more expensive traffic mix due to language encodings for Asian character sets.
- Bugs that manifest only when users exercise the system in unexpected ways: for example, Calendar being used by an airline reservation system!

Therefore, it is important to expand your testing regimen to test behaviors that do not occur every day.

When a production system is plagued by several concurrent bugs, it’s much more difficult to identify if a given page is for an existing or new bug. Minimizing the number of bugs in production not only reduces pager load, it also makes identifying and classifying new bugs easier. Therefore, it is critical to remove production bugs from your systems as quickly as possible. Prioritize fixing existing bugs above delivering new features; if this requires cross-team collaboration, see Chapter 18, “SRE Engagement Model.”
Architectural or procedural problems, such as gaps in automated health checking, self-healing, or load shedding, may need significant engineering work to resolve. Remember, for simplicity’s sake we’ll consider these problems “bugs,” even if their size, their complexity, or the effort required to resolve them is significant.

Chapter 3 of Site Reliability Engineering describes how error budgets are a useful way to manage the rate at which new bugs are released to production. For example, when a service’s SLO violations exceed a certain fraction of its total quarterly error budget (typically agreed in advance between the developer and SRE teams), new feature development and feature-related rollouts can be halted temporarily to focus on stabilizing the system and reducing the frequency of pages.

The Connection team from our example adopted a strict policy requiring every outage to have a tracking bug. This enabled the team’s technical program manager to examine the root cause of their new bugs in aggregate. This data revealed that human error was the second most common cause of new bugs in production.

Because humans are error-prone, it’s better if all changes made to production systems are made by automation informed by (human-developed) intent configuration. Before you make a change to production, automation can perform additional testing that humans cannot. The Connection team was making complex changes to production semimanually. Not surprisingly, the team’s manual changes went wrong sometimes; the team introduced new bugs, which caused pages. Automated systems making the same changes would have determined that the changes were not safe before they entered production and became paging events. The technical program manager took this data to the team and convinced them to prioritize automation projects.
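The error budget policy mentioned earlier in this section (halting feature rollouts once SLO violations exceed an agreed fraction of the quarterly budget) reduces to a one-line comparison. The sketch below is illustrative only: the function names and the 0.75 default threshold are assumptions, since the text says only that the fraction is agreed in advance between the developer and SRE teams.

```python
def budget_consumed_fraction(bad_minutes: float, budget_minutes: float) -> float:
    """Fraction of the quarterly error budget already spent."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return bad_minutes / budget_minutes


def should_halt_feature_rollouts(
    bad_minutes: float,
    budget_minutes: float,
    freeze_threshold: float = 0.75,  # assumed value; agree on yours in advance
) -> bool:
    """True when spent budget exceeds the agreed fraction of the total."""
    return budget_consumed_fraction(bad_minutes, budget_minutes) > freeze_threshold


if __name__ == "__main__":
    # Example: a 26-minute quarterly budget with 20 minutes already spent.
    print(should_halt_feature_rollouts(bad_minutes=20.0, budget_minutes=26.0))
```

Writing the threshold down as a parameter, rather than deciding case by case, is what makes the policy something both teams can agree to before an incident.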
Identification delay

It’s important to promptly identify the cause(s) of alerts because the longer it takes to identify the root cause of a page, the more opportunity it has to recur and page again. For example, given a page that manifests only under high load, say at daily peak, if the problematic code or configuration is not identified before the next daily peak, it is likely that the problem will happen again. There are several techniques you might use to reduce identification delays:

Use good alerts and consoles
Ensure pages link to relevant monitoring consoles, and that consoles highlight where the system is operating out of specification. In the console, correlate black-box and white-box paging alerts together, and do the same with their associated graphs. Make sure playbooks are up to date with advice on responding to each type of alert. On-call engineers should update the playbook with fresh information when the corresponding page fires.

Practice emergency response
Run “Wheel of Misfortune” exercises (described in Site Reliability Engineering) to share general and service-specific debugging techniques with your colleagues.

Perform small releases
If you perform frequent, smaller releases instead of infrequent monolithic changes, correlating bugs with the corresponding change that introduced them is easier. Canarying releases (see Chapter 16, “Canarying Releases”) gives a strong signal about whether a new bug is due to a new release.

Log changes
Aggregating change information into a searchable timeline makes it simpler (and hopefully quicker) to correlate new bugs with the corresponding change that introduced them. Tools like the Slack plug-in for Jenkins can be helpful.

Ask for help
In Site Reliability Engineering, “Managing Incidents”, we talked about working together to manage large outages. The on-call engineer is never alone; encourage your team to feel safe when asking for help.
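To make the “log changes” technique concrete, here is a minimal sketch of a searchable change timeline. It assumes changes are already aggregated into simple records; the `Change` type, the `changes_before` helper, and the two-hour default window are hypothetical choices for illustration, not part of any particular tool.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Change(NamedTuple):
    when: datetime
    system: str
    description: str


def changes_before(
    timeline: List[Change],
    alert_time: datetime,
    window: timedelta = timedelta(hours=2),  # assumed lookback; tune per service
) -> List[Change]:
    """Return changes made in the window before the alert, newest first."""
    start = alert_time - window
    hits = [c for c in timeline if start <= c.when <= alert_time]
    return sorted(hits, key=lambda c: c.when, reverse=True)


if __name__ == "__main__":
    timeline = [
        Change(datetime(2024, 1, 8, 9, 0), "frontend", "certificate rotation"),
        Change(datetime(2024, 1, 8, 10, 30), "frontend", "binary release"),
        Change(datetime(2024, 1, 8, 11, 45), "loadbalancer", "config push"),
    ]
    for change in changes_before(timeline, datetime(2024, 1, 8, 12, 0)):
        print(change.when.isoformat(), change.system, change.description)
```

Sorting newest first reflects a common debugging heuristic: the most recent change before the alert is usually the first suspect, though not always the culprit.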
Mitigation delay

The longer it takes to mitigate a bug once it’s identified, the more opportunity it has to recur and page again. Consider these techniques for reducing mitigation delays:

Roll back changes
If the bug was introduced in a recent code or configuration rollout, promptly remove the bug from production with a rollback, if safe and appropriate (a rollback alone may be necessary but is not sufficient if the bug caused data corruption, for example). Remember that even a “quick fix” needs time to be tested, built, and rolled out. Testing is vital to making sure the quick fix actually fixes the bug, and that it doesn’t introduce additional bugs or other unintended consequences. Generally, it is better to “roll back, fix, and roll forward” rather than “roll forward, fix, and roll forward again.”

If you aim for 99.99% availability, you have approximately 13 minutes of error budget per quarter. The build step of rolling forward may take much longer than that, so rolling back impacts your users much less.

(99.999% availability affords an error budget of about 80 seconds per quarter. At this point, systems may need self-healing properties, which is out of scope for this chapter.)

If at all possible, avoid changes that can’t be rolled back, such as API-incompatible changes and lockstep releases.

Use feature isolation
Design your system so that if feature X goes wrong, you can disable it via, for example, a feature flag without affecting feature Y. This strategy also improves release velocity, and makes disabling feature X a much simpler decision: you don’t need to check that your product managers are comfortable with also disabling feature Y.

Drain requests away
Drain requests (i.e., redirect customer requests) away from the elements of your system that exhibit the bug. For example, if the bug is the result of a code or config rollout, and you roll out to production gradually, you may have the opportunity to drain the elements of your infrastructure that have received the update. This allows you to mitigate the customer impact in seconds, rather than rolling back, which may take minutes or longer.
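The error budget figures quoted in this section follow from simple arithmetic. The sketch below assumes a 90-day quarter (an assumption; calendar quarters run 90 to 92 days), and `error_budget_minutes` is a hypothetical helper name. The results land close to the rounded figures in the text.

```python
def error_budget_minutes(availability_target: float, days: int = 90) -> float:
    """Allowed downtime, in minutes, for a target over a period of `days`.

    availability_target is a fraction, e.g. 0.9999 for 99.99%.
    """
    if not 0.0 < availability_target < 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    period_minutes = days * 24 * 60
    return (1.0 - availability_target) * period_minutes


if __name__ == "__main__":
    for target in (0.9998, 0.9999, 0.99999):
        minutes = error_budget_minutes(target)
        print(f"{target:.3%}: {minutes:.2f} min ({minutes * 60:.0f} s) per quarter")
```

Comparing these budgets with the duration of your own build-and-release pipeline is a quick way to see why rolling back is usually the cheaper mitigation: a single roll-forward that takes longer than the whole quarterly budget spends it in one incident.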
Alerting

Google SRE’s maximum of two distinct incidents per 12-hour shift encourages us to be thoughtful and cautious about how we configure paging alerts and how we introduce new ones. Site Reliability Engineering, “Monitoring Distributed Systems”, describes Google’s approach to defining the thresholds for paging alerts. Strictly observing these guidelines is critical to maintaining a healthy on-call rotation.

It is worth highlighting some key elements discussed in that chapter:
- All alerts should be immediately actionable. There should be an action we expect a human to take immediately after they receive the page that the system is unable to take itself. The signal-to-noise ratio should be high to ensure few false positives; a low signal-to-noise ratio raises the risk for on-call engineers to develop alert fatigue.
- If a team fully subscribes to SLO-based alerting, or paging only when error budget is burned (see the section “Black-Box Versus White-Box” in Site Reliability Engineering), it is critical that all teams involved in developing and maintaining the service agree about the importance of meeting the SLO and prioritize their work accordingly.
- If a team fully subscribes to SLO-based and symptom-based alerting, relaxing alert thresholds is rarely an appropriate response to being paged.
- Just like new code, new alerts should be thoroughly and thoughtfully reviewed. Each alert should have a corresponding playbook entry.

Receiving a page creates a negative psychological impact. To minimize that impact, only introduce new paging alerts when you really need them. Anyone on the team can write a new alert, but the whole team reviews proposed alert additions and can suggest alternatives. Thoroughly test new alerts in production to vet false positives before they are upgraded to paging alerts. For example, you might email the alert’s author when the alert fires, rather than paging the on-call engineer.

New alerts may find problems in production that you weren’t aware of. After you address these production bugs, alerting will only page on new bugs, effectively functioning like regression tests.
Be sure to run the new alerts in test mode long enough to experience typical periodic production conditions, such as regular software rollouts, maintenance events by your Cloud provider, weekly load peaks, and so on. A week of testing is probably about right. However, this appropriate window depends on the alert and the system.

Finally, use the alert’s trigger rate during the testing period to predict the expected consumption of your pager budget as a result of the new alert. Explicitly approve or disallow the new alert as a team. If introducing a new paging alert causes your service to exceed its paging budget, the stability of the system needs additional attention.

Rigor of follow-up

Aim to identify the root cause of every page. “Root causes” extend out of the machine and into the team’s processes. Was an outage caused by a bug that would have been caught by a unit test? The root cause might not be a bug in the code, but rather a bug in the team’s processes around code review.

If you know the root cause, you can fix and prevent it from ever bothering you or your colleagues again. If your team cannot figure out the root cause, add monitoring and/or logging that will help you find the root cause of the page the next time it occurs. If you don’t have enough information to identify the bug, you can always do something to help debug the page further next time. You should rarely conclude that a page is triggered by “cause unknown.” Remember that as an on-call engineer, you are never alone, so ask a colleague to review your findings and see if there’s anything you missed. Typically, it’s easiest to find the root cause of an alert soon after the alert has triggered and fresh evidence is available.

Explaining away a page as “transient,” or taking no action because the system “fixed itself” or the bug inexplicably “went away,” invites the bug to happen again and cause another page, which causes trouble for the next on-call engineer.
Simply fixing the immediate bug (or making a “point” fix) misses a golden opportunity to prevent similar alerts in the future. Use the paging alert as a chance to surface engineering work that improves the system and obviates an entire class of possible future bugs. Do this by filing a project bug in your team’s production component, and advocate to prioritize its implementation by gathering data about how many individual bugs and pages this project would remove. If your proposal will take 3 working weeks or 120 working hours to implement, and a page costs on average 4 working hours to properly handle, there’s a clear break-even point after 30 pages.

For example, imagine a situation where there are too many servers on the same failure domain, such as a switch in a datacenter, causing regular multiple simultaneous failures:

Point fix
Rebalance your current footprint across more failure domains and stop there.

Systemic fix
Use automation to ensure that this type of server, and all other similar servers, are always spread across sufficient failure domains, and that they rebalance automatically when necessary.

Monitoring (or prevention) fix
Alert preemptively when the failure domain diversity is below the expected level, but not yet service-impacting. Ideally, the alert would be a ticket alert, not a page, since it doesn’t require an immediate response. The system is still serving happily, albeit at a lower level of redundancy.

To make sure you’re thorough in your follow-up to paging alerts, consider the following questions:
- How can I prevent this specific bug from happening again?
- How can I prevent bugs like this from happening again, both for this system and other systems I’m responsible for?
- What tests could have prevented this bug from being released to production?
- What ticket alerts could have triggered action to prevent the bug from becoming critical before it paged?
- What informational alerts could have surfaced the bug on a console before it became critical?
- Have I maximized the impact of the fixes I’m making?
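The break-even estimate from earlier in this section (a 120-hour project weighed against pages that cost 4 hours each) is simple division, and writing it down makes it easy to rerun with your own numbers. In this minimal sketch, the function names and the rate of 5 pages per week are hypothetical.

```python
import math


def break_even_pages(project_hours, hours_per_page):
    """Number of avoided pages after which the project has paid for itself."""
    return math.ceil(project_hours / hours_per_page)


def weeks_to_break_even(project_hours, hours_per_page, pages_per_week):
    """Weeks until break-even, assuming the project removes `pages_per_week` pages."""
    return break_even_pages(project_hours, hours_per_page) / pages_per_week


pages_needed = break_even_pages(project_hours=120, hours_per_page=4)
print(pages_needed)  # 30, matching the figure in the text

# Hypothetical rate: the project would eliminate 5 pages per week.
print(weeks_to_break_even(120, 4, pages_per_week=5))
```

The calculation counts only direct handling time. It deliberately leaves out harder-to-measure costs such as interrupted project work or fatigue, so it is a conservative argument for the systemic fix.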
Of course, it’s not enough for an on-call engineer to just file bugs related to the pages that occur on their shift. It’s incredibly important that bugs identified by the SRE team are dealt with swiftly, to reduce the possibility of them recurring. Make sure resource planning for both the SRE and developer teams considers the effort required to respond to bugs.

We recommend reserving a fraction of SRE and developer team time for responding to production bugs as they arise. For example, a Google on-caller typically doesn’t work on projects during their on-call shift. Instead, they work on bugs that improve the health of the system. Make sure that your team routinely prioritizes production bugs above other project work. SRE managers and tech leads should make sure that production bugs are promptly dealt with, and escalate to the developer team decision makers when necessary.

When a paging event is serious enough to warrant a postmortem, it’s even more important to follow this methodology to catalog and track follow-up action items. (See Postmortem Culture: Learning from Failure for more details.)

Data quality

Once you identify bugs in your system that caused pages, a number of questions naturally arise:
- How do you know which bug to fix first?
- How do you know which component in your system caused most of your pages?
- How do you determine what repetitive, manual action on-call engineers are taking to resolve the pages?
- How do you tell how many alerts with unidentified root causes remain?
- How do you tell which bugs are truly, not just anecdotally, the worst?

The answer is simple: collect data!

When building up your data collection processes, you might track and monitor the patterns in on-call load, but this effort doesn’t scale. It’s far more sustainable to file a placeholder bug for each paging alert in your bug tracking system (e.g., Jira, Issue Tracker), and for the on-call engineer to create a link between the paging alerts from your monitoring system and the relevant bug in the bug tracking system, as and when they realize that each alert is symptomatic of a preexisting issue. You will end up with a list of as-yet-not-understood bugs in one column, and a list of all of the pages that each bug is believed to have caused in the next.
Once you have structured data about the causes of the pages, you can begin to analyze that data and produce reports. Those reports can answer questions such as:
- Which bugs cause the most pages? Ideally we’d roll back and fix bugs immediately, but sometimes, finding the root cause and deploying the fix takes a long time, and sometimes silencing key alerts isn’t a reasonable option. For example, the aforementioned Connection SRE Team might experience ongoing network congestion that isn’t immediately resolvable but still needs to be tracked. Collecting data on which production issues are causing the most pages and stress to the team supports data-driven conversations about prioritizing your engineering effort systematically.
- Which component of the system is the cause of most pages (payments gateway, authentication microservice, etc.)?
- When correlated with your other monitoring data, do particular pages correspond to other signals (peaks in request rate, number of concurrent customer sessions, number of signups, number of withdrawals, etc.)?

Tying structured data to bugs and the root causes of your pages has other benefits:
- You can automatically populate a list of existing bugs (that is, known issues), which may be useful for your support team.
- You can automatically prioritize fixing bugs based on the number of pages each bug causes.

The quality of the data you collect will determine the quality of the decisions either humans or automata can make. To ensure high-quality data, consider the following techniques:
- Define and document your team’s policy and expectations on data collection for pages.
- Set up nonpaging alerts from the monitoring system to highlight where pages were not handled according to those expectations. Managers and tech leads should make sure that the expectations are met.
- Teammates should follow up with each other when handoffs don’t adhere to expectations. Positive comments such as, “Maybe this could be related to bug 123,” “I’ve filed a bug with your findings so we can follow up in more detail,” or “This looks a lot like what happened on my shift last Wednesday” powerfully reinforce the expected behaviors and ensure that you maximize opportunities for improvement. No one wants to be paged for the same issue that paged their teammate in the previous shift.
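Once each page is linked to a bug, the reports described in this section reduce to counting. The following minimal sketch assumes pages have been exported as dictionaries; the field names (`alert`, `bug`, `component`), the bug IDs, and the sample records are hypothetical and do not reflect the schema of any particular bug tracker or monitoring system.

```python
from collections import Counter

# Invented sample export: one record per page. A `bug` of None means the
# on-call engineer has not yet linked the page to a bug.
pages = [
    {"alert": "HighLatency", "bug": "BUG-101", "component": "payments gateway"},
    {"alert": "HighLatency", "bug": "BUG-101", "component": "payments gateway"},
    {"alert": "AuthErrors", "bug": "BUG-102", "component": "authentication microservice"},
    {"alert": "HighLatency", "bug": "BUG-101", "component": "payments gateway"},
    {"alert": "DiskFull", "bug": None, "component": "authentication microservice"},
]


def pages_per(pages, key):
    """Count pages grouped by `key`, most frequent first, skipping missing values."""
    return Counter(p[key] for p in pages if p[key] is not None).most_common()


by_bug = pages_per(pages, "bug")
by_component = pages_per(pages, "component")
unlinked = sum(1 for p in pages if p["bug"] is None)

print("Pages per bug:", by_bug)
print("Pages per component:", by_component)
print("Pages with no linked bug:", unlinked)
```

The count of unlinked pages doubles as a data quality signal: if it grows, pages are not being handled according to the team's documented expectations.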
Vigilance

All too often, teams fall into operational overload by a thousand cuts. To avoid boiling the frog, it is important to pay attention to the health of on-call engineers over time, and ensure that production health is consistently and continuously prioritized by both SRE and developer teams.

The following techniques can help a team keep a watchful eye on pager load:
- At production meetings (see the section “Communications: Production Meetings” in Site Reliability Engineering, Chapter 31), regularly talk about trends in pager load based on the structured data collected. We’ve found a 21-day trailing average to be useful.
- Set up ticket alerts, possibly targeted at tech leads or managers, for when pager load crosses a “warning” threshold that your team agrees on beforehand.
- Hold regular meetings between the SRE team and developer team to discuss the current state of production and the outstanding production bugs that are paging SRE.

On-Call Flexibility

Shift Length

An on-call rotation that has to handle one or more pages per day must be structured in a sustainable way: we recommend limiting shift lengths to 12 hours. Shorter shifts are better for the mental health of your engineers. Team members run the risk of exhaustion when shifts run long, and when people are tired, they make mistakes. Most humans simply can't produce high-quality work if they're on-call continuously. Many countries have laws about maximum working hours, breaks, and working conditions.

While spreading on-call shifts across a team's daylight hours is ideal, a 12-hour shift system doesn't necessitate a globally distributed team. Being on-call overnight for 12 hours is preferable to being on-call for 24 hours or more. You can make 12-hour shifts work even in a single location. For example, instead of asking a single engineer to be on-call for 24 hours a day across an entire week-long shift, it would be better for two engineers to split a week of on-call, with one person on-call during the day and one on-call overnight.

In our experience, 24 hours of on-call duty without reprieve isn't a sustainable setup. While not ideal, occasional overnight 12-hour shifts at least ensure breaks for your engineers. Another option is to shorten shifts to last less than a week: something like 3 days on, 4 days off.
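Returning to the pager-load trends discussed under Vigilance, a 21-day trailing average and a "warning" threshold can be computed from a simple list of daily page counts. This is a minimal sketch: the threshold of 2 pages per day and the sample counts are hypothetical values that a team would agree on and collect for itself.

```python
def trailing_average(daily_pages, window=21):
    """Average pages per day over the last `window` days (fewer if history is shorter)."""
    recent = daily_pages[-window:]
    if not recent:
        return 0.0
    return sum(recent) / len(recent)


WARNING_THRESHOLD = 2.0  # hypothetical: pages per day agreed on by the team

# Invented sample: 11 quiet days followed by 10 noisy days, oldest first.
daily_pages = [1] * 11 + [4] * 10

average = trailing_average(daily_pages)
should_file_ticket = average > WARNING_THRESHOLD

print(f"21-day trailing average: {average:.2f} pages/day")
if should_file_ticket:
    print("Pager load crossed the warning threshold; file a ticket alert.")
```

A trailing average smooths out single bad days, so the ticket fires on a sustained trend, not on one noisy shift.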
Scenario: A change in personal circumstances

Imagine you are a member of an on-call team for a large service that has a 24/7 follow-the-sun model split across two sites. The arrangement works well for you. While you’re not thrilled about the possibility of a 6 a.m. page, you are happy with the work you and the team are doing to keep the operational load manageable while improving the reliability of the service.

All is well… until one day you realize that the on-call schedule and the demands of your personal life are starting to clash. There are many potential reasons why: for example, becoming a parent, needing to travel on short notice and take a leave from work, or illness.

You need your on-call duties to coexist with your new personal schedule.

Many teams and organizations face this challenge as they mature. People’s needs change over time, and maintaining a healthy balance of diverse teammate backgrounds leads to an on-call rotation characterized by diverse needs. The key to keeping a healthy, fair, and equitable balance of on-call work and personal life is flexibility.

There are a number of ways that you can apply flexibility to on-call rotations to meet the needs of team members while still ensuring coverage for your services or products. It is impossible to write down a comprehensive, one-size-fits-all set of guidelines. We encourage embracing flexibility as a principle rather than simply adopting the examples listed here.

Automate on-call scheduling

As teams grow, accounting for scheduling constraints (vacation plans, distribution of on-call weekdays versus weekends, individual preferences, religious requirements, and so on) becomes increasingly difficult. You can’t manage this task manually; it’s hard to find any solution at all, much less a fair one.

“Fairness” doesn’t mean a completely uniform distribution of each type of shift across team members. Different people have different needs and different preferences. Therefore, it’s important for the team to share those preferences and try to meet them in an intelligent way. Team composition and preferences dictate whether your team prefers a uniform distribution, or a more customized way of meeting scheduling preferences.
Using an automated tool to schedule on-call shifts makes this task much easier. This tool should have a few basic characteristics:
- It should rearrange on-call shifts to accommodate the changing needs of team members.
- It should automatically rebalance on-call load in response to any changes.
- It should do its best to ensure fairness by factoring in personal preferences such as “no primary during weekends in April,” as well as historical information such as recent on-call load per engineer.
- So that on-call engineers can plan around their on-call shifts, it must never change an already generated schedule.

Schedule generation can be either fully automated or scheduled by a human. Likewise, some teams prefer to have members explicitly sign off on the schedule, while others are comfortable with a fully automated process. You might opt to develop your own tool in-house if your needs are complex, but there are a number of commercial and open source software packages that can aid in automating on-call scheduling.

Plan for short-term swaps

Requests for short-term changes in the on-call schedule happen frequently. No one can promise on Monday that they won’t have the flu on Thursday. Or you might need to run an unforeseen urgent errand in the middle of your on-call shift.

You may also want to facilitate on-call swaps for nonurgent reasons, for example, to allow on-callers to attend sports training sessions. In this situation, team members can swap a subset of the on-call day (for example, half of Sunday). Nonurgent swaps are typically best-effort.

Teams with a strict pager response SLO need to take commute coverage into account. If your pager response SLO is 5 minutes, and your commute is 30 minutes, you need to make sure that someone else can respond to emergencies while you get to work.

To achieve these goals in flexibility, we recommend giving team members the ability to update the on-call rotation. Also, have a documented policy in place describing how swaps should work. Decentralization options range from a fully centralized policy, where only the manager can change the schedule, to a fully decentralized one, where any team member can change the policy independently. In our experience, instituting peer review of changes provides a good tradeoff between safety and flexibility.
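To make the scheduler characteristics described in this section more concrete, here is a minimal greedy sketch. It is not how any particular scheduling product works. It assumes one engineer per day, treats preferences as a set of unavailable dates per engineer, and uses the number of recent shifts as the only fairness signal. The function name `extend_schedule` and the sample names and dates are hypothetical.

```python
from datetime import date, timedelta


def extend_schedule(published, engineers, unavailable, recent_load, days):
    """Assign one engineer to each day in `days`, leaving `published` untouched.

    published:   {day: engineer} assignments that were already announced
    unavailable: {engineer: set of days they cannot cover}
    recent_load: {engineer: number of recent shifts}, used for fairness
    """
    schedule = dict(published)
    load = dict(recent_load)
    for day in days:
        if day in schedule:
            continue  # never change an already generated assignment
        candidates = [e for e in engineers if day not in unavailable.get(e, set())]
        if not candidates:
            raise ValueError(f"no engineer is available on {day}")
        # Pick the least-loaded available engineer; break ties by name.
        chosen = min(candidates, key=lambda e: (load.get(e, 0), e))
        schedule[day] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return schedule


# Invented sample data, for illustration only.
start = date(2024, 4, 1)
days = [start + timedelta(days=i) for i in range(4)]
schedule = extend_schedule(
    published={start: "ana"},
    engineers=["ana", "bo", "cy"],
    unavailable={"bo": {date(2024, 4, 2)}},
    recent_load={"ana": 3, "bo": 1, "cy": 2},
    days=days,
)
for day in days:
    print(day.isoformat(), schedule[day])
```

A real tool would need richer constraints (weekday versus weekend balance, minimum rest between shifts, swap requests), but the two properties shown here, load-aware selection and leaving published shifts alone, carry over.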
Plan for long-term breaks

Sometimes team members need to stop serving in the on-call rotation because of changes in personal circumstances or burnout. It’s important that teams are structured to allow on-callers to temporarily leave the rotation.

Ideally, team size should allow for a (temporary) staff reduction without causing the rest of the team to suffer too much operational load. In our experience, you need a bare minimum of five people per site to sustain on-call in a multisite, 24/7 configuration, and eight people in a single-site, 24/7 configuration. Therefore, it is safe to assume each site will need one extra engineer as protection against staff reduction, bringing the minimum staffing to six engineers per site (multisite) or nine per site (single-site).

Plan for part-time work schedules

Being on-call with part-time working schedules may seem incompatible, but we’ve found that on-call and part-time work arrangements are compatible if you take certain precautions. The following discussion assumes that if a member of your on-call rotation works part-time, they’ll be unavailable for on-call shifts outside of their part-time working week.

There are two main models of part-time working:
- Working a reduced number of full days per week, for example, four 8-hour days a week, instead of five
- Working a reduced amount of time each day, for example, 6 hours a day, instead of 8 hours a day

Both models are compatible with on-call work, but require different adjustments to on-call scheduling.

The first model easily coexists with on-call work, especially if the nonworking day(s) are constant over time. In response, you can adopt an on-call shift length of fewer than seven days a week (for example, Monday through Thursday, or Friday through Sunday) and configure the automated scheduler not to schedule the part-time engineer(s) to be on-call on the days they don’t work.

The second model is possible in a couple of ways:
- Split on-call hours with another engineer, so that no one is on-call when they are not supposed to be. For example, if an on-call engineer needs to work from 9 a.m. to 4 p.m., you can assign the first half of the shift (9 a.m. to 3 p.m.) to them. Rotate the second half (3 p.m. to 9 p.m.) within the team the same way you rotate other on-call shifts.
- The part-time engineer can work full hours on their on-call days, which may be feasible if the on-call shift is not too frequent.

As mentioned in Chapter 11 of Site Reliability Engineering, Google SRE compensates support outside of regular hours with a reduced hourly rate of pay or time off, according to local labor law and regulations. Take a part-time engineer’s reduced schedule into account when determining on-call compensation.

In order to maintain a proper balance between project time and on-call time, engineers working reduced hours should receive a proportionately smaller amount of on-call work. Larger teams absorb this additional on-call load more easily than smaller teams.

On-Call Team Dynamics

Our first book touched upon how stress factors like high pager load and time pressure can force on-call engineers to adopt decision strategies based on intuition and heuristics rather than reason and data (see the section “Feeling Safe” in Chapter 11 of that book). Working from this discussion of team psychology, how do you go about building a team with positive dynamics? Consider an on-call team with the following set of hypothetical problems.

Scenario: A culture of “survive the week”

A company begins with a couple of founders and a handful of employees, all feature developers. Everyone knows everyone else, and everyone takes pagers.

The company grows bigger. On-call duty is limited to a smaller set of more experienced feature developers who know the system better.

The company grows even bigger. They add an ops role to tackle reliability. This team is responsible for production health, and the job role is focused on operations, not coding. The on-call becomes a joint rotation between feature developers and ops. Feature developers have the final say in maintaining the service, and ops input is limited to operational tasks. By this time, there are 30 engineers in the on-call rotation: 25 feature developers and 5 ops, all located at the same site.

The team is plagued by high pager volume. Despite following the recommendations described earlier in this chapter to minimize pager load, the team is suffering from low morale. Because the feature developers prioritize developing new features, on-call follow-up takes a long time to implement.
To make matters worse, the feature developers are concerned about their own subsystem’s health. One feature developer insists on paging by error rate rather than error ratio for their mission-critical module, despite complaints from others on the team. These alerts are noisy, and return many false positives or unactionable pages.

Other members of the on-call rotation aren’t especially bothered by the high pager volume. Sure, there are a lot of pages, but most of them don’t take much time to resolve. As one on-call engineer puts it: “I take a quick look at the page subject and know they are duplicates. So I just ignore them.”

Sound familiar?

Some Google teams experienced similar problems during their earlier days of maturity. If not handled carefully, these problems have the potential to tear the feature developer and ops teams apart and hinder on-call operation. There’s no silver bullet to solve these problems, but we found a couple of approaches particularly helpful. While your methodology may differ, your overall goal should be the same: build positive team dynamics, and carefully avoid tailspin.

Proposal one: Empower your ops engineers

You can remodel the operations organization according to the guidelines outlined in this book and Site Reliability Engineering, perhaps even including a change of name (SRE, or similar) to indicate the change of role. Simply retitling your ops organization is not a panacea, but it can be helpful in communicating an actual change in responsibilities away from the old ops-centric model. Make it clear to the team and the entire company that SREs own the site operation. This includes defining a shared roadmap for reliability, driving the full resolution of issues, maintaining monitoring rules, and so on. Feature developers are necessary collaborators but don’t own these endeavors.

To return to our hypothetical team, this announcement ushered in the following operational changes:
- Action items are assigned only to the five DevOps engineers, who are now SREs. SREs work with subject experts (many of them feature developers) to accomplish these tasks. SREs take on the previously mentioned “error rate versus error ratio” debate by negotiating a change in alerting with the feature developers.
- SREs are encouraged to dive into the code to make the changes themselves, if possible. They send code reviews to the subject experts. This has the benefit of building a sense of ownership among SREs, as well as upgrading their skills and authority for future occasions.

With this arrangement, feature developers are explicit collaborators on reliability features, and SREs are given the responsibility to own and improve the site.

Proposal two: Improve team relations

Another possible solution is to build stronger team bonds between team members. Google designates a “fun budget” specifically for organizing offsite activities to strengthen team bonds.

We’ve found that more robust team relationships create a spirit of increased understanding and collaboration among teammates. As a result, engineers are more likely to fix bugs, finish action items, and help out their colleagues. For example, say you turned off a nightly pipeline job, but forgot to turn off the monitoring that checked if the pipeline ran successfully. As a result, you accidentally page a colleague at 3 a.m. If you’ve spent a little time with that colleague, you’d feel much worse about what happened, and strive to be considerate by being more careful in the future. The mentality of “I protect my colleagues” translates to a more productive work atmosphere.

We’ve also found that making all members of the on-call rotation sit together, regardless of job title and function area, helps improve team relations tremendously. Encourage teams to eat lunch with each other. Don’t underestimate the power of relatively straightforward changes like these. They play directly into team dynamics.

Conclusion

SRE on-call is different than traditional ops roles. Rather than focusing solely on day-to-day operations, SRE fully owns the production environment, and seeks to better it through defining appropriate reliability thresholds, developing automation, and undertaking strategic engineering projects. On-call is critical for site operations, and handling it right is crucial to the company’s bottom line.
On-call is a source of much tension, both individually and collectively. But if you’ve stared into the eyes of the monster long enough, there is wisdom to be found. This chapter illustrates some of the lessons about on-call that we learned the hard way; we hope that our experience can help others avoid or tackle similar issues.

If your on-call team is drowning in endless alerts, we recommend taking a step back to observe the situation from a higher level. Compare notes with other SRE and partner teams. Once you’ve gathered the necessary information, address the problems in a systematic way. Thoughtfully structuring on-call is time well spent for on-call engineers, on-call teams, and the whole company.

1 Note that this example is often a red flag situation for organizations that aren’t actually practicing DevOps, in which case, a name change won’t fix more structural problems.
2 One “incident” is defined as one “problem,” no matter how many alerts have been fired for the same “problem.” One shift is 12 hours.
3 There is more on this topic in Seeking SRE by David Blank-Edelman (O’Reilly).
4 SRE teams at Google are paired across time zones for service continuity.
5 A “bug” in this context is any undesirable system behavior resulting from software or configuration error. Logic errors in code, incorrect configuration of a binary, incorrect capacity planning, misconfigured load balancers, or newly discovered vulnerabilities are all valid examples of “production bugs” that contribute to pager load.


