Chapter 11 - Being On-Call - Site Reliability Engineering

Written by Andrea Spadaccini [56]
Edited by Kavita Guliani
Being on-call is a critical duty that many operations and engineering teams must undertake in order to keep their services reliable and available. However, there are several pitfalls in the organization of on-call rotations and responsibilities that can lead to serious consequences for the services and for the teams if not avoided. This chapter describes the primary tenets of the approach to on-call that Google’s Site Reliability Engineers (SREs) have developed over years, and explains how that approach has led to reliable services and sustainable workload over time.

Introduction

Several professions require employees to perform some sort of on-call duty, which entails being available for calls during both working and nonworking hours. In the IT context, on-call activities have historically been performed by dedicated Ops teams tasked with the primary responsibility of keeping the service(s) for which they are responsible in good health.

Many important services in Google, e.g., Search, Ads, and Gmail, have dedicated teams of SREs responsible for the performance and reliability of these services. Thus, SREs are on-call for the services they support. The SRE teams are quite different from purely operational teams in that they place heavy emphasis on the use of engineering to approach problems. These problems, which typically fall in the operational domain, exist at a scale that would be intractable without software engineering solutions.

To enforce this type of problem solving, Google hires people with a diverse background in systems and software engineering into SRE teams. We cap the amount of time SREs spend on purely operational work at 50%; at minimum, 50% of an SRE’s time should be allocated to engineering projects that further scale the impact of the team through automation, in addition to improving the service.

Life of an On-Call Engineer

This section describes the typical activities of an on-call engineer and provides some background for the rest of the chapter.

As the guardians of production systems, on-call engineers take care of their assigned operations by managing outages that affect the team and performing and/or vetting production changes.
When on-call, an engineer is available to perform operations on production systems within minutes, according to the paging response times agreed to by the team and the business system owners. Typical values are 5 minutes for user-facing or otherwise highly time-critical services, and 30 minutes for less time-sensitive systems. The company provides the page-receiving device, which is typically a phone. Google has flexible alert delivery systems that can dispatch pages via multiple mechanisms (email, SMS, robot call, app) across multiple devices.

Response times are related to desired service availability, as demonstrated by the following simplistic example: if a user-facing system must obtain 4 nines of availability in a given quarter (99.99%), the allowed quarterly downtime is around 13 minutes (Availability Table). This constraint implies that the reaction time of on-call engineers has to be on the order of minutes (strictly speaking, 13 minutes). For systems with more relaxed SLOs, the reaction time can be on the order of tens of minutes.

As soon as a page is received and acknowledged, the on-call engineer is expected to triage the problem and work toward its resolution, possibly involving other team members and escalating as needed.

Non-paging production events, such as lower priority alerts or software releases, can also be handled and/or vetted by the on-call engineer during business hours. These activities are less urgent than paging events, which take priority over almost every other task, including project work. For more insight on interrupts and other non-paging events that contribute to operational load, see Dealing with Interrupts.

Many teams have both a primary and a secondary on-call rotation. The distribution of duties between the primary and the secondary varies from team to team. One team might employ the secondary as a fall-through for the pages the primary on-call misses. Another team might specify that the primary on-call handles only pages, while the secondary handles all other non-urgent production activities.

In teams for which a secondary rotation is not strictly required for duty distribution, it is common for two related teams to serve as secondary on-call for each other, with fall-through handling duties. This setup eliminates the need for an exclusive secondary on-call rotation.
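The availability arithmetic earlier in this section (99.99% over a quarter allows roughly 13 minutes of downtime) can be reproduced with a few lines of Python. This is only a sketch: the function name `allowed_downtime_minutes` and the assumption that a quarter lasts 90 days are illustrative choices, not something the chapter specifies.

```python
def allowed_downtime_minutes(availability: float, period_days: float = 90.0) -> float:
    """Return the downtime budget, in minutes, for a target availability.

    availability is a fraction (0.9999 for "4 nines"); period_days defaults
    to an assumed 90-day quarter.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability)


if __name__ == "__main__":
    for target in (0.999, 0.9999):
        budget = allowed_downtime_minutes(target)
        print(f"{target:.2%} over a 90-day quarter: {budget:.2f} minutes of downtime")
```

Under the 90-day assumption, 4 nines leaves 129,600 minutes x 0.0001 = 12.96 minutes per quarter, which is why the response time for such a system has to be measured in minutes.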
There are many ways to organize on-call rotations; for detailed analysis, refer to the "Oncall" chapter of [Lim14].

Balanced On-Call

SRE teams have specific constraints on the quantity and quality of on-call shifts. The quantity of on-call can be calculated by the percent of time spent by engineers on on-call duties. The quality of on-call can be calculated by the number of incidents that occur during an on-call shift.

SRE managers have the responsibility of keeping the on-call workload balanced and sustainable across these two axes.

Balance in Quantity

We strongly believe that the "E" in "SRE" is a defining characteristic of our organization, so we strive to invest at least 50% of SRE time into engineering: of the remainder, no more than 25% can be spent on-call, leaving up to another 25% on other types of operational, nonproject work.

Using the 25% on-call rule, we can derive the minimum number of SREs required to sustain a 24/7 on-call rotation. Assuming that there are always two people on-call (primary and secondary, with different duties), the minimum number of engineers needed for on-call duty from a single-site team is eight: assuming week-long shifts, each engineer is on-call (primary or secondary) for one week every month. For dual-site teams, a reasonable minimum size of each team is six, both to honor the 25% rule and to ensure a substantial and critical mass of engineers for the team.

If a service entails enough work to justify growing a single-site team, we prefer to create a multi-site team. A multi-site team is advantageous for two reasons:

- Night shifts have detrimental effects on people’s health [Dur05], and a multi-site "follow the sun" rotation allows teams to avoid night shifts altogether.
- Limiting the number of engineers in the on-call rotation ensures that engineers do not lose touch with the production systems (see A Treacherous Enemy: Operational Underload).

However, multi-site teams incur communication and coordination overhead. Therefore, the decision to go multi-site or single-site should be based upon the trade-offs each option entails, the importance of the system, and the workload each system generates.
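The single-site figure of eight engineers in Balance in Quantity follows from dividing the number of people who must be on-call at any moment by the maximum share of time each engineer may spend on-call. A minimal sketch of that arithmetic, assuming a hypothetical helper named `min_rotation_size` and a cap expressed as a whole-number percentage:

```python
import math
from fractions import Fraction


def min_rotation_size(concurrent_roles: int, max_oncall_percent: int) -> int:
    """Smallest team for which nobody exceeds the on-call cap.

    concurrent_roles: people on-call at any moment (2 for primary + secondary).
    max_oncall_percent: cap on each engineer's on-call time, as a whole percent.
    """
    if concurrent_roles < 1:
        raise ValueError("at least one person must be on-call")
    if not 0 < max_oncall_percent <= 100:
        raise ValueError("max_oncall_percent must be in (0, 100]")
    # Fraction keeps the division exact, so the ceiling never rounds up
    # because of floating-point noise.
    return math.ceil(Fraction(concurrent_roles * 100, max_oncall_percent))
```

With the chapter's numbers, `min_rotation_size(2, 25)` gives 8. The dual-site recommendation of six engineers per site also reflects the need for a critical mass of engineers, so this helper is not meant to reproduce that figure.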
Balance in Quality

For each on-call shift, an engineer should have sufficient time to deal with any incidents and follow-up activities such as writing postmortems [Loo10]. Let’s define an incident as a sequence of events and alerts that are related to the same root cause and would be discussed as part of the same postmortem. We’ve found that on average, dealing with the tasks involved in an on-call incident (root-cause analysis, remediation, and follow-up activities like writing a postmortem and fixing bugs) takes 6 hours. It follows that the maximum number of incidents is 2 per 12-hour on-call shift (12 hours divided by 6 hours per incident). In order to stay within this upper bound, the distribution of paging events should be very flat over time, with a likely median value of 0: if a given component or issue causes pages every day (median incidents/day > 1), it is likely that something else will break at some point, thus causing more incidents than should be permitted.

If this limit is temporarily exceeded, e.g., for a quarter, corrective measures should be put in place to make sure that the operational load returns to a sustainable state (see Operational Overload and Embedding an SRE to Recover from Operational Overload).

Compensation

Adequate compensation needs to be considered for out-of-hours support. Different organizations handle on-call compensation in different ways; Google offers time-off-in-lieu or straight cash compensation, capped at some proportion of overall salary. The compensation cap represents, in practice, a limit on the amount of on-call work that will be taken on by any individual. This compensation structure gives engineers an incentive to be involved in on-call duties as required by the team, but also promotes a balanced on-call work distribution and limits potential drawbacks of excessive on-call work, such as burnout or inadequate time for project work.

Feeling Safe

As mentioned earlier, SRE teams support Google’s most critical systems. Being an SRE on-call typically means assuming responsibility for user-facing, revenue-critical systems or for the infrastructure required to keep these systems up and running. SRE methodology for thinking about and tackling problems is vital for the appropriate operation of services.
Modern research identifies two distinct ways of thinking that an individual may, consciously or subconsciously, choose when faced with challenges [Kah11]:

- Intuitive, automatic, and rapid action
- Rational, focused, and deliberate cognitive functions

When one is dealing with the outages related to complex systems, the second of these options is more likely to produce better results and lead to well-planned incident handling.

To make sure that the engineers are in the appropriate frame of mind to leverage the latter mindset, it’s important to reduce the stress related to being on-call. The importance and the impact of the services and the consequences of potential outages can create significant pressure on the on-call engineers, damaging the well-being of individual team members and possibly prompting SREs to make incorrect choices that can endanger the availability of the service. Stress hormones like cortisol and corticotropin-releasing hormone (CRH) are known to cause behavioral consequences, including fear, that can impair cognitive functions and cause suboptimal decision making [Chr09].

Under the influence of these stress hormones, the more deliberate cognitive approach is typically subsumed by unreflective and unconsidered (but immediate) action, leading to potential abuse of heuristics. Heuristics are very tempting behaviors when one is on-call. For example, when the same alert pages for the fourth time in the week, and the previous three pages were initiated by an external infrastructure system, it is extremely tempting to exercise confirmation bias by automatically associating this fourth occurrence of the problem with the previous cause.

While intuition and quick reactions can seem like desirable traits in the middle of incident management, they have downsides. Intuition can be wrong and is often less supportable by obvious data. Thus, following intuition can lead an engineer to waste time pursuing a line of reasoning that is incorrect from the start. Quick reactions are deep-rooted in habit, and habitual responses are unconsidered, which means they can be disastrous. The ideal methodology in incident management strikes the perfect balance of taking steps at the desired pace when enough data is available to make a reasonable decision while simultaneously critically examining your assumptions.
It’s important that on-call SREs understand that they can rely on several resources that make the experience of being on-call less daunting than it may seem. The most important on-call resources are:

- Clear escalation paths
- Well-defined incident-management procedures
- A blameless postmortem culture ([Loo10], [All12])

The developer teams of SRE-supported systems usually participate in a 24/7 on-call rotation, and it is always possible to escalate to these partner teams when necessary. The appropriate escalation of outages is generally a principled way to react to serious outages with significant unknown dimensions.

When one is handling incidents, if the issue is complex enough to involve multiple teams or if, after some investigation, it is not yet possible to estimate an upper bound for the incident’s time span, it can be useful to adopt a formal incident-management protocol. Google SRE uses the protocol described in Managing Incidents, which offers an easy-to-follow and well-defined set of steps that aid an on-call engineer to rationally pursue a satisfactory incident resolution with all the required help. This protocol is internally supported by a web-based tool that automates most of the incident management actions, such as handing off roles and recording and communicating status updates. This tool allows incident managers to focus on dealing with the incident, rather than spending time and cognitive effort on mundane actions such as formatting emails or updating several communication channels at once.

Finally, when an incident occurs, it’s important to evaluate what went wrong, recognize what went well, and take action to prevent the same errors from recurring in the future. SRE teams must write postmortems after significant incidents and detail a full timeline of the events that occurred. By focusing on events rather than the people, these postmortems provide significant value. Rather than placing blame on individuals, they derive value from the systematic analysis of production incidents. Mistakes happen, and software should make sure that we make as few mistakes as possible. Recognizing automation opportunities is one of the best ways to prevent human errors [Loo10].
Avoiding Inappropriate Operational Load

As mentioned in Balanced On-Call, SREs spend at most 50% of their time on operational work. What happens if operational activities exceed this limit?

Operational Overload

The SRE team and leadership are responsible for including concrete objectives in quarterly work planning in order to make sure that the workload returns to sustainable levels. Temporarily loaning an experienced SRE to an overloaded team, discussed in Embedding an SRE to Recover from Operational Overload, can provide enough breathing room so that the team can make headway in addressing issues.

Ideally, symptoms of operational overload should be measurable, so that the goals can be quantified (e.g., number of daily tickets < 5, paging events per shift < 2).

Misconfigured monitoring is a common cause of operational overload. Paging alerts should be aligned with the symptoms that threaten a service’s SLOs. All paging alerts should also be actionable. Low-priority alerts that bother the on-call engineer every hour (or more frequently) disrupt productivity, and the fatigue such alerts induce can also cause serious alerts to be treated with less attention than necessary. See Dealing with Interrupts for further discussion.

It is also important to control the number of alerts that the on-call engineers receive for a single incident. Sometimes a single abnormal condition can generate several alerts, so it’s important to regulate the alert fan-out by ensuring that related alerts are grouped together by the monitoring or alerting system. If, for any reason, duplicate or uninformative alerts are generated during an incident, silencing those alerts can provide the necessary quiet for the on-call engineer to focus on the incident itself. Noisy alerts that systematically generate more than one alert per incident should be tweaked to approach a 1:1 alert/incident ratio. Doing so allows the on-call engineer to focus on the incident instead of triaging duplicate alerts.

Sometimes the changes that cause operational overload are not under the control of the SRE teams. For example, the application developers might introduce changes that cause the system to be more noisy, less reliable, or both. In this case, it is appropriate to work together with the application developers to set common goals to improve the system.
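The 1:1 alert/incident target described above can be checked from a page log. The following is a rough sketch under stated assumptions: it presumes each page has already been attributed to an incident, and the `Page` record, its field names, and both helper functions are hypothetical, not part of any Google tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Page:
    alert_name: str   # which alerting rule fired (hypothetical field)
    incident_id: str  # incident the page was attributed to (hypothetical field)


def alerts_per_incident(pages: Iterable[Page]) -> Dict[str, float]:
    """Map each alert name to its mean number of pages per incident."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for page in pages:
        counts[page.alert_name][page.incident_id] += 1
    return {
        name: sum(by_incident.values()) / len(by_incident)
        for name, by_incident in counts.items()
    }


def noisy_alerts(pages: Iterable[Page], max_ratio: float = 1.0) -> List[str]:
    """Return alert names that fire more than max_ratio times per incident."""
    ratios = alerts_per_incident(pages)
    return sorted(name for name, ratio in ratios.items() if ratio > max_ratio)
```

An alert that paged three times during one incident has a ratio of 3.0 and would be reported by `noisy_alerts`, making it a candidate for grouping or tuning; an alert that paged once per incident has a ratio of 1.0 and would not.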
In extreme cases, SRE teams may have the option to "give back the pager": SRE can ask the developer team to be exclusively on-call for the system until it meets the standards of the SRE team in question. Giving back the pager doesn’t happen very frequently, because it’s almost always possible to work with the developer team to reduce the operational load and make a given system more reliable. In some cases, though, complex or architectural changes spanning multiple quarters might be required to make a system sustainable from an operational point of view. In such cases, the SRE team should not be subject to an excessive operational load. Instead, it is appropriate to negotiate the reorganization of on-call responsibilities with the development team, possibly routing some or all paging alerts to the developer on-call. Such a solution is typically a temporary measure, during which time the SRE and developer teams work together to get the service in shape to be on-boarded by the SRE team again.

The possibility of renegotiating on-call responsibilities between SRE and product development teams attests to the balance of powers between the teams. [57] This working relationship also exemplifies how the healthy tension between these two teams and the values that they represent (reliability versus feature velocity) is typically resolved by greatly benefiting the service and, by extension, the company as a whole.

A Treacherous Enemy: Operational Underload

Being on-call for a quiet system is blissful, but what happens if the system is too quiet or when SREs are not on-call often enough? An operational underload is undesirable for an SRE team. Being out of touch with production for long periods of time can lead to confidence issues, both in terms of overconfidence and underconfidence, while knowledge gaps are discovered only when an incident occurs.
To counteract this eventuality, SRE teams should be sized to allow every engineer to be on-call at least once or twice a quarter, thus ensuring that each team member is sufficiently exposed to production. "Wheel of Misfortune" exercises (discussed in Accelerating SREs to On-Call and Beyond) are also useful team activities that can help to hone and improve troubleshooting skills and knowledge of the service. Google also has a company-wide annual disaster recovery event called DiRT (Disaster Recovery Training) that combines theoretical and practical drills to perform multiday testing of infrastructure systems and individual services; see [Kri12].

Conclusions

The approach to on-call described in this chapter serves as a guideline for all SRE teams in Google and is key to fostering a sustainable and manageable work environment. Google’s approach to on-call has enabled us to use engineering work as the primary means to scale production responsibilities and maintain high reliability and availability despite the increasing complexity and number of systems and services for which SREs are responsible.

While this approach might not be immediately applicable to all contexts in which engineers need to be on-call for IT services, we believe it represents a solid model that organizations can adopt in scaling to meet a growing volume of on-call work.

[56] An earlier version of this chapter appeared as an article in ;login: (October 2015, vol. 40, no. 5).

[57] For more discussion on the natural tension between SRE and product development teams, see Introduction.


