Chapter 8 - On-Call
By Ollie Cook, Sara Smollett, Andrea Spadaccini, Cara Donnelly, Jian Ma, and Garrett Plasky (Evernote) with Stephen Thorne and Jessie Yang
Being on-call means being available during a set period of time, and being ready to respond to production incidents during that time with appropriate urgency. Site Reliability Engineers (SREs) are often required to take part in on-call rotations. During on-call shifts, SREs diagnose, mitigate, fix, or escalate incidents as needed. In addition, SREs are regularly responsible for nonurgent production duties.

At Google, being on-call is one of the defining characteristics of SRE. SRE teams mitigate incidents, repair production problems, and automate operational tasks. Since most of our SRE teams have not yet fully automated all their operational tasks, escalations need human points of contact: on-call engineers. Depending on how critical the supported systems are, or the state of development the systems are in, not all SRE teams may need to be on-call. In our experience, most SRE teams staff on-call shifts.

On-call is a large and complex topic, saddled with many constraints and a limited margin for trial and error. Chapter 11 of our first book (Site Reliability Engineering), "Being On-Call," already explored this topic. This chapter addresses specific feedback and questions we received about that chapter. These include the following:

- "We are not Google; we're much smaller. We don't have as many people in the rotation, and we don't have sites in different time zones. What you described in your first book is irrelevant to me."
- "We have a mixture of developers and DevOps for on-call rotation. What's the best way to organize them? Split the responsibilities?"
- "Our on-call engineer gets paged about a hundred times in a typical 24-hour shift. A lot of pages get ignored, while the real problems are buried under the pile. Where should we start?"
- "We have a high turnover rate for on-call rotations. How do you address the knowledge gap within the team?"
- "We want to reorg our DevOps team into SRE. What's the difference between SRE on-call, DevOps on-call, and developers on-call? Please be specific, because the DevOps team is very concerned about this."

We offer practical advice for these situations. Google is a large company with a mature SRE organization, but much of what we've learned over the years can be applied to any company or organization, regardless of size or maturity. Google has hundreds of on-call rotations across services of all sizes, and various on-call setups from simple to complicated. On-call is not exclusively an SRE function: many developer teams are directly on-call for their service. Each on-call setup meets the needs of a particular service.

This chapter describes on-call setups both within Google and outside of Google. While your setup and situation will likely differ from our specific examples, the essential concepts we cover are widely applicable.

We then delve into the anatomy of pager load, explaining what causes pager load, and suggest strategies to optimize your on-call setup and minimize that load.

Finally, we share two examples of practices inside Google: on-call flexibility and on-call team dynamics. These practices show that no matter how mathematically sound an on-call setup is, you cannot rely solely on the logistics of the setup. Incentives and human nature play an important role, and should also be taken into account.
Recap of "Being On-Call" Chapter of First SRE Book

Site Reliability Engineering, in "Being On-Call", explains the principles behind on-call rotations at Google. This section recaps the main points of that chapter.

At Google, the overall goal of being on-call is to provide coverage for critical services, while making sure that we never achieve reliability at the expense of an on-call engineer's health. As a result, SRE teams strive for balance. SRE work should be a healthy mix of duties: on-call and project work. Specifying that SREs spend at least 50% of their time on project work means that teams have time to tackle the projects required to strategically address any problems found in production. Team staffing must be adequate to ensure time for project work.

We target a maximum of two incidents per on-call shift, to ensure adequate time for follow-up. If the pager load gets too high, corrective action is warranted. (We explore pager load later in this chapter.)

Psychological safety is vital for effective on-call rotations. Since being on-call can be daunting and highly stressful, on-call engineers should be fully supported by a series of procedures and escalation paths that make their lives easier.

On-call usually implies some amount of out-of-hours work. We believe this work should be compensated. While different companies may choose to handle this in different ways, Google offers time-off-in-lieu or cash compensation, capped at some proportion of the overall salary. The compensation scheme provides an incentive for being part of on-call, and ensures that engineers do not take on too many on-call shifts for economic reasons.
Example On-Call Setups Within Google and Outside Google

This section describes real-world examples of on-call setups at Google and at Evernote, a California company that develops a cross-platform app that helps individuals and teams create, assemble, and share information. For each company, we explore the reasoning behind the on-call setup, the general on-call philosophy, and on-call practices.

Google: Forming a New Team

Initial scenario

A few years ago, Sara, an SRE at Google Mountain View, started a new SRE team that needed to be on-call within three months. To put this in perspective, most SRE teams at Google do not expect new hires to be ready for on-call before three to nine months. The new Mountain View SRE team would support three Google Apps services that were previously supported by an SRE team in Kirkland, Washington (a two-hour flight from Mountain View). The Kirkland team had a sister SRE team in London, which would continue to support these services alongside the new Mountain View SRE team and the distributed product development teams.

The new Mountain View SRE team came together quickly, assembling seven people:

- Sara, an SRE tech lead
- Mike, an experienced SRE from another SRE team
- A transfer from a product development team who was new to SRE
- Four new hires ("Nooglers")

Even when a team is mature, going on-call for new services is always challenging, and the new Mountain View SRE team was a relatively junior team. Nonetheless, the new team was able to onboard the services without sacrificing service quality or project velocity. They made immediate improvements to the services, including lowering machine costs by 40%, and fully automating release rollouts with canarying and other safety checks. The new team also continued to deliver reliable services, targeting 99.98% availability, or roughly 26 minutes of downtime per quarter.

How did the new SRE team bootstrap themselves to accomplish so much? Through starter projects, mentoring, and training.
Training roadmap

Although the new SRE team didn't know much about their services, Sara and Mike were familiar with Google's production environment and SRE. As the four Nooglers completed company orientation, Sara and Mike compiled a checklist of two dozen focus areas for people to practice before going on-call, such as:

- Administering production jobs
- Understanding debugging info
- "Draining" traffic away from a cluster
- Rolling back a bad software push
- Blocking or rate-limiting unwanted traffic
- Bringing up additional serving capacity
- Using the monitoring systems (for alerting and dashboards)
- Describing the architecture, various components, and dependencies of the services

The Nooglers found some of this information on their own by researching existing documentation and codelabs (guided, hands-on coding tutorials), and gained understanding of relevant topics by working on their starter projects. When a team member learned about specific topics relevant to the Nooglers' starter projects, that person led a short, impromptu session to share that information with the rest of the team. Sara and Mike covered the remaining topics. The team also held lab sessions to perform common debugging and mitigation tasks, to help everyone build muscle memory and gain confidence in their abilities.

In addition to the checklist, the new SRE team ran a series of "deep dives" to dig into their services. The team browsed monitoring consoles, identified running jobs, and tried debugging recent pages. Sara and Mike explained that an engineer didn't need years of expertise with each of the services to become reasonably proficient. They coached the team to explore a service from first principles, and encouraged the Nooglers to become familiar with the services. They were open about the limits of their knowledge, and taught others when to ask for help.

Throughout the ramp-up, the new SRE team wasn't alone. Sara and Mike traveled to meet the other SRE teams and product developers and learn from them. The new SRE team met with the Kirkland and London teams by holding video conferences, exchanging email, and chatting over IRC. In addition, the team attended weekly production meetings, read daily on-call handoffs and postmortems, and browsed existing service documentation. A Kirkland SRE visited to give talks and answer questions. A London SRE put together a thorough set of disaster scenarios and ran them during Google's disaster recovery training week (see the section "Preparedness and Disaster Testing" in Site Reliability Engineering, Chapter 33).

The team also practiced being on-call through "Wheel of Misfortune" training exercises (see the section "Disaster Role Playing" in Site Reliability Engineering, Chapter 28), in which they role-played recent incidents to practice debugging production problems. During these sessions, all SREs were encouraged to offer suggestions on how to resolve mock production failures. After everyone ramped up, the team still held these sessions, rotating through each team member as the session leader. The team recorded these sessions for future reference.
Before going on-call, the team reviewed precise guidelines about the responsibilities of on-call engineers. For example:

- At the start of each shift, the on-call engineer reads the handoff from the previous shift.
- The on-call engineer minimizes user impact first, then makes sure the issues are fully addressed.
- At the end of the shift, the on-call engineer sends a handoff email to the next engineer on-call.

The guidelines also specified when to escalate to others, and how to write postmortems for large incidents.

Finally, the team read and updated the on-call playbooks. Playbooks contain high-level instructions on how to respond to automated alerts. They explain the severity and impact of the alert, and include debugging suggestions and possible actions to take to mitigate impact and fully resolve the alert. In SRE, whenever an alert is created, a corresponding playbook entry is usually created. These guides reduce stress, the mean time to repair (MTTR), and the risk of human error.
Maintaining Playbooks

Details in playbooks go out of date at the same rate as the production environment changes. For daily releases, playbooks might need an update on any given day. Writing good documentation, like any form of communication, is hard. So how do you maintain playbooks?

Some SREs at Google advocate keeping playbook entries general so they change slowly. For example, they may have just one entry for all "RPC Errors High" alerts, for a trained on-call engineer to read in conjunction with an architecture diagram of the currently alerting service. Other SREs advocate step-by-step playbooks to reduce human variability and drive down MTTR. If your team has conflicting views about playbook content, the playbooks might get pulled in many directions.

This is a contentious topic. If you agree on nothing else, at least decide with your team what minimal, structured details your playbooks must have, and try to notice when your playbooks have accumulated a lot of information beyond those structured details. Pencil in a project to turn new, hard-won production knowledge into automation or monitoring consoles. If your playbooks are a deterministic list of commands that the on-call engineer runs every time a particular alert fires, we recommend implementing automation.
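For example, if the playbook entry for a hypothetical "batch job stuck" alert is literally "drain the job, restart it, verify the queue," that sequence can be encoded directly. The following is a minimal sketch of such automation; the alert name and the jobctl commands are hypothetical placeholders, not Google tooling:

```python
import subprocess

# Hypothetical deterministic playbook steps for one alert, encoded as an
# ordered list of commands. Each step mirrors a line the on-call engineer
# would otherwise copy from the playbook and run by hand.
PLAYBOOK_AUTOMATION = {
    "BatchJobStuck": [
        ["jobctl", "drain", "batch-processor"],
        ["jobctl", "restart", "batch-processor"],
        ["jobctl", "verify-queue", "batch-processor"],
    ],
}

def respond(alert_name: str) -> bool:
    """Runs the automated response for alert_name; returns True on success."""
    steps = PLAYBOOK_AUTOMATION.get(alert_name)
    if steps is None:
        return False  # No automation yet: page a human instead.
    for step in steps:
        result = subprocess.run(step, capture_output=True, text=True)
        if result.returncode != 0:
            # A step failed: stop here and escalate to the on-call
            # engineer, attaching the partial output for debugging.
            print(f"step {step} failed: {result.stderr.strip()}")
            return False
    return True
```

If every step succeeds, no human is paged; if any step fails, the alert escalates with diagnostic context already gathered.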
After two months, Sara, Mike, and the SRE transfer shadowed the on-call shifts of the outgoing Kirkland SRE team. At three months, they became the primary on-call, with the Kirkland SREs as backup. That way, they could easily escalate to the Kirkland SREs if needed. Next, the Nooglers shadowed the more experienced, local SREs and joined the rotation.

Good documentation and the various strategies discussed earlier all helped the team form a solid foundation and ramp up quickly. Although on-call can be stressful, the team's confidence grew enough to take action without second-guessing themselves. There was psychological safety in knowing that their responses were based on the team's collective knowledge, and that even when they escalated, the on-call engineers were still regarded as competent engineers.

Afterword

While the Mountain View SREs were ramping up, they learned that their experienced sister SRE team in London would be moving on to a new project, and that a new team was being formed in Zürich to support the services previously supported by the London SRE team. For this second transition, the same basic approach the Mountain View SREs used proved successful. The previous investment by the Mountain View SREs in developing onboarding and training materials helped the new Zürich SRE team ramp up.

While the approach used by the Mountain View SREs made sense when a cohort of SREs were becoming a team, a more lightweight approach was needed when only one person joined the team at a given time. In anticipation of future turnover, the SREs created service architecture diagrams and formalized the basic training checklist into a series of exercises that could be completed semi-independently, with minimal involvement from a mentor. These exercises included describing the storage layer, performing capacity increases, and reviewing how HTTP requests are routed.
Evernote: Finding Our Feet in the Cloud

Moving our on-prem infrastructure to the cloud

We didn't set out to reengineer our on-call process, but as with many things in life, necessity is the mother of invention. Prior to December 2016, Evernote ran only on on-prem datacenters built to support our monolithic application. Our network and servers were designed with a specific architecture and data flow in mind. This, combined with a host of other constraints, meant we lacked the flexibility needed to support a horizontal architecture. Google Cloud Platform (GCP) provided a concrete solution to our problem. However, we still had one major hurdle to surmount: migrating all our production and supporting infrastructure to GCP. Fast-forward 70 days. Through a Herculean effort and many remarkable feats (for example, moving thousands of servers and 3.5 PB of data), we were happily settled in our new home. At this point, though, our job still wasn't done: how were we going to monitor, alert, and, most importantly, respond to issues in our new environment?

Adjusting our on-call policies and processes

The move to the cloud unleashed the potential for our infrastructure to grow rapidly, but our on-call policies and processes were not yet set up to handle such growth. Once the migration wrapped up, we set out to remedy the problem. In our previous physical datacenter, we built redundancy into nearly every component. This meant that while component failure was common given our size, generally no individual component was capable of negatively impacting users. The infrastructure was very stable because we controlled it; any small bump would inevitably be due to a failure somewhere in the system. Our alerting policies were structured with that in mind: a few dropped packets, resulting in a JDBC (Java Database Connectivity) connection exception, invariably meant that a VM (virtual machine) host was on the verge of failing, or that the control plane on one of our switches was on the fritz. Even before our first day in the cloud, we realized that this type of alert/response system was not tenable going forward. In a world of live migrations and network latency, we needed to take a much more holistic approach to monitoring.

Reframing paging events in terms of first principles, and writing these principles down as our explicit SLOs (service level objectives), helped give the team clarity regarding what was important to alert on and allowed us to trim the fat from our monitoring infrastructure. Our focus on higher-level indicators such as API responsiveness, rather than lower-level infrastructure such as InnoDB row lock waits in MySQL, meant we could focus more time on the real pain our users experienced during an outage. For our team, this meant less time spent chasing transient problems. This translated into more sleep, effectiveness, and, ultimately, job satisfaction.
Restructuring our monitoring and metrics

Our primary on-call rotation is staffed by a small but scrappy team of engineers who are responsible for our production infrastructure and a handful of other business systems (for example, staging and build pipeline infrastructure). We have a weekly, 24/7 schedule with a well-oiled handoff procedure, alongside a morning review of incidents at a daily stand-up. Our small team size and comparatively large scope of responsibility necessitate that we make every effort to keep the process burden light, and focus on closing the alert/triage/remediation/analysis loop as quickly as possible. One of the ways we achieve this is to keep our signal-to-noise ratio high by maintaining simple but effective alerting SLAs (service level agreements). We classify any event generated by our metrics or monitoring infrastructure into three categories (a routing sketch follows the list):
P1: Deal with immediately
- Should be immediately actionable
- Pages the on-call
- Leads to event triage
- Is SLO-impacting

P2: Deal with the next business day
- Generally is not customer-facing, or is very limited in scope
- Sends an email to the team and notifies the event stream channel

P3: Event is informational only
- Information is gathered in dashboards, passive email, and the like
- Includes capacity planning–related information
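To make the three categories concrete, here is a minimal routing sketch. The event fields, category rules, and notifier functions are illustrative assumptions, not Evernote's actual system:

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str
    priority: str  # "P1", "P2", or "P3"

def route(event: Event) -> None:
    """Dispatches a monitoring event according to its priority class."""
    if event.priority == "P1":
        page_oncall(event)           # Deal with immediately; leads to triage.
        open_incident_ticket(event)  # Every P1 gets an incident ticket.
    elif event.priority == "P2":
        email_team(event)            # Deal with the next business day.
        notify_event_stream(event)
        open_incident_ticket(event)  # Every P2 gets a ticket too.
    else:                            # P3: informational only.
        record_to_dashboard(event)   # e.g., capacity-planning data.

# Stub notifiers; a real system would call pager, email, chat, and
# ticketing APIs here.
def page_oncall(e): print(f"PAGE: {e.name}")
def email_team(e): print(f"EMAIL: {e.name}")
def notify_event_stream(e): print(f"CHAT: {e.name}")
def open_incident_ticket(e): print(f"TICKET: {e.name}")
def record_to_dashboard(e): print(f"DASHBOARD: {e.name}")

route(Event("api_error_budget_burn", "P1"))
```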
Any P1 or P2 event has an incident ticket attached to it. The ticket is used for obvious tasks like event triage and tracking remediation actions, as well as for recording SLO impact, number of occurrences, and postmortem doc links, where applicable.

When an event pages (category P1), the on-call is tasked with assessing the impact to users. Incidents are triaged into severities from 1 to 3. For severity 1 (Sev 1) incidents, we maintain a finite set of criteria to make the escalation decision as straightforward as possible for the responder. Once the event is escalated, we assemble an incident team and begin our incident management process. The incident manager is paged, a scribe and a communications lead are selected, and our communication channels open. After the incident is resolved, we conduct an automatic postmortem and share the results far and wide within the company. For events rated Sev 2 or Sev 3, the on-call responder handles the incident lifecycle, including an abbreviated postmortem for incident review.

One of the benefits of keeping our process lightweight is that we can explicitly free the on-call from any expectations of project work. This empowers and encourages the on-call to take immediate follow-up action, and also to identify any major gaps in tooling or process after completing the post-incident review. In this way, we achieve a constant cycle of improvement and flexibility during every on-call shift, keeping pace with the rapid rate of change in our environment. The goal is to make every on-call shift better than the last.

Tracking our performance over time

With the introduction of SLOs, we wanted to track our performance over time and share that information with stakeholders within the company. We implemented a monthly service review meeting, open to anyone who's interested, to review and discuss the previous month of the service. We have also used this forum to review our on-call burden as a barometer of team health, and to discuss remediation actions when we exceed our pager budget. This forum has the dual purpose of spreading the importance of SLOs within the company and keeping the technical organization accountable for maintaining the health and wellness of our service and team.

Engaging with CRE

Expressing our objectives in terms of SLOs provides a basis for engaging with Google's Customer Reliability Engineering (CRE) team. After we discussed our SLOs with CRE to see if they were realistic and measurable, both teams decided that CRE would be paged alongside our own engineers for SLO-impacting events. It can be difficult to pinpoint root causes that are hidden behind layers of cloud abstraction, so having a Googler at our side to take the guesswork out of black-box event triaging was helpful. More importantly, this exercise further reduced our MTTR, which is ultimately what our users care about.

Sustaining a self-perpetuating cycle

Rather than spending all our time in the triage/root-cause analysis/postmortem cycle, we now have more time as a team to think about how to move the business forward. Specifically, this translates into projects such as improving our microservices platform and establishing production readiness criteria for our product development teams. The latter includes many of the principles we followed in restructuring our on-call, which is particularly helpful for teams in their first "carry the pager" rodeo. Thus, we perpetuate the cycle of improving on-call for everyone.
Practical Implementation Details

So far, we've discussed on-call setups both within Google and outside of Google. But what about the specific considerations of being on-call? The following sections discuss these implementation details in more depth:

- Pager load: what it is, how it works, and how to manage it
- How to factor flexibility into on-call scheduling to create a healthier work/life balance for SREs
- Strategies for improving team dynamics, both within a given SRE team and with partner teams

Anatomy of Pager Load

Your pager is noisy and it's making your team unhappy. You've read through Chapter 31 of Site Reliability Engineering, and you run regular production meetings, both with your team and with the developer teams you support. Now everyone knows that your on-call engineers are unhappy. What next?
Pager load is the number of paging incidents that an on-call engineer receives over a typical shift length (such as per day or per week). An incident may involve more than one page. Here, we'll walk through the impact of various factors on pager load, and suggest techniques for minimizing future pager load.
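Measuring pager load is a prerequisite to managing it. A minimal sketch, assuming you can export incident start times from your paging system (the timestamps below are made up):

```python
from collections import Counter
from datetime import datetime

# One timestamp per paging *incident* (several pages for the same
# underlying issue count as one incident).
incident_starts = [
    datetime(2018, 7, 2, 3, 14),
    datetime(2018, 7, 2, 15, 40),
    datetime(2018, 7, 2, 22, 5),
    datetime(2018, 7, 3, 4, 50),
]

PAGER_BUDGET = 2  # Target maximum incidents per shift.

# Pager load: paging incidents per shift. Here a shift is one calendar
# day; substitute your own shift boundaries.
per_shift = Counter(ts.date() for ts in incident_starts)
for shift, count in sorted(per_shift.items()):
    over = "  <-- over budget" if count > PAGER_BUDGET else ""
    print(f"{shift}: {count} incident(s){over}")
```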
Appropriate Response Times

Engineers shouldn't have to be at a computer and working on a problem within minutes of receiving a page unless there is a very good reason to do so. While a complete outage of a customer-facing, revenue-generating service typically requires an immediate response, you can deal with less severe issues (for example, failing backups) within a few hours.

We recommend checking your current paging setup to see if you actually should be paged for everything that currently triggers a page. You may be paging for issues that would be better served by automated repair (it's generally better for a computer to fix a problem than to require a human to fix it) or by a ticket (if the issue is not actually high priority). Table 8-1 shows some sample events and appropriate responses.
Table 8-1. Examples of realistic response times

Incident description | Response time | SRE impact
Revenue-impacting network outage | 5 minutes | SRE needs to be within arm's reach of a charged and authenticated laptop with network access at all times; cannot travel; must heavily coordinate with secondary at all times
Customer order batch processing system stuck | 30 minutes | SRE can leave their home for a quick errand or short commute; secondary does not need to provide coverage during this time
Backups of a database for a pre-launch service are failing | Ticket (response during work hours) | None
Scenario: A team in overload

The (hypothetical) Connection SRE Team, responsible for frontend load balancing and terminating end-user connections, found itself in a position of high pager load. They had an established pager budget of two paging incidents per shift, but for the past year they had regularly been receiving five paging incidents per shift. Analysis revealed that fully one-third of shifts were exceeding their pager budget. Members of the team heroically responded to the daily onslaught of pages but couldn't keep up; there simply was not enough time in the day to find the root causes and properly fix the incoming issues. Some engineers left the team to join less operationally burdened teams. High-quality incident follow-up was rare, since on-call engineers only had time to mitigate immediate problems.

The team's horizon wasn't entirely bleak: they had a mature monitoring infrastructure that followed SRE best practices. Alerting thresholds were set to align with their SLO, and paging alerts were symptom-based in nature, meaning they fired only when customers were impacted. When senior management were approached with all of this information, they agreed that the team was in operational overload and reviewed the project plan to bring the team back to a healthy state.

In less positive news, over time the Connection team had taken ownership of software components from more than 10 developer teams, and had hard dependencies on Google's customer-facing edge and backbone networks. The large number of intergroup relationships was complex and had quietly grown difficult to manage.

Despite the team following best practices in structuring their monitoring, many of the pages they faced were outside their direct control. For example, a black-box probe may have failed due to congestion in the network, causing packet loss. The only action the team could take to mitigate congestion in the backbone was to escalate to the team directly responsible for that network.

On top of their operational burden, the team needed to deliver new features to the frontend systems, which would be used by all Google services. To make matters worse, their infrastructure was being migrated from a 10-year-old legacy framework and cluster management system to a better-supported replacement. The team's services were subject to an unprecedented rate of change, and the changes themselves caused a significant portion of the on-call load.

The team clearly needed to combat this excessive pager load using a variety of techniques. The technical program manager and the people manager of the team approached senior management with a project proposal, which senior management reviewed and approved. The team turned their full attention to reducing their pager load, and learned some valuable lessons along the way.
Pager load inputs

The first step in tackling high pager load is to determine what is causing it. Pager load is influenced by three main factors: bugs in production, alerting, and human processes. Each of these factors has several inputs, some of which we discuss in more detail in this section.

For production:

- The number of existing bugs in production
- The introduction of new bugs into production
- The speed with which newly introduced bugs are identified
- The speed with which bugs are mitigated and removed from production

For alerting:

- The alerting thresholds that trigger a paging alert
- The introduction of new paging alerts
- The alignment of a service's SLO with the SLOs of the services upon which it depends

For human processes:

- The rigor of fixes and follow-up on bugs
- The quality of data collected about paging alerts
- The attention paid to pager load trends
- Human-actuated changes to production
Preexisting bugs

No system is perfect. There will always be bugs in production: in your own code, in the software and libraries that you build upon, or in the interfaces between them. The bugs may not be causing paging alerts right now, but they are definitely present. You can use a few techniques to identify or prevent bugs that haven't yet caused paging alerts (a small fuzzing sketch follows this list):

- Ensure systems are as complicated as they need to be, and no more (see Simplicity).
- Regularly update the software or libraries that your system is built upon to take advantage of bug fixes (however, see the next section about new bugs).
- Perform regular destructive testing or fuzzing (for example, using Netflix's Chaos Monkey).
- Perform regular load testing in addition to integration and unit testing.
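As a taste of the fuzzing technique, even a crude random-input loop can surface latent bugs in input handling before a paging alert does. A minimal sketch, where parse_request is a hypothetical stand-in for any parsing function of yours:

```python
import random
import string

def parse_request(raw: str) -> dict:
    """Stand-in for your real request parser."""
    key, _, value = raw.partition("=")
    if not key:
        raise ValueError("empty key")
    return {key: value}

def fuzz(iterations: int = 10_000) -> None:
    """Feeds random strings to the parser. Any exception other than the
    documented ValueError is a latent bug worth a tracking ticket."""
    for _ in range(iterations):
        raw = "".join(random.choices(string.printable, k=random.randint(0, 64)))
        try:
            parse_request(raw)
        except ValueError:
            pass  # Documented, expected failure mode.
        except Exception as exc:
            print(f"latent bug: input {raw!r} raised {exc!r}")

fuzz()
```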
New bugs

Ideally, the SRE team and its partner developer teams should detect new bugs before they even make it into production. In reality, automated testing misses many bugs, which are then launched to production.

Software testing is a large topic well covered elsewhere (e.g., Martin Fowler on testing). However, software testing techniques are particularly useful in reducing the number of bugs that reach production, and the amount of time they remain in production:

- Improve testing over time. In particular, for each bug you discover in production, ask, "How could we have detected this bug preproduction?" Make sure the necessary engineering follow-up occurs (see "Rigor of follow-up").
- Don't ignore load testing, which is often treated as lower priority than functional testing. Many bugs manifest only under particular load conditions or with a particular mix of requests.
- Run staging (testing with production-like but synthetic traffic) in a production-like environment. We briefly discuss generating synthetic traffic in "Alerting on SLOs" in this book.
- Perform canarying (see "Canarying Releases") in a production environment.
- Have a low tolerance for new bugs. Follow a "detect, roll back, fix, and roll forward" strategy rather than a "detect, continue to roll forward despite identifying the bug, fix, and roll forward again" strategy. (See "Mitigation delay" for more details.) This kind of rollback strategy requires predictable and frequent releases, so that the cost of rolling back any one release is small. We discuss this and related topics in Site Reliability Engineering, in "Release Engineering".
Some bugs may manifest only as the result of changing client behavior. For example:

- Bugs that manifest only under specific levels of load. For example: September back-to-school traffic, Black Friday, Cyber Monday, or that week of the year when Daylight Saving Time means Europe and North America are one hour closer, so more of your users are awake and online simultaneously.
- Bugs that manifest only with a particular mix of requests. For example: servers closer to Asia experiencing a more expensive traffic mix due to language encodings for Asian character sets.
- Bugs that manifest only when users exercise the system in unexpected ways. For example: Calendar being used by an airline reservation system! It is therefore important to expand your testing regimen to cover behaviors that do not occur every day.

When a production system is plagued by several concurrent bugs, it is much more difficult to identify whether a given page is for an existing bug or a new one. Minimizing the number of bugs in production not only reduces pager load, it also makes identifying and classifying new bugs easier. Therefore, it is critical to remove production bugs from your systems as quickly as possible. Prioritize fixing existing bugs above delivering new features; if this requires cross-team collaboration, see "SRE Engagement Model".

Architectural or procedural problems, such as missing automated health checking, self-healing, or load shedding, may need significant engineering work to resolve. Remember, for simplicity's sake we'll consider these problems "bugs," even if their size, their complexity, or the effort required to resolve them is significant.
Chapter 3 of Site Reliability Engineering describes how error budgets are a useful way to manage the rate at which new bugs are released to production. For example, when a service's SLO violations exceed a certain fraction of its total quarterly error budget (typically agreed in advance between the developer and SRE teams), new feature development and feature-related rollouts can be halted temporarily to focus on stabilizing the system and reducing the frequency of pages.
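The mechanics of such a policy fit in a few lines. A sketch, assuming you count total and SLO-violating events per quarter; the 50% freeze threshold here is an arbitrary example, not a Google standard:

```python
def budget_spent(slo: float, total_events: int, bad_events: int) -> float:
    """Fraction of the quarterly error budget consumed so far."""
    budget = (1 - slo) * total_events  # Allowed bad events this quarter.
    return bad_events / budget if budget else float("inf")

FREEZE_THRESHOLD = 0.5  # Example policy, agreed in advance between teams.

spent = budget_spent(slo=0.9995, total_events=2_500_000, bad_events=700)
if spent >= FREEZE_THRESHOLD:
    print(f"{spent:.0%} of error budget spent: halt feature rollouts")
else:
    print(f"{spent:.0%} of error budget spent: releases may proceed")
```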
The Connection team from our example adopted a strict policy requiring every outage to have a tracking bug. This enabled the team's technical program manager to examine the root causes of their new bugs in aggregate. This data revealed that human error was the second most common cause of new bugs in production.

Because humans are error-prone, it's better if all changes made to production systems are made by automation informed by (human-developed) intent configuration. Before you make a change to production, automation can perform additional testing that humans cannot. The Connection team was making complex changes to production semimanually. Not surprisingly, the team's manual changes sometimes went wrong; the team introduced new bugs, which caused pages. Automated systems making the same changes would have determined that the changes were not safe before they entered production and became paging events. The technical program manager took this data to the team and convinced them to prioritize automation projects.
Identification delay

It's important to promptly identify the cause(s) of alerts, because the longer it takes to identify the root cause of a page, the more opportunity it has to recur and page again. For example, given a page that manifests only under high load, say at daily peak, if the problematic code or configuration is not identified before the next daily peak, it is likely that the problem will happen again. There are several techniques you might use to reduce identification delays:

Use good alerts and consoles
Ensure pages link to relevant monitoring consoles, and that consoles highlight where the system is operating out of specification. In the console, correlate black-box and white-box paging alerts together, and do the same with their associated graphs. Make sure playbooks are up to date with advice on responding to each type of alert. On-call engineers should update the playbook with fresh information when the corresponding page fires.

Practice emergency response
Run "Wheel of Misfortune" exercises (described in Site Reliability Engineering) to share general and service-specific debugging techniques with your colleagues.

Perform small releases
If you perform frequent, smaller releases instead of infrequent monolithic changes, correlating bugs with the corresponding change that introduced them is easier. Canarying releases, described in "Canarying Releases", gives a strong signal about whether a new bug is due to a new release.

Log changes
Aggregating change information into a searchable timeline makes it simpler (and hopefully quicker) to correlate new bugs with the corresponding change that introduced them. Tools like the Slack plug-in for Jenkins can be helpful (a minimal timeline sketch follows this list).

Ask for help
In Site Reliability Engineering, "Managing Incidents", we talked about working together to manage large outages. The on-call engineer is never alone; encourage your team to feel safe when asking for help.
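The change timeline mentioned under "Log changes" can start as simply as a structured log that every release, flag flip, and config push appends to, which the on-call engineer queries around an alert's timestamp. A minimal sketch with made-up entries:

```python
from datetime import datetime, timedelta

# Hypothetical change log; in practice, a database, chat channel, or
# monitoring annotation stream populated automatically by release tooling.
changes = [
    {"time": datetime(2018, 7, 2, 14, 0), "what": "frontend release 1.42"},
    {"time": datetime(2018, 7, 2, 14, 55), "what": "enabled flag new_checkout"},
    {"time": datetime(2018, 7, 2, 16, 30), "what": "config push: raised timeouts"},
]

def changes_before(alert_time: datetime, window: timedelta = timedelta(hours=2)):
    """Returns changes in the window before the alert fired: the prime
    suspects for a newly introduced bug."""
    return [c for c in changes if alert_time - window <= c["time"] <= alert_time]

for change in changes_before(datetime(2018, 7, 2, 15, 10)):
    print(change["time"], "-", change["what"])
```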
Mitigation delay

The longer it takes to mitigate a bug once it's identified, the more opportunity it has to recur and page again. Consider these techniques for reducing mitigation delays:

Roll back changes
If the bug was introduced in a recent code or configuration rollout, promptly remove the bug from production with a rollback, if safe and appropriate (a rollback alone may be necessary but is not sufficient if the bug caused data corruption, for example). Remember that even a "quick fix" needs time to be tested, built, and rolled out. Testing is vital to making sure the quick fix actually fixes the bug, and that it doesn't introduce additional bugs or other unintended consequences. Generally, it is better to "roll back, fix, and roll forward" rather than "roll forward, fix, and roll forward again."

If you aim for 99.99% availability, you have approximately 13 minutes of error budget per quarter. The build step of rolling forward may take much longer than 13 minutes, so rolling back impacts your users much less.

(99.999% availability affords an error budget of 80 seconds per quarter. At this point, systems may need self-healing properties, which is out of scope for this chapter.)

If at all possible, avoid changes that can't be rolled back, such as API-incompatible changes and lockstep releases.
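The arithmetic behind these budget figures is worth keeping at hand when weighing rollback against roll-forward. A quick sketch, assuming a 90-day quarter:

```python
# Downtime budget per quarter implied by an availability target,
# assuming a 90-day quarter.
QUARTER_SECONDS = 90 * 24 * 60 * 60

for availability in (0.9998, 0.9999, 0.99999):
    budget_s = (1 - availability) * QUARTER_SECONDS
    print(f"{availability:.3%} -> {budget_s / 60:5.1f} min ({budget_s:6.0f} s) per quarter")

# Output:
#   99.980% ->  25.9 min (  1555 s) per quarter
#   99.990% ->  13.0 min (   778 s) per quarter
#   99.999% ->   1.3 min (    78 s) per quarter
```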
Use feature isolation
Design your system so that if feature X goes wrong, you can disable it via, for example, a feature flag, without affecting feature Y. This strategy also improves release velocity, and makes disabling feature X a much simpler decision: you don't need to check that your product managers are comfortable with also disabling feature Y. (A flag sketch follows this list.)

Drain requests away
Drain requests (i.e., redirect customer requests) away from the elements of your system that exhibit the bug. For example, if the bug is the result of a code or config rollout, and you roll out to production gradually, you may have the opportunity to drain the elements of your infrastructure that have received the update. This allows you to mitigate the customer impact in seconds, rather than rolling back, which may take minutes or longer.
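Feature isolation can be as simple as a flag check around each feature's code path. A minimal sketch; the in-memory dict stands in for whatever dynamic configuration system you use:

```python
# Stand-in for a dynamic flag service. In production these values would
# be re-read at runtime, so flipping a flag requires no new release.
FLAGS = {"feature_x": True, "feature_y": True}

def handle_request(request: dict) -> dict:
    response = {"base": serve_base(request)}
    # Features are isolated: setting FLAGS["feature_x"] = False during
    # an incident disables X while leaving Y untouched.
    if FLAGS["feature_x"]:
        response["feature_x"] = serve_feature_x(request)
    if FLAGS["feature_y"]:
        response["feature_y"] = serve_feature_y(request)
    return response

def serve_base(req): return "core content"
def serve_feature_x(req): return "recommendations"
def serve_feature_y(req): return "notifications"

print(handle_request({}))
```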
Alerting

Google SRE's maximum of two distinct incidents per 12-hour shift encourages us to be thoughtful and cautious about how we configure paging alerts and how we introduce new ones. Site Reliability Engineering, "Monitoring Distributed Systems", describes Google's approach to defining the thresholds for paging alerts. Strictly observing these guidelines is critical to maintaining a healthy on-call rotation.

It is worth highlighting some key elements discussed in that chapter:

- All alerts should be immediately actionable. There should be an action we expect a human to take immediately after they receive the page that the system is unable to take itself. The signal-to-noise ratio should be high to ensure few false positives; a low signal-to-noise ratio raises the risk that on-call engineers develop alert fatigue.
- If a team fully subscribes to SLO-based alerting, or paging only when error budget is burned (see the section "Black-Box Versus White-Box" in Site Reliability Engineering), it is critical that all teams involved in developing and maintaining the service agree about the importance of meeting the SLO and prioritize their work accordingly.
- If a team fully subscribes to SLO-based and symptom-based alerting, relaxing alert thresholds is rarely an appropriate response to being paged.
- Just like new code, new alerts should be thoroughly and thoughtfully reviewed. Each alert should have a corresponding playbook entry.

Receiving a page creates a negative psychological impact. To minimize that impact, only introduce new paging alerts when you really need them. Anyone on the team can write a new alert, but the whole team reviews proposed alert additions and can suggest alternatives. Thoroughly test new alerts in production to vet false positives before they are upgraded to paging alerts. For example, you might email the alert's author when the alert fires, rather than paging the on-call engineer.

New alerts may find problems in production that you weren't aware of. After you address these production bugs, alerting will only page on new bugs, effectively functioning like regression tests.

Be sure to run the new alerts in test mode long enough to experience typical periodic production conditions, such as regular software rollouts, maintenance events by your cloud provider, weekly load peaks, and so on. A week of testing is probably about right. However, the appropriate window depends on the alert and the system.

Finally, use the alert's trigger rate during the testing period to predict the expected consumption of your pager budget as a result of the new alert. Explicitly approve or disallow the new alert as a team. If introducing a new paging alert causes your service to exceed its paging budget, the stability of the system needs additional attention.
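That prediction is simple arithmetic. A sketch, assuming a week of test-mode data for a hypothetical candidate alert and two 12-hour shifts per day:

```python
# Observed during the test period (hypothetical numbers).
test_fires = 6      # Times the candidate alert fired in test mode.
test_days = 7       # Length of the test window.

SHIFTS_PER_DAY = 2      # Two 12-hour shifts per day.
BUDGET_PER_SHIFT = 2    # Target maximum paging incidents per shift.

predicted_per_shift = (test_fires / test_days) / SHIFTS_PER_DAY
share_of_budget = predicted_per_shift / BUDGET_PER_SHIFT
print(f"predicted {predicted_per_shift:.2f} incidents/shift, "
      f"{share_of_budget:.0%} of the pager budget")
# If this one alert alone would eat a large share of the budget, fix the
# underlying instability before letting it page.
```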
Rigor of follow-up

Aim to identify the root cause of every page. "Root causes" extend out of the machine and into the team's processes. Was an outage caused by a bug that would have been caught by a unit test? The root cause might not be a bug in the code, but rather a bug in the team's processes around code review.

If you know the root cause, you can fix it and prevent it from ever bothering you or your colleagues again. If your team cannot figure out the root cause, add monitoring and/or logging that will help you find the root cause of the page the next time it occurs. If you don't have enough information to identify the bug, you can always do something to help debug the page further next time. You should rarely conclude that a page is triggered by "cause unknown." Remember that as an on-call engineer, you are never alone, so ask a colleague to review your findings and see if there's anything you missed. Typically, it's easiest to find the root cause of an alert soon after it has triggered, while fresh evidence is available.

Explaining away a page as "transient," or taking no action because the system "fixed itself" or the bug inexplicably "went away," invites the bug to happen again and cause another page, which causes trouble for the next on-call engineer.

Simply fixing the immediate bug (making a "point" fix) misses a golden opportunity to prevent similar alerts in the future. Use the paging alert as a chance to surface engineering work that improves the system and obviates an entire class of possible future bugs. Do this by filing a project bug in your team's production component, and advocate for prioritizing its implementation by gathering data about how many individual bugs and pages the project would remove. If your proposal will take 3 working weeks, or 120 working hours, to implement, and a page costs on average 4 working hours to properly handle, there's a clear break-even point at 30 pages.

For example, imagine a situation where too many servers share the same failure domain, such as a switch in a datacenter, causing regular, multiple, simultaneous failures:

Point fix
Rebalance your current footprint across more failure domains and stop there.

Systemic fix
Use automation to ensure that this type of server, and all other similar servers, are always spread across sufficient failure domains, and that they rebalance automatically when necessary.

Monitoring (or prevention) fix
Alert preemptively when failure domain diversity is below the expected level, but not yet service-impacting. Ideally, the alert would be a ticket alert, not a page, since it doesn't require an immediate response. The system is still serving happily, albeit at a lower level of redundancy.
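The monitoring (or prevention) fix above might look like the following periodic check; the server inventory, thresholds, and ticketing call are hypothetical stand-ins:

```python
from collections import Counter

# Hypothetical inventory: server -> failure domain (e.g., its switch).
placement = {
    "server-1": "switch-a", "server-2": "switch-a",
    "server-3": "switch-a", "server-4": "switch-b",
}

MIN_DOMAINS = 3   # Expected failure-domain diversity.
MAX_SHARE = 0.5   # No single domain should hold more than half the fleet.

def diversity_findings():
    """Returns ticket-worthy findings; an empty list means healthy."""
    domains = Counter(placement.values())
    findings = []
    if len(domains) < MIN_DOMAINS:
        findings.append(f"only {len(domains)} failure domains (< {MIN_DOMAINS})")
    worst, count = domains.most_common(1)[0]
    if count / len(placement) > MAX_SHARE:
        findings.append(f"{worst} holds {count} of {len(placement)} servers")
    return findings

for finding in diversity_findings():
    print(f"file ticket (not a page): {finding}")  # Not yet user-impacting.
```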
To make sure you're thorough in your follow-up to paging alerts, consider the following questions:

- How can I prevent this specific bug from happening again?
- How can I prevent bugs like this from happening again, both for this system and for other systems I'm responsible for?
- What tests could have prevented this bug from being released to production?
- What ticket alerts could have triggered action to prevent the bug from becoming critical before it paged?
- What informational alerts could have surfaced the bug on a console before it became critical?
- Have I maximized the impact of the fixes I'm making?

Of course, it's not enough for an on-call engineer to just file bugs related to the pages that occur on their shift. It's incredibly important that bugs identified by the SRE team are dealt with swiftly, to reduce the possibility of them recurring. Make sure that resource planning for both the SRE and developer teams considers the effort required to respond to bugs.

We recommend reserving a fraction of SRE and developer team time for responding to production bugs as they arise. For example, a Google on-caller typically doesn't work on projects during their on-call shift. Instead, they work on bugs that improve the health of the system. Make sure that your team routinely prioritizes production bugs above other project work. SRE managers and tech leads should make sure that production bugs are promptly dealt with, and escalate to the developer team's decision makers when necessary.

When a paging event is serious enough to warrant a postmortem, it's even more important to follow this methodology to catalog and track follow-up action items. (See "Postmortem Culture: Learning from Failure" for more details.)
Data quality

Once you identify the bugs in your system that caused pages, a number of questions naturally arise:

- How do you know which bug to fix first?
- How do you know which component in your system caused most of your pages?
- How do you determine what repetitive, manual actions on-call engineers are taking to resolve pages?
- How do you tell how many alerts with unidentified root causes remain?
- How do you tell which bugs are truly, not just anecdotally, the worst?

The answer is simple: collect data!
When building up your data collection processes, you might track and monitor the patterns in on-call load manually, but this effort doesn't scale. It's far more sustainable to file a placeholder bug for each paging alert in your bug tracking system (e.g., Jira, IssueTracker), and for the on-call engineer to create a link between the paging alerts from your monitoring system and the relevant bug in the bug tracking system, as and when they realize that each alert is symptomatic of a preexisting issue. You will end up with a list of as-yet-not-understood bugs in one column, and a list of all the pages each bug is believed to have caused in the next.
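Once each page is linked to a bug, ranking bugs by the pain they cause becomes trivial. A sketch of the resulting data and report; the alert names and bug IDs are made up:

```python
from collections import Counter

# Each paging alert, linked by the on-call engineer to the tracking bug
# it is believed to be a symptom of (None = root cause not yet known).
pages = [
    {"alert": "FrontendErrorBudgetBurn", "bug": "B-101"},
    {"alert": "FrontendErrorBudgetBurn", "bug": "B-101"},
    {"alert": "BackboneCongestion", "bug": "B-205"},
    {"alert": "FrontendErrorBudgetBurn", "bug": "B-101"},
    {"alert": "CacheLatencyHigh", "bug": None},
]

pages_per_bug = Counter(p["bug"] for p in pages if p["bug"] is not None)
unknown = sum(1 for p in pages if p["bug"] is None)

print("pages per bug (consider fixing the top entry first):")
for bug, count in pages_per_bug.most_common():
    print(f"  {bug}: {count}")
print(f"pages with no identified root cause: {unknown}")
```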
Once you have structured data about the causes of the pages, you can begin to analyze that data and produce reports. Those reports can answer questions such as:

- Which bugs cause the most pages? Ideally we'd roll back and fix bugs immediately, but sometimes finding the root cause and deploying the fix takes a long time, and sometimes silencing key alerts isn't a reasonable option. For example, the aforementioned Connection SRE Team might experience ongoing network congestion that isn't immediately resolvable but still needs to be tracked. Collecting data on which production issues are causing the most pages and stress to the team supports data-driven conversations about prioritizing your engineering effort systematically.
- Which component of the system is the cause of most pages (payments gateway, authentication microservice, etc.)?
- When correlated with your other monitoring data, do particular pages correspond to other signals (peaks in request rate, number of concurrent customer sessions, number of signups, number of withdrawals, etc.)?

Tying structured data to bugs and the root causes of your pages has other benefits:

- You can automatically populate a list of existing bugs (that is, known issues), which may be useful for your support team.
- You can automatically prioritize fixing bugs based on the number of pages each bug causes.

The quality of the data you collect will determine the quality of the decisions that either humans or automata can make. To ensure high-quality data, consider the following techniques:

- Define and document your team's policy and expectations on data collection for pages.
- Set up nonpaging alerts from the monitoring system to highlight where pages were not handled according to those expectations. Managers and tech leads should make sure that the expectations are met.
- Teammates should follow up with each other when handoffs don't adhere to expectations. Positive comments such as, "Maybe this could be related to bug 123," "I've filed a bug with your findings so we can follow up in more detail," or "This looks a lot like what happened on my shift last Wednesday: