Google SRE book - Dan Luu

2024-09-21

文章推薦指數： 80 %

投票人數：10人

Chapter 1: Introduction · Chapter 2: The production environment at Google, from the viewpoint of an SRE · Chapter 3: Embracing risk · Chapter 4: Service level ... GoogleSREbookI'mtryingsomeexperimentaltiersonPatreontoseeifIcangettosubstack-likelevelsoffinancialsupportforthisblogwithoutmovingtosubstack!ThebookstartswithastoryaboutatimeMargaretHamiltonbroughtheryoungdaughterwithhertoNASA,backinthedaysoftheApolloprogram.Duringasimulationmission,herdaughtercausedthemissiontocrashbypressingsomekeysthatcausedaprelaunchprogramtorunduringthesimulatedmission.Hamiltonsubmittedachangerequesttoadderrorcheckingcodetopreventtheerrorfromhappeningagain,buttherequestwasrejectedbecausetheerrorcaseshouldneverhappen.Onthenextmission,Apollo8,thatexacterrorconditionoccurredandapotentiallyfatalproblemthatcouldhavebeenpreventedwithatrivialchecktookNASA’sengineers9hourstoresolve.Thissoundsfamiliar--I’velosttrackofthenumberofdevpost-mortemsthathavethesamebasicstructure.Thisisanexperimentinnote-takingformeintwoways.First,Inormallytakepenandpapernotesandthenscantheminforposterity.Second,Inormallydon’tpostmynotesonline,butI’vebeeninspiredtotrythisbyJamieBrandon’snotesonbookshe’sread.Myhandwrittennotesareaseriesofbulletpoints,whichmaynottranslatewellintomarkdown.Oneissueisthatmymarkdownrendererdoesn’thandlemorethanonelevelofnesting,sothingswillgetartificiallyflattened.Thereareprobablymoreissues.Let’sfindoutwhattheyare!Incaseit'snotobvious,asidesfrommeareinitalics.Chapter1:IntroductionEverythinginthischapteriscoveredinmuchmoredetaillater.Twoapproachestohiringpeopletomanagesystemstability:Traditionalapproach:sysadminsAssembleexistingcomponentsanddeploytoproduceaserviceRespondtoeventsandupdatesastheyoccurGrowteamtoabsorbincreasedworkasservicegrowsProsEasytoimplementbecauseit’sstandardLargetalentpooltohirefromLotsofavailablesoftwareConsManualinterventionforchangemanagementandeventhandlingcausessizeofteamtoscalewithloadonsystemOpsisfundamentallyatoddswithdev,whichcancausepathologicalresistancetochanges,whichcausessimilarlypathologicalresponsefromdevs,whichreclassify“launches”as“incrementalupdates”,“flagflips”,etc.Google’sapproach:SREsHavesoftwareengineersdooperationsCandidatesshouldbeabletopassornearlypassnormaldevhiringbar,andmayhavesomeadditionalskillsthatarerareamongdevs(e.g.,L1-L3networkingorUNIXsysteminternals).CareerprogresscomparabletodevcareertrackResultsSREswouldbeboredbydoingtasksbyhandHavetheskillsetnecessarytoautomatetasksDothesameworkasanoperationsteam,butwithautomationinsteadofmanuallaborToavoidmanuallabortrapthatcausesteamsizetoscalewithserviceload,Googleplacesa50%capontheamountof“ops”workforSREsUpperbound.ActualamountofopsworkisexpectedtobemuchlowerProsCheapertoscaleCircumventsdevs/opssplitConsHardtohireforMaybeunorthodoxinwaysthatrequiremanagementsupport(e.g.,productteammaypushbackagainstdecisiontostopreleasesforthequarterbecausetheerrorbudgetisdepleted)Idon’treallyunderstandhowthisisanexampleofcircumventingthedev/opssplit.Icanseehowit’strueinonesense,buttheexampleofstoppingallreleasesbecauseanerrorbudgetgothitdoesn’tseemfundamentallydifferentfromthe“sysadmin”examplewhereteamspushbackagainstlaunches.ItseemsthatSREshavemorepoliticalcapitaltospendandthat,inthespecificexamplesgiven,theSREsmightbemorereasonable,butthere’snoreasontothinkthatsysadminscan’tbereasonable.TenetsofSRESREteamresponsibleforlatency,performance,efficiency,changemanagement,monitoring,emergencyresponse,andcapacityplanningEnsuringadurablefocusonengineering50%opscapmeansthatextraopsworkisredirectedtoproductteamsonoverflowProvidesfeedbackmechanismtoproductteamsaswellaskeepsloaddownTargetmax2eventsper8-12houron-callshiftPostmortemsforallseriousincidents,eveniftheydidn’ttriggerapageBlamelesspostmortems2eventspershiftisthemax,butwhat’stheaverage?Howmanyon-calleventsareexpectedtogetsentfromtheSREteamtothedevteamperweek?Howdoyougetfromablamefulpostmortemculturetoablamelesspostmortemculture?Nowthateveryoneknowsthatyoushouldhaveblamelesspostmortems,everyonewillclaimtodothem.Sortoflikehavinggoodtestinganddeploymentpractices.I’vebeenluckytobeonanoncallrotationthat’snevergottenpaged,butwhenItalktofolkswhojoinedrecentlyandareoncall,theyhavenotsogreatstoriesoffingerpointing,trashtalk,andblameshifting.Thefactthateveryoneknowsyou’resupposedtobeblamelessseemstomakeithardertocalloutblamefulness,noteasier.MovefastwithoutbreakingSLOErrorbudget.100%isthewrongreliabilitytargetforbasicallyeverythingGoingfrom59sto100%reliabilityisn’tnoticeabletomostusersandrequirestremendouseffortSetagoalthatacknowledgesthetrade-offandleavesanerrorbudgetErrorbudgetcanbespentonanything:launchingfeatures,etc.Errorbudgetallowsfordiscussionabouthowphasedrolloutsand1%experimentscanmaintaintolerablelevelsoferrorsGoalofSREteamisn’t“zerooutages”--SREandproductdevsareincentivealignedtospendtheerrorbudgettogetmaximumfeaturevelocityIt’snotexplicitlystated,butforteamsthatneedto“movefast”,consistentlycominginwayundertheerrorbudgetcouldbetakenasasignthattheteamisspendingtoomucheffortonreliability.Ilikethisideaalot,butwhenIdiscussedthiswithJessicaKerr,shepushedbackonthisideabecausemaybeyou’rejustunderyourerrorbudgetbecauseyougotluckyandasinglereallybadeventcanwipeoutyourerrorbudgetforthenextdecade.Followupquestion:howcanyoubeconfidentenoughinyourriskmodelthatyoucanpurposefullyconsumeerrorbudgettomovefasterwithoutworryingthatadownstream(intime)badeventwillputyouoverbudget?NatWelch(aformerGoogleSRE)respondedtothisbysayingthatyoucanbuildconfidencethroughsimulateddisastersandothertesting.MonitoringMonitoringshouldneverrequireahumantointerpretanypartofthealertingdomainThreevalidkindsofmonitoringoutputAlerts:humanneedstotakeactionimmediatelyTickets:humanneedstotakeactioneventuallyLogging:noactionneededNotethat,forexample,graphsareatypeoflogEmergencyResponseReliabilityisafunctionofMTTF(mean-time-to-failure)andMTTR(mean-time-to-recovery)Forevaluatingresponses,wecareaboutMTTRHumansaddlatencySystemsthatdon’trequirehumanstorespondwillhavehigheravailabilityduetolowerMTTRHavinga“playbook”produces3xlowerMTTRHavingherogeneralistswhocanrespondtoeverythingworks,buthavingplaybooksworksbetterIpersonallyagree,butboydowelikeouroncallheros.Iwonderhowwecanfosteracultureofdocumentation.Changemanagement70%ofoutagesduetochangesinalivesystem.Mitigation:ImplementprogressiverolloutsMonitoringRollbackRemovehumansfromtheloop,avoidstandardhumanproblemsonrepetitivetasksDemandforecastingandcapacityplanningStraightforward,butasurprisingnumberofteams/servicesdon’tdoitProvisioningAddingcapacityriskierthanloadshifting,sinceitofteninvolvesspinningupnewinstances/locations,makingsignificantchangestoexistingsystems(configfiles,loadbalancers,etc.)Expensiveenoughthatitshouldbedoneonlywhennecessary;mustbedonequicklyIfyoudon’tknowwhatyouactuallyneedandoverprovisionthatcostsmoneyEfficiencyandperformanceLoadslowsdownsystemsSREsprovisiontomeetcapacitytargetwithaspecificresponsetimegoalEfficiency==moneyChapter2:TheproductionenvironmentatGoogle,fromtheviewpointofanSRENonotesonthischapterbecauseI’malreadyprettyfamiliarwithit.TODO:maybegobackandreadthischapterinmoredetail.Chapter3:EmbracingriskEx:ifauserisonasmartphonewith99%reliability,theycan’ttellthedifferencebetween99.99%and99.999%reliabilityManagingriskReliabilityisn’tlinearincost.Itcaneasilycost100xmoretogetoneadditionalincrementofreliabilityCostassociatedwithredundantequipmentCostofbuildingoutfeaturesforreliabilityasopposedto“normal”featuresGoal:makesystemsreliableenough,butnottooreliable!MeasuringserviceriskStandardpractice:identifymetrictorepresentpropertyofsystemtooptimizePossiblemetric=uptime/(uptime+downtime)Problematicforagloballydistributedservice.Whatdoesuptimereallymean?Aggregateavailability=successfulrequests/totalrequestsObv,notallrequestsareequal,butaggregateavailabilityisanokfirstorderapproximationUsuallysetquarterlytargetsRisktoleranceofservicesUsuallynotobjectivelyobviousSREsworkwithproductownerstotranslatebusinessobjectivesintoexplicitobjectivesIdentifyingrisktoleranceofconsumerservicesTODO:maybereadthisindetailonsecondpassIdentifyingrisktoleranceofinfrastructureservicesTargetavailabilityRunningex:BigtableSomeconsumerservicesservedatadirectlyfromBigtable--needlowlatencyandhighreliabilitySometeamsusebigtableasabackingstoreforofflineanalysis--caremoreaboutthroughputthanreliabilityTooexpensivetomeetallneedsgenericallyEx:BigtableinstanceLow-latencyBigtableuserwantslowqueuedepthThroughputorientedBigtableuserwantsmoderatetohighqueuedepthSuccessandfailurearediametricallyopposedinthesetwocases!CostPartitioninfraandofferdifferentlevelsofserviceInadditiontoobv.benefits,allowsservicetoexternalizethecostofprovidingdifferentlevelsofservice(e.g.,expectlatencyorientedservicetobemoreexpensivethanthroughputorientedservice)MotivationforerrorbudgetsNonotesonthisbecauseIalreadybelieveallofthis.Maybegobackandre-readthisifinvolvedindebateaboutthis.Chapter4:ServicelevelobjectivesNote:skippingnotesonterminologysection.Ex:ChubbyplannedoutagesGooglefoundthatChubbywasconsistentlyoveritsSLO,andthatglobalChubbyoutageswouldcauseunusuallybadoutagesatGoogleChubbywassoreliablethatteamswereincorrectlyassumingthatitwouldneverbedownandfailingtodesignsystemsthataccountforfailuresinChubbySolution:takeChubbydowngloballywhenit’stoofaraboveitsSLOforaquarterto“show”teamsthatChubbycangodownWhatdoyouandyouruserscareabout?Toomanyindicators:hardtopayattentionToofewindicators:mightignoreimportantbehaviorDifferentclassesofservicesshouldhavedifferentindicatorsUser-facing:availability,latency,throughputStorage:latency,availability,durabilityBigdata:throughput,end-to-endlatencyAllsystemscareaboutcorrectnessCollectingindicatorsCanoftendonaturallyfromserver,butclient-sidemetricssometimesneeded.AggregationUsedistributionsandnotaveragesUserstudiesshowthatpeopleusuallyprefersloweraveragewithbettertaillatencyStandardizeoncommondefs,e.g.,averageover1minute,averageovertasksincluster,etc.Canhaveexceptions,buthavingreasonabledefaultsmakesthingseasierChoosingtargetsDon’tpicktargetbasedoncurrentperformanceCurrentperformancemayrequireheroiceffortKeepitsimpleAvoidabsolutesUnreasonabletotalkabout“infinite”scaleor“always”availableMinimizenumberofSLOsPerfectioncanwaitCanalwaysredefineSLOsovertimeSLOssetexpectationsKeepasafetymargin(internalSLOscanbedefinedmorelooselythanexternalSLOs)Don’toverachieveSeeChubbyexample,aboveAnotherexampleismakingsurethatthesystemisn’ttoofastunderlightloadsChapter5:EliminatingtoilCarlaGeisser:"Ifahumanoperatorneedstotouchyoursystemduringnormaloperations,youhaveabug.Thedefinitionofnormalchangesasyoursystemsgrow."Def:ToilNotjust“workIdon’twanttodo”ManualRepetitiveAutomatableTacticalNoenduringvalueO(n)withservicegrowthInsurveys,find33%toilonaverageNumberscanbeaslowas0%andashighas80%Toil>50%isasignthatthemanagershouldspreadtoilloadmoreevenlyIstoilalwaysbad?PredictableandrepetitivetaskscanbecalmingCanproduceasenseofaccomplishment,canbelow-risk/low-stressactivitiesSectiononwhytoilisbad.Skippingnotetakingforthatsection.Chapter6:MonitoringdistributedsystemsWhymonitor?Analyzelong-termtrendsCompareovertimeordoexperimentsAlertingBuildingdashboardsDebuggingAsAlexClemmeriswonttosay,ourproblemisn’tthatwemovetooslowly,it’sthatwebuildthewrongthing.Iwonderhowwecouldgetfromwherewearetodaytohavingenoughinstrumentationtobeabletomakeinformeddecisionswhenbuildingnewsystems.SettingreasonableexpectationsMonitoringisnon-trivial10-12personSREteamtypicallyhas1-2peoplebuildingandmaintainingmonitoringNumberhasdecreasedovertimeduetoimprovementsintooling/libs/centralizedmonitoringinfraGeneraltrendtowardssimpler/fastermonitoringsystems,withbettertoolsforposthocanalysisAvoid“magic”systemsLimitedsuccesswithcomplexdependencyhierarchies(e.g.,“ifDBslow,alertforDB,otherwisealertforwebsite”).Usedmostly(only?)forverystablepartsofsystemRulesthatgeneratealertsforhumansshouldbesimpletounderstandandrepresentaclearfailureAvoidingmagicincludesavoidingML?Lotsofwhite-boxmonitoringSomeblack-boxmonitoringforcriticalstuffFourgoldensignalsLatencyTrafficErrorsSaturationInterestingexamplesfromBigtableandGmailfromchapternottranscribed.Alotofinformationontheimportanceofkeepingalertssimplealsonottranscribed.ThelongrunThere’softenatensionbetweenlong-runandshort-runavailabilityCansometimesfixunreliablesystemsthroughheroiceffort,butthat’saburnoutriskandalsoafailureriskTakingacontrolledhitinshort-termreliabilityisusuallythebettertradeChapter7:EvolutionofautomationatGoogle“Automationisaforcemultiplier,notapanacea”ValueofautomationConsistencyExtensibilityMTTRFasternon-repairactionsTimesavingsMultipleinterestingcasestudiesandexplanationsskippedinnotes.Chapter8:ReleaseengineeringThisisaspecificjobfunctionatGoogleReleaseengineerroleReleaseengineersworkwithSWEsandSREstodefinehowsoftwareisreleasedAllowsdevteamstofocusondevworkDefinebestpracticesCompilerflags,formatsforbuildIDtags,etc.ReleasesautomatedModelsvarybetweenteamsCouldbe“pushongreen”anddeployeverybuildCouldbehourlybuildsanddeploysetc.HermeticbuildsBuildingsamerevnumbershouldalwaysgiveidenticalresultsSelf-contained--thisincludesversioningeverythingdownthecompilerusedCancherry-pickfixesagainstanoldrevtofixproductionsoftwareVirtuallyallchangesrequirecodereviewBranchingAllcodeinmainbranchReleasesarebranchedoffFixescangofrommastertobranchBranchesnevermergedbackTestingCIReleaseprocesscreatesanaudittrailthatrunstestsandshowsthattestspassedConfigmanagementDeceptivelysimple,cancauseinstabilityManypossibleschemes(allinvolvestoringconfiginsourcecontrolandhavingstrictconfigreview)Usemainlineforconfig--configmaintainedatheadandappliedimmediatelyOriginallyusedforBorg(andpre-Borgsystems)Binaryreleasesandconfigchangesdecoupled!IncludeconfigfilesandbinariesinsamepackageSimpleTightlycouplesbinaryandconfig--okforprojectswithfewconfigfilesorwherefewconfigschangePackageconfiginto“configurationpackages”SamehermeticprincipleasforcodeReleaseengineeringshouldn’tbeanafterthought!BudgetresourcesatbeginningofdevcycleChapter9:SimplicityStabilityvs.agilityCanmakethingsstablebyfreezing--needtobalancethetwoReliablesystemscanincreaseagilityReliablerolloutsmakeiteasiertolinkchangestobugsVirtueofboring!Essentialvs.accidentalcomplexitySREsshouldpushbackwhenaccidentalcomplexityisintroducedCodeisaliabilityRemovedeadcodeorotherbloatMinimalAPIsSmallerAPIseasiertotest,morereliableModularityAPIversioningSameascode,whereyou’davoidmisc/utilclassesReleasesSmallreleaseseasiertomeasureCan’ttellwhathappenedifwereleased100changestogetherChapter10:Alteringfromtime-seriesdataBorgmonSimilar-ishtoPrometheusCommondataformatforloggingDatausedforbothdashboardsandalertsFormalizedalegacydataformat,“varz”,whichallowedmetricstobeviewedviaHTTPToviewmetricsmanually,gotohttp://foo:80/varzAddingametriconlyrequiresasingledeclarationincodelowuser-costtoaddnewmetricBorgmonfetches/varzfromeachtargetperiodicallyAlsoincludessyntheticdatalikehealthcheck,ifnamewasresolved,etc.,TimeseriesarenaDatastoredin-memory,withcheckpointingtodiskFixedsizedallocationGCexpiresoldestentrieswhenfullconceptuallya2-darraywithtimeononeaxisanditemsontheotheraxis24bytesforadatapoint->1Muniquetimeseriesfor12hoursat1-minuteintervals=17GBBorgmonrulesAlgebraicexpressionsComputetime-seriesfromothertime-seriesRulesevaluatedinparallelonathreadpoolCountersvs.gaugesDef:countersarenon-decreasingDef:cantakeanyvalueCounterspreferredtogaugesbecausegaugescanloseinformationdependingonsamplingintervalAlteringBorgmonrulescantriggeralertsHaveminimumdurationtoprevent“flapping”Usuallysettotwodurationcyclessothatmissedcollectionsdon’ttriggeranalertScalingBorgmoncantaketime-seriesdatafromotherBorgmon(usesbinarystreamingprotocolinsteadofthetext-basedvarzprotocol)CanhavemultipletiersoffiltersProberBlack-boxmonitoringthatmonitorswhattheuserseesCanbequeriedwithvarzordirectlysendalertstoAltertmanagerConfigurationSeparationbetweendefinitionofrulesandtargetsbeingmonitoredChapter11:Beingon-callTypicalresponsetime5minforuser-facingorothertime-criticaltasks30minforlesstime-sensitivestuffResponsetimeslinkedtoSLOsEx:99.99%foraquarteris13minutesofdowntime;clearlycan’thaveresponsetimeabove13minutesServiceswithlooserSLOscanhaveresponsetimesinthe10sofminutes(ormore?)Primaryvssecondaryon-callWorkdistributionvariesbyteamInsome,secondarycanbebackupforprimaryInothers,secondaryhandlesnon-urgent/non-pagingevents,primaryhandlespagesBalancedon-callDef:quantity:percentoftimeon-callDef:quality:numberofincidentsthatoccurwhileoncallThisisgreat.Weshoulddothis.Peoplesometimesgetreallyroughon-callrotationsafewtimesinarowandconsideringtheinfrequencyofon-callrotationsthere’snoreasontoexpectthatthisshouldrandomlybalanceoutoverthecourseofayearortwo.Balanceinquantity>=50%ofSREtimegoesintoengineeringOfremainder,nomorethan25%spenton-callPrefermulti-siteteamsNightshiftsarebadforhealth,multi-siteteamsalloweliminationofnightshiftsBalanceinqualityOnaverage,dealingwithanincident(inclroot-causeanalysis,remediation,writingpostmortem,fixingbug,etc.)takes6hours.=>shouldn’thavemorethan2incidentsina12-houron-callshiftTostaywithinupperbound,wantveryflatdistributionofpages,withmedianvalueof0Compensation--extrapayforbeingon-call(time-offorcash)Chapter12:EffectivetroubleshootingNonotesforthischapter.Chapter13:EmergencyresponseTest-inducedemergencySREsbreaksystemstoseewhathappensEx:wanttoflushouthiddendependenciesonadistributedMySQLdatabasePlan:blockaccessto1/100ofDBsResponse:dependentservicesreportthatthey’reunabletoaccesskeysystemsSREresponse:SREabortsexercise,triestorollbackpermissionschangeRollbackattemptfailsAttempttorestoreaccesstoreplicasworksNormaloperationrestoredin1hourWhatwentwell:dependentteamsescalatedissuesimmediately,wereabletorestoreaccessWhatwelearned:hadaninsufficientunderstandingofthesystemanditsinteractionwithothersystems,failedtofollowincidentresponsethatwouldhaveinformedcustomersofoutage,hadn’ttestedrollbackproceduresintestenvChange-inducedemergencyChangescancausefailures!Ex:configchangetoabusepreventioninfrapushedonFridaytriggeredcrash-loopbugAlmostallexternallyfacingsystemsdependonthis,becomeunavailableManyinternalsystemsalsohavedependencyandbecomeunavailableAlertsstartfiringwithsecondsWithin5minutesofconfigpush,engineerwhopushedchangerolledbackchangeandservicesstartedrecoveringWhatwentwell:monitoringfiredimmediately,incidentmanagementworkedwell,out-of-bandcommunicationssystemskeptpeopleuptodateeventhoughmanysystemsweredown,luck(engineerwhopushedchangewasfollowingreal-timecommschannels,whichisn’tpartofthereleaseprocedure)Whatwelearned:pushtocanarydidn’ttriggersameissuebecauseitdidn’thitaspecificconfigkeywordcombination;pushwasconsideredlow-riskandwentthroughlessstringentcanaryprocess,alertingwastoonoisyduringoutageProcess-inducedemergencyNonotesonprocess-inducedexample.Chapter14:ManagingincidentsThisisanareawhereweseemtoactuallybeprettygood.Nonotesonthischapter.Chapter15:Postmortemculture:learningfromfailureI'minstrongagreementwithmostofthischapter.Nonotes.Chapter16:TrackingoutagesEscalator:centralizedsystemthattracksACKstoalerts,notifiesotherpeopleifnecessary,etc.Outalator:givestime-interleavedviewofnotificationsformultiplequeuesAlsosavesrelatedemailandallowsmarkingsomemessagesas“important”,cancollapsenon-importantmessages,etc.OurversionofEscalatorseemsfine.WecouldreallyusesomethinglikeOutalator,though.Chapter17:TestingforreliabilityPreachingtothechoir.Nonotesonthissection.Wecouldreallydoalotbetterhere,though.Chapter18:SoftwareengineeringinSREEx:Auxon,capacityplanningautomationtoolBackground:traditionalcapacityplanningcycle1)collectdemandforecasts(quarterstoyearsinadvance)2)Planallocations3)Reviewplan4)DeployandconfigresourcesTraditionalapproachconsManythingscanaffectplan:increaseinefficiency,increaseinadoptionrate,clusterdeliverydateslips,etc.EvensmallchangesrequirerecheckingallocationplanLargechangesmayrequiretotalrewriteofplanLaborintensiveanderrorproneGooglesolution:intent-basedcapacityplanningSpecifyrequirements,notimplementationEncoderequirementsandautogenerateacapacityplanInadditiontosavinglabor,solverscandobetterthanhumangeneratedsolutions=>costsavingsLadderofexamplesofincreasinglyintentbasedplanning1)Want50coresinclustersX,Y,andZ--whythoseresourcesinthoseclusters?2)Want50-corefootprintinany3clustersinregion--whythatmanyresourcesandwhy3?3)WanttomeetdemandwithN+2redundancy--whyN+2?4)Want59sofreliability.Couldfind,forexample,thatN+2isn’tsufficientFoundthatgreatestgainsarefromgoingto(3)Somesophisticatedservicesmaygofor(4)PuttingconstraintsintotoolsallowstradeoffstobeconsistentacrossfleetAsopposedtomakingindividualadhocdecisionsAuxoninputsRequirements(e.g.,“servicemustbeN+2percontinent”,“frontendserversnomorethan50msawayfrombackendservers”DependenciesBudgetprioritiesPerformancedata(howaservicescales)Demandforecastdata(notethatserviceslikeColossushavederivedforecastsfromdependentservices)Resourcesupply&pricingInputsgointosolver(mixed-integerorlinearprogrammingsolver)NonotesonwhySREsoftware,howtospinupagroup,etc.TODO:re-readbackhalfofthischapterandtakenotesifit’severdirectlyrelevantforme.Chapter19:LoadbalancingatthefrontendNonotesonthissection.Seemsprettysimilartowhatwehaveintermsofhigh-levelgoals,andthechapterdoesn’tgointolow-leveldetails.It’snotablethattheydo[redacted]differentlyfromus,though.Formoreinfoonlower-leveldetails,there’stheMaglevpaper.Chapter20:LoadbalancinginthedatacenterFlowcontrolNeedtoavoidunhealthytasksNaiveflowcontrolforunhealthytasksTracknumberofrequeststoabackendTreatbackendasunhealthywhenthresholdisreachedCons:generallyterribleHealth-basedflowcontrolBackendtaskcanbeinoneofthreestates:{healthy,refusingconnections,lameduck}Lameduckstatecanstilltakeconnections,butsendsbackpressurerequesttoallclientsLameduckstatesimplifiescleanshutdownDef:subsetting:limitingpoolofbackendtasksthataclienttaskcaninteractwithClientsinRPCsystemmaintainpoolofconnectionstobackendsUsingpoolreduceslatencycomparedtodoingsetup/teardownwhenneededInactiveconnectionsarerelativelycheap,butnotfree,evenin“inactive”mode(reducedhealthchecks,UDPinsteadofTCP,etc.)ChoosingthecorrectsubsetTyp:20-100,choosebaseonworkloadSubsetselection:randomBadutilizationSubsetselection:roundrobinOrderispermuted;eachroundhasitsownpermutationLoadbalancingSubsetselectionisforconnectionbalancing,butwestillneedtobalanceloadLoadbalancing:roundrobinInpractice,observe2xdifferencebetweenmostloadedandleastloadInpractice,mostexpensiverequestcanbe1000xmoreexpensivethancheapestrequestInaddition,there’srandomunpredictablevariationinrequestsLoadbalancing:least-loadedroundrobinExactlywhatitsoundslike:round-robinamongleastloadedbackendsLoadappearstobemeasuredintermsofconnectioncount;maynotalwaysbethebestmetricThisisperclient,notglobally,soit’spossibletosendrequeststoabackendwithmanyrequestsfromotherclientsInpractice,forlargservices,findthatmost-loadedtaskusestwiceasmuchCPUasleast-loaded;similartonormalroundrobinLoadbalancing:weightedroundrobinSameasabove,butweightwithotherfactorsInpractice,muchbetterloaddistributionthanleast-loadedroundrobinIwonderwhatHerokumeantwhentheyrespondedtoRapGeniusbysaying“afterextensiveresearchandexperimentation,wehaveyettofindeitheratheoreticalmodelorapracticalimplementationthatbeatsthesimplicityandrobustnessofrandomroutingtowebbackendsthatcansupportmultipleconcurrentconnections”.Chapter21:HandlingoverloadEvenwith“good”loadbalancing,systemswillbecomeoverloadedTypicalstrategyistoservedegradedresponses,butunderveryhighloadthatmaynotbepossibleModelingcapacityasQPSorasafunctionofrequests(e.g.,howmanykeystherequestsread)isfailureproneThesegenerallychangeslowly,butcanchangerapidly(e.g.,becauseofasinglecheckin)Bettersolution:measuredirectlyavailableresourcesCPUutilizationisusuallyagoodsignalforprovisioningWithGC,memorypressureturnsintoCPUutilizationWithothersystems,canprovisionotherresourcessuchthatCPUislikelytobelimitingfactorIncaseswhereover-provisioningCPUistooexpensive,takeotherresourcesintoaccountHowmuchdoesitcosttogenerallyover-provisionCPUlikethat?Client-sidethrottlingBackendsstartrejectingrequestswhencustomerhitsquotaRequestsstilluseresources,evenwhenrejected--withoutthrottling,backendscanspendmostoftheirresourcesonrejectingrequestsCriticalitySeemstobeprioritybutwithadifferentname?First-classnotioninRPCsystemClient-sidethrottlingkeepsseparatestatsforeachlevelofcriticalityBydefault,criticalityispropagatedthroughsubsequentRPCsHandlingoverloadederrorsShedloadtootherDCsifDCisoverloadedShedloadtootherbackendsifDCisokbutsomebackendsareoverloadedClientsretrywhentheygetanoverloadedresponsePer-requestretrybudget(3)Per-clientretrybudget(10%)Failedretriesfromclientcause“overloaded;don’tretry”responsetobereturnedupstreamHavinga“don’tretry”responseis“obvious”,butrelativelyrareinpractice.Alotofrealsystemshaveaproblemwithfailedretriescausingmoreretriesupthestack.Thisisespeciallytruewhencrossingahardware/softwareboundary(e.g.,filesystemreadcausesmanyretriesonDVD/SSD/spinningdisk,fails,andthengetsretriedatthefilesystemlevel),butseemstobegenerallytrueinpuresoftwaretoo.Chapter22:AddressingcascadingfailuresTypicalfailurescenarios?ServeroverloadEx:havetwoserversOnegetsoverloaded,failingOtheronenowgetsalltrafficandalsofailsResourceexhaustionCPU/memory/threads/filedescriptors/etc.Ex:dependenciesamongresources1)JavafrontendhaspoorlytunedGCparams2)FrontendrunsoutofCPUduetoGC3)CPUexhaustionslowsdownrequests4)IncreasedqueuedepthusesmoreRAM5)Fixedmemoryallocationforentirefrontendmeansthatlessmemoryisavailableforcaching6)Lowerhitrate7)Morerequestsintobackend8)BackendrunsoutofCPUorthreads9)Healthchecksfail,startingcascadingfailureDifficulttodeterminecauseduringoutageNote:policiesthatavoidserversthatserveerrorscanmakethingsworsefewerbackendsavailable,whichgettoomanyrequests,whichthenbecomeunavailablePreventingserveroverloadLoadtest!MusthaverealisticenvironmentServedegradedresultsFailcheaplyandearlywhenoverloadedHavehigher-levelsystemsrejectrequests(atreverseproxy,loadbalancer,andontasklevel)PerformcapacityplanningQueuemanagementQueuesdonothinginsteadystateQueuedreqsconsumememoryandincreaselatencyIftrafficissteady-ish,bettertokeepsmallqueuesize(say,50%orlessofthreadpoolsize)Ex:GmailusesqueuelessserverswithfailoverwhenthreadsarefullForburstyworkloads,queuesizeshouldbefunctionof#threads,timeperreq,size/freqofburstsSeealso,adaptiveLIFOandCoDelGracefuldegradationNotethatit’simportanttotestgracefuldegradationpath,maybebyrunningasmallsetofserversnearoverloadregularly,sincethispathisrarelyexercisedundernormalcircumstancesBesttokeepsimpleandeasytounderstandRetriesAlwaysuserandomizedexponentialbackoffSeepreviouschapterononlyretryingatasinglelevelConsiderhavingaserver-wideretrybudgetDeadlinesDon’tdoworkwheredeadlinehasbeenmissed(commonthemeforcascadingfailure)Ateachstage,checkthatdeadlinehasn’tbeenhitDeadlinesshouldbepropagated(e.g.,eventhroughRPCs)BimodallatencyEx:problemwithlongdeadlineSayfrontendhas10servers,100threadseach(1kthreadsoftotalcap)Normaloperation:1kQPS,reqstake100ms=>100workerthreadsoccupied(1kQPS*.1s)Say5%ofoperationsdon’tcompleteandthere’sa100sdeadlineThatconsumes5kthreads(50QPS*100s)Frontendoversubscribedby5x.Successrate=1k/(5k+95)=19.6%=>80.4%errorrateUsingdeadlinesinsteadoftimeoutsisgreat.Weshouldreallybemoresystematicaboutthis.Notallowingsystemstofillupwithpointlesszombierequestsbysettingreasonabledeadlinesis“obvious”,butalotofrealsystemsseemtohavearbitrarytimeoutsatniceroundhumannumbers(30s,60s,100s,etc.)insteadofdeadlinesthatareassignedwithload/cascadingfailuresinmind.Trytoavoidintra-layercommunicationSimpler,avoidspossiblecascadingfailurepathsTestingforcascadingfailuresLoadtestcomponents!LoadtestingbothrevealsbreakingandpointferretsoutcomponentsthatwilltotallyfalloverunderloadMakesuretotesteachcomponentseparatelyTestnon-criticalbackends(e.g.,makesurethatspellingsuggestionsforsearchdon’timpedethecriticalpath)ImmediatestepstoaddresscascadingfailuresIncreaseresourcesTemporarilystophealthcheckfailures/deathsRestartservers(onlyifthatwouldhelp--e.g.,inGCdeathspiralordeadlock)Droptraffic--drastic,lastresortEnterdegradedmode--requireshavingbuiltthisintoservicepreviouslyEliminatebatchloadEliminatebadtrafficChapter23:DistributedconsensusforreliabilityHowdoweagreeonquestionslike…Whichprocessistheleaderofagroupofprocesses?Whatisthesetofprocessesinagroup?Hasamessagebeensuccessfullycommittedtoadistributedqueue?Doesaprocessholdaparticularlease?What’sthevalueinadatastoreforaparticularkey?Ex1:split-brainServicehasreplicatedfileserversindifferentracksMustavoidwritingsimultaneouslytobothfileserversinasettoavoiddatacorruptionEachpairoffileservershasoneleader&onefollowerServersmonitoreachotherviaheartbeatsIfoneservercan’tcontacttheother,itsendsaSTONITH(shoottheothernodeinthehead)Butwhathappensifthenetworkissloworpacketsgetdropped?WhathappensifbothserversissueSTONITH?Thisremindsmeofoneofmyfavoritedistributeddatabasepostmortems.Thedatabaseisconfiguredasaring,whereeachnodetalkstoandreplicatesdataintoa“neighborhood”of5servers.Ifsomemachinesintheneighborhoodgodown,otherserversjointheneighborhoodanddatagetsreplicatedappropriately.Soundsgood,butinthecasewhereaservergoesbadanddecidesthatnodataexistsandallofitsneighborsarebad,itcanreturnresultsfasterthananyofitsneighbors,aswellastellitsneighborsthatthey’reallbad.Becausethebadserverhasnodatait’sveryfastandcanreportthatitsneighborsarebadfasterthanitsneighborscanreportthatit’sbad.Whoops!Ex2:failoverrequireshumaninterventionAhighlyshardedDBhasaprimaryforeachshard,whichreplicatestoasecondaryinanotherDCExternalhealthchecksdecideiftheprimaryshouldfailovertoitssecondaryIftheprimarycan’tseethesecondary,itmakesitselfunavailabletoavoidtheproblemsfrom“Ex1”ThisincreasesoperationalloadProblemsarecorrelatedandthisisrelativelylikelytorunintoproblemswhenpeoplearebusywithotherissuesIfthere’sanetworkissues,there’snoreasontothinkthatahumanwillhaveabetterviewintothestateoftheworldthanmachinesinthesystemEx3:faultygroup-membershipalgorithmsWhatitsoundslike.NonotesonthispartImpossibilityresultsCAP:Pisimpossibleinrealnetworks,sochooseCorAFLP:asyncdistributedconsensuscan’tgauranteeprogresswithunreliablenetworkPaxosSequenceofproposals,whichmayormaynotbeacceptedbythemajorityofprocessesNotaccepted=>failsSequencenumberperproposal,mustbeuniqueacrosssystemProposalProposersendsseqnumbertoacceptorsAcceptoragreesifithasn’tseenahigherseqnumberProposerscantryagainwithhigherseqnumberIfproposerrecvsagreementfrommajority,itcommitsbysendingcommitmessagewithvalueAcceptorsmustjournaltopersistentstoragewhentheyacceptPatternsDistributedconsensusalgorithmsarealow-levelprimitiveReliablereplicatedstatemachinesFundamentalbuildingblockfordataconfig/storage,locking,leaderelection,etc.Seethesepapers:Schnieder,Aguilera,Amir&KirschReliablerepliacteddataandconfigstoresNondistributed-consensus-basedsystemsoftenusetimestamps:problematicbecauseclocksynchronycan'tbegauranteedSeeSpannerpaperforanexampleofusingdistributedconsensusLeaderelectionEquivalenttodistributedconsensusWhereworkoftheleadercanperformedperformedbyoneprocessorsharded,leaderelectionpatternallowswritingdistributedsystemasifitwereasimpleprogramUsedby,forexample,GFSandColussusDistributedcoordinationandlockingservicesBarrierused,forexample,inMapReducetomakesurethatMapisfinishedbeforeReduceproceedsDistributedqueuesandmessagingQueues:cantoleratefailuresfromworkernodes,butsystemneedstoensurethatclaimedtasksareprocessedCanuseleasesinsteadofremovalfromqueueUsingRSMmeansthatsystemcancontinueprocessingevenwhenqueuegoesdownPerformanceConventionalwisdomthatconsensusalgorithmscan'tbeusedforhigh-throughputlow-latencysystemsisfalseDistributedconsensusatthecoreofmanyGooglesystemsScalemakesthisworseforGooglethanmostothercompanies,butitstillworksMulti-PaxosStrongleaderprocess:unlessaleaderhasnotyetbeenelectedorafailureoccurs,onlyoneroundtriprequiredtoreachconsensusNotethatanotherprocessinthegroupcanproposeatanytimeCanpingpongbackandforthandpseudo-livelockNotunqiquetomulti-paxos,StandardsolutionsaretoelectaproposerprocessoruserotatingproposerScalingread-heavyworkloadsEx:PhotonallowsreadsfromanyreplicaReadfromstalereplicarequresextrawork,butdoesn'tproducebadincorrectresultsTogauranteereadsareuptodate,dooneofthefollowing:1)Performaread-onlyconsensusoperation2)Readdatafromreplicathat'sguaranteedtobemost-up-to-date(stableleadercanprovidethisguarantee)3)UsequorumleasesQuorumleasesReplicascanbegrantedleaseoversome(orall)datainthesystemFastPaxosDesignedtobefasteroverWANEachclientcansendProposetoeachmemberofagroupofacceptorsdirectly,insteadofthroughaleaderNotnecessarilyfasterthanclassicPaxos--ifRTTtoacceptorsislong,we'vetradedonemessageacrossslowlinkplusNinparallelacrossfastlinkforNacrossslowlinkStableleaders"Almostalldistributedconsensussystemsthathavebeendesignedwithperformanceinminduseeitherthesinglestableleaderpatternorasystemofrotatingleadership"TODO:finishthischapter?Chapter24:DistributedcronTODO:gobackandreadinmoredetail,takenotes.Chapter25:DataprocessingpipelinesExamplesofthisareMapReduceorFlumeConvenientandeasytoreasonaboutthehappycase,butfragileInitialinstallisusuallyokbecauseworkersizing,chunking,parametersarecarefullytunedOvertime,loadchanges,causesproblemsChapter26:DataintegrityDefinitionnotnecessarilyobviousIfaninterfacebugcausesGmailtofailtodisplaymessages,that’sthesameasthedatabeinggonefromtheuser’sstandpoint99.99%uptimemeans1hourofdowntimeperyear.Probablyokformostapps99.99%goodbytesina2GBfilemeans200Kcorrupt.ProbablynotokformostappsBackupisnon-trivialMayhavemixtureoftransactionalandnon-transactionalbackupandrestoreDifferentversionsofbusinesslogicmightbeliveatonceIfservicesareindependentlyversioned,maybehavemanycombinationsofversionsReplicasaren’tsufficient--replicasmaysynccorruptionStudyof19datarecoveryeffortsatGoogleMostcommonuser-visibledatalosscausedbydeletionorlossofreferentialintegrityduetosoftwarebugsHardestcaseswerelow-gradecorruptiondiscoveredweekstomonthslaterDefenseindepthFirstlayer:softdeletionUsersshouldbeabletodeletetheirdataButthatmeansthatuserswillbeabletoaccidentallydeletetheirdataAlso,accounthijacking,etc.AccidentallydeletioncanalsohappenduetobugsSoftdeletiondelaysactualdeletionforsomeperiodoftimeSecondlayer:backupsNeedtofigureouthowmuchdatait’soktoloseduringrecovery,howlongrecoverycantake,andhowfarbackbackupsneedtogoWantbackupstogobackforever,sincecorruptioncangounnoticedformonths(orlonger)ButchangestocodeandschemacanmakerecoveryofolderbackupsexpensiveGoogleusuallyhas30to90daywindow,dependingontheserviceThirdlayer:earlydetectionOut-of-bandintegritychecksHardtodothisright!CorrectchangescancausecheckerstofailButlooseningcheckscancausefailurestogetmissedNonotesonthetwointerestingcasestudiescovered.Chapter27:ReliableproductlaunchesatscaleNonotesonthischapterinparticular.Alotofthismaterialiscoveredbyoratleastimpliedbymaterialinotherchapters.Probablyworthatleastlookingatexamplechecklistitemsandactionitemsbeforethinkingaboutlaunchstrategy,though.AlsoseeappendixE,launchcoordinationchecklist.Chapters28-32:VariouschaptersonmanagementNonotesonthese.NotesonthenotesIlikethisbookalot.Ifyoucareaboutbuildingreliablesystems,readingthroughthisbookandseeingwhattheteamsaroundyoudon’tdoseemslikeagoodexercise.Thatbeingsaid,thebookisn'tperfect.Thetwobigdownsidesformestemfromthesameissue:thisisoneofthosebooksthat'sacollectionofchaptersbydifferentpeople.Someoftheeditorsarebetterthanothers,meaningthatsomeofthechaptersareclearerthanothersandthatbecausethechaptersseemdesignedtobereadableasstandalonechapters,there'safairamountofredundancyinthebookifyoujustreaditstraightthrough.Dependingonhowyouplantousethebook,thatcanbeapositive,butit'sanegativetome.Butevenincludinghedownsides,I'dsaythatthisisthemostvaluabletechnicalbookI'vereadinthepastyearandI'vecoveredprobably20%ofthecontentinthissetofnotes.Ifyoureallylikethesenotes,you'llprobablywanttoreadthefullbook.Ifyoufoundthissetofnoteswaytoodry,maybetrythismuchmoreentertainingsetofnotesonatotallydifferentbook.Ifyoufoundthistoonlybeslightlytoodry,maybetrythissetofnotesonclassesoferrorscommonlyseeninpostmortems.Inanycase,I’dappreciatefeedbackonthesenotes.Writingupnotesisanexperimentforme.Ifpeoplefindtheseuseful,I'lltrytowriteupnotesonbooksIreadmoreoften.Ifnot,Imighttryadifferentapproachtowritingupnotesorsomeotherkindofpostentirely.←SomeprogrammingblogstoconsiderreadingWeonlyhirethetrendiest→ArchiveSupportthissite(patreon)Twitter

請為這篇文章評分？

延伸文章資訊

SRE books - Google - Site Reliability Engineering

Discover Site Reliability Engineering, learn about building and maintaining reliable engineering ...

Site Reliability Engineering: How Google Runs Production ...

Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service p...

Google SRE book - Dan Luu

Chapter 1: Introduction · Chapter 2: The production environment at Google, from the viewpoint of ...

SRE Book (@srebook) / Twitter

SRE Book. @srebook. Documenting the care and feeding of production software systems ... Google's ...

Site Reliability Engineering: How Google Runs Production ...

In this collection of essays and articles, key members of Google's Site Reliability Team explain ...

Google SRE book - Dan Luu

文章推薦指數： 80 %

請為這篇文章評分？

延伸文章資訊

最新文章

相關網站資訊

跆拳道拳法

遊戲裝備英文

跆拳道基本動作

健身房

槓鈴

雪山入山證

排雲山莊

山域嚮導資格檢定辦法

打跆拳道英文

跆拳道英文簡寫

體操英文