Google - Site Reliability Engineering


Introduction

Written by Benjamin Treynor Sloss [6]
Edited by Betsy Beyer

Hope is not a strategy.
Traditional SRE saying

It is a truth universally acknowledged that systems do not run themselves. How, then, should a system (particularly a complex computing system that operates at a large scale) be run?

The Sysadmin Approach to Service Management

Historically, companies have employed systems administrators to run complex computing systems.
This systems administrator, or sysadmin, approach involves assembling existing software components and deploying them to work together to produce a service. Sysadmins are then tasked with running the service and responding to events and updates as they occur. As the system grows in complexity and traffic volume, generating a corresponding increase in events and updates, the sysadmin team grows to absorb the additional work. Because the sysadmin role requires a markedly different skill set than that required of a product’s developers, developers and sysadmins are divided into discrete teams: "development" and "operations" or "ops."

The sysadmin model of service management has several advantages. For companies deciding how to run and staff a service, this approach is relatively easy to implement: as a familiar industry paradigm, there are many examples from which to learn and emulate. A relevant talent pool is already widely available. An array of existing tools, software components (off the shelf or otherwise), and integration companies are available to help run those assembled systems, so a novice sysadmin team doesn’t have to reinvent the wheel and design a system from scratch.

The sysadmin approach and the accompanying development/ops split have a number of disadvantages and pitfalls. These fall broadly into two categories: direct costs and indirect costs.

Direct costs are neither subtle nor ambiguous. Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.

The indirect costs of the development/ops split can be subtle, but are often more expensive to the organization than the direct costs. These costs arise from the fact that the two teams are quite different in background, skill set, and incentives. They use different vocabulary to describe situations; they carry different assumptions about both risk and possibilities for technical solutions; they have different assumptions about the target level of product stability. The split between the groups can easily become one of not just incentives, but also communication, goals, and eventually, trust and respect. This outcome is a pathology.

Traditional operations teams and their counterparts in product development thus often end up in conflict, most visibly over how quickly software can be released to production. At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change (a new configuration, a new feature launch, or a new type of user traffic), the two teams’ goals are fundamentally in tension.

Both groups understand that it is unacceptable to state their interests in the baldest possible terms ("We want to launch anything, any time, without hindrance" versus "We won’t want to ever change anything in the system once it works"). And because their vocabulary and risk assumptions differ, both groups often resort to a familiar form of trench warfare to advance their interests. The ops team attempts to safeguard the running system against the risk of change by introducing launch and change gates. For example, launch reviews may contain an explicit check for every problem that has ever caused an outage in the past; that could be an arbitrarily long list, with not all elements providing equal value. The dev team quickly learns how to respond. They have fewer "launches" and more "flag flips," "incremental updates," or "cherrypicks." They adopt tactics such as sharding the product so that fewer features are subject to the launch review.

Google’s Approach to Service Management: Site Reliability Engineering

Conflict isn’t an inevitable part of offering a software service. Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins.

What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team. When I joined Google in 2003 and was tasked with running a "Production Team" of seven engineers, my entire life up to that point had been software engineering. So I designed and managed the group the way I would want it to work if I worked as an SRE myself. That group has since matured to become Google’s present-day SRE team, which remains true to its origins as envisioned by a lifelong software engineer.

A primary building block of Google’s approach to service management is the composition of each SRE team. As a whole, SREs can be broken down into two main categories.

50–60% are Google Software Engineers, or more precisely, people who have been hired via the standard procedure for Google Software Engineers. The other 40–50% are candidates who were very close to the Google Software Engineering qualifications (i.e., 85–99% of the skill set required), and who in addition had a set of technical skills that is useful to SRE but is rare for most software engineers. By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek.

Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems. Within SRE, we track the career progress of both groups closely, and have to date found no practical difference in performance between engineers from the two tracks. In fact, the somewhat diverse background of the SRE team frequently results in clever, high-quality systems that are clearly the product of the synthesis of several skill sets.

The result of our approach to hiring for SRE is that we end up with a team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work, even when the solution is complicated. SREs also end up sharing academic and intellectual background with the rest of the development organization. Therefore, SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.

By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload. Eventually, a traditional ops-focused group scales linearly with service size: if the products supported by the service succeed, the operational load will grow with traffic. That means hiring more people to do the same tasks over and over again.

To avoid this fate, the team tasked with managing a service needs to code or it will drown. Therefore, Google places a 50% cap on the aggregate "ops" work for all SREs: tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable. This cap is an upper bound; over time, left to their own devices, the SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself: we want systems that are automatic, not just automated. In practice, scale and new features keep SREs on their toes.
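
As a rough illustration of the 50% cap described above, the sketch below computes the share of a team's logged hours that counts as ops work and flags a team that exceeds the cap. The category names, the `ops_fraction` and `exceeds_cap` helpers, and the sample hours are all assumptions made for this example; the text does not describe how Google actually records or measures this time.

```python
# Hypothetical categories treated as "ops" work, following the examples in
# the text (tickets, on-call, manual tasks). Real time tracking may differ.
OPS_CATEGORIES = frozenset({"tickets", "on-call", "manual"})


def ops_fraction(hours_by_category: dict[str, float]) -> float:
    """Return the fraction of all logged hours spent on ops categories."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    ops = sum(h for cat, h in hours_by_category.items() if cat in OPS_CATEGORIES)
    return ops / total


def exceeds_cap(hours_by_category: dict[str, float], cap: float = 0.5) -> bool:
    """Return True when ops work takes more than `cap` of the logged time."""
    return ops_fraction(hours_by_category) > cap


# Illustrative numbers only, not data from the source.
team_hours = {"tickets": 10.0, "on-call": 15.0, "project": 15.0}
print(f"ops share: {ops_fraction(team_hours):.1%}")
print("over cap:", exceeds_cap(team_hours))
```

A team flagged this way would, per the text, shift some operational burden back to the development team or gain staff without gaining new operational responsibilities.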

Google’s rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development. So how do we enforce that threshold? In the first place, we have to measure how SRE time is spent. With that measurement in hand, we ensure that the teams consistently spending less than 50% of their time on development work change their practices. Often this means shifting some of the operations burden back to the development team, or adding staff to the team without assigning that team additional operational responsibilities. Consciously maintaining this balance between ops and development work allows us to ensure that SREs have the bandwidth to engage in creative, autonomous engineering, while still retaining the wisdom gleaned from the operations side of running a service.

We’ve found that Google SRE’s approach to running large-scale systems has many advantages. Because SREs are directly modifying code in their pursuit of making Google’s systems run themselves, SRE teams are characterized by both rapid innovation and a large acceptance of change. Such teams are relatively inexpensive: supporting the same service with an ops-oriented team would require a significantly larger number of people. Instead, the number of SREs needed to run, maintain, and improve a system scales sublinearly with the size of the system. Finally, not only does SRE circumvent the dysfunctionality of the dev/ops split, but this structure also improves our product development teams: easy transfers between product development and SRE teams cross-train the entire group, and improve skills of developers who otherwise may have difficulty learning how to build a million-core distributed system.

Despite these net gains, the SRE model is characterized by its own distinct set of challenges. One continual challenge Google faces is hiring SREs: not only does SRE compete for the same candidates as the product development hiring pipeline, but the fact that we set the hiring bar so high in terms of both coding and system engineering skills means that our hiring pool is necessarily small. As our discipline is relatively new and unique, not much industry information exists on how to build and manage an SRE team (although hopefully this book will make strides in that direction!). And once an SRE team is in place, their potentially unorthodox approaches to service management require strong management support. For example, the decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by their management.

DevOps or SRE?

The term “DevOps” emerged in industry in late 2008 and as of this writing (early 2016) is still in a state of flux. Its core principles (involvement of the IT function in each phase of a system’s design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks) are consistent with many of SRE’s principles and practices. One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.

Tenets of SRE

While the nuances of workflows, priorities, and day-to-day operations vary from SRE team to SRE team, all share a set of basic responsibilities for the service(s) they support, and adhere to the same core tenets. In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s). We have codified rules of engagement and principles for how SRE teams interact with their environment: not only the production environment, but also the product development teams, the testing teams, the users, and so on. Those rules and work practices help us to maintain our focus on engineering work, as opposed to operations work.

The following section discusses each of the core tenets of Google SRE.

Ensuring a Durable Focus on Engineering

As already discussed, Google caps operational work for SREs at 50% of their time. Their remaining time should be spent using their coding skills on project work. In practice, this is accomplished by monitoring the amount of operational work being done by SREs, and redirecting excess operational work to the product development teams: reassigning bugs and tickets to development managers, [re]integrating developers into on-call pager rotations, and so on. The redirection ends when the operational load drops back to 50% or lower. This also provides an effective feedback mechanism, guiding developers to build systems that don’t need manual intervention. This approach works well when the entire organization (SRE and development alike) understands why the safety valve mechanism exists, and supports the goal of having no overflow events because the product doesn’t generate enough operational load to require it.

When they are focused on operations work, on average, SREs should receive a maximum of two events per 8–12-hour on-call shift. This target volume gives the on-call engineer enough time to handle the event accurately and quickly, clean up and restore normal service, and then conduct a postmortem. If more than two events occur regularly per on-call shift, problems can’t be investigated thoroughly and engineers are sufficiently overwhelmed to prevent them from learning from these events. A scenario of pager fatigue also won’t improve with scale. Conversely, if on-call SREs consistently receive fewer than one event per shift, keeping them on point is a waste of their time.

Postmortems should be written for all significant incidents, regardless of whether or not they paged; postmortems that did not trigger a page are even more valuable, as they likely point to clear monitoring gaps. This investigation should establish what happened in detail, find all root causes of the event, and assign actions to correct the problem or improve how it is addressed next time. Google operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.

Pursuing Maximum Change Velocity Without Violating a Service’s SLO

Product development and SRE teams can enjoy a productive working relationship by eliminating the structural conflict in their respective goals. The structural conflict is between pace of innovation and product stability, and as described earlier, this conflict often is expressed indirectly. In SRE we bring this conflict to the fore, and then resolve it with the introduction of an error budget.

The error budget stems from the observation that 100% is the wrong reliability target for basically everything (pacemakers and anti-lock brakes being notable exceptions). In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available. There are many other systems in the path between user and service (their laptop, their home WiFi, their ISP, the power grid…) and those systems collectively are far less than 99.999% available. Thus, the marginal difference between 99.999% and 100% gets lost in the noise of other unavailability, and the user receives no benefit from the enormous effort required to add that last 0.001% of availability.

If 100% is the wrong reliability target for a system, what, then, is the right reliability target for the system? This actually isn’t a technical question at all; it’s a product question, which should take the following considerations into account:

- What level of availability will the users be happy with, given how they use the product?
- What alternatives are available to users who are dissatisfied with the product’s availability?
- What happens to users’ usage of the product at different availability levels?

The business or the product must establish the system’s availability target. Once that target is established, the error budget is one minus the availability target. A service that’s 99.99% available is 0.01% unavailable. That permitted 0.01% unavailability is the service’s error budget. We can spend the budget on anything we want, as long as we don’t overspend it.
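
The "one minus the availability target" arithmetic above can be sketched in a few lines. The function names, the 30-day window, and the request counts below are assumptions chosen for illustration; the text defines only the budget itself, not how a service should measure its spending.

```python
def error_budget(availability_target: float) -> float:
    """Permitted unavailability as a fraction: one minus the target."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability target must be in (0, 1]")
    return 1.0 - availability_target


def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Budget expressed as minutes of full downtime over a window.

    The 30-day default window is an assumption for this example.
    """
    return error_budget(availability_target) * window_days * 24 * 60


def budget_remaining(availability_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Failed requests still affordable; a negative value means overspent."""
    return error_budget(availability_target) * total_requests - failed_requests


# A 99.99% target, as in the text, leaves a 0.01% budget.
target = 0.9999
print(f"budget: {error_budget(target):.4%}")
print(f"downtime allowed in 30 days: {allowed_downtime_minutes(target):.2f} min")
print(f"requests left to fail: {budget_remaining(target, 1_000_000, 40):.0f}")
```

Expressing the budget in failed requests rather than minutes is one possible design choice; it lets partial outages count in proportion to the traffic they affect.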

So how do we want to spend the error budget? The development team wants to launch features and attract new users. Ideally, we would spend all of our error budget taking risks with things we launch in order to launch them quickly. This basic premise describes the whole model of error budgets. As soon as SRE activities are conceptualized in this framework, freeing up the error budget through tactics such as phased rollouts and 1% experiments can optimize for quicker launches.

The use of an error budget resolves the structural conflict of incentives between development and SRE. SRE’s goal is no longer "zero outages"; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity. This change makes all the difference. An outage is no longer a "bad" thing: it is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.

Monitoring

Monitoring is one of the primary means by which service owners keep track of a system’s health and availability. As such, monitoring strategy should be constructed thoughtfully. A classic and common approach to monitoring is to watch for a specific value or condition, and then to trigger an email alert when that value is exceeded or that condition occurs. However, this type of email alerting is not an effective solution: a system that requires a human to read an email and decide whether or not some type of action needs to be taken in response is fundamentally flawed. Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.

There are three kinds of valid monitoring output:

- Alerts: Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.
- Tickets: Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.
- Logging: No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
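
The three output kinds above amount to a small decision rule: software does the interpreting, and a human is notified only when action is needed. A minimal sketch of that rule follows, assuming a hypothetical `Signal` record with two boolean fields; neither the type nor the field names come from the text or from any particular monitoring system.

```python
from dataclasses import dataclass
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # recorded for diagnostics; nobody is notified


@dataclass(frozen=True)
class Signal:
    """Hypothetical monitoring signal, already interpreted by software."""
    name: str
    needs_human: bool  # can the system not handle this on its own?
    urgent: bool       # would waiting a few days cause damage?


def classify(signal: Signal) -> Output:
    """Map a signal to exactly one of the three valid outputs."""
    if not signal.needs_human:
        return Output.LOG
    return Output.ALERT if signal.urgent else Output.TICKET


# Illustrative signals only.
for s in (
    Signal("serving errors above SLO", needs_human=True, urgent=True),
    Signal("disk 80% full", needs_human=True, urgent=False),
    Signal("task restarted and recovered", needs_human=False, urgent=False),
):
    print(f"{s.name}: {classify(s).value}")
```

Note that there is no branch that emails a human to ask whether action is needed, which is the pattern the text calls fundamentally flawed.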

Emergency Response

Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR) [Sch15]. The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health, that is, the MTTR.

Humans add latency. Even if a given system experiences more actual failures, a system that can avoid emergencies that require human intervention will have higher availability than a system that requires hands-on intervention. When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a "playbook" produces roughly a 3x improvement in MTTR as compared to the strategy of "winging it." The hero jack-of-all-trades on-call engineer does work, but the practiced on-call engineer armed with a playbook works much better. While no playbook, no matter how comprehensive it may be, is a substitute for smart engineers able to think on the fly, clear and thorough troubleshooting steps and tips are valuable when responding to a high-stakes or time-sensitive page. Thus, Google SRE relies on on-call playbooks, in addition to exercises such as the "Wheel of Misfortune," [7] to prepare engineers to react to on-call events.

Change Management

SRE has found that roughly 70% of outages are due to changes in a live system. Best practices in this domain use automation to accomplish the following:

- Implementing progressive rollouts
- Quickly and accurately detecting problems
- Rolling back changes safely when problems arise

This trio of practices effectively minimizes the aggregate number of users and operations exposed to bad changes. By removing humans from the loop, these practices avoid the normal problems of fatigue, familiarity/contempt, and inattention to highly repetitive tasks. As a result, both release velocity and safety increase.
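
The trio of change management practices above (progressive rollout, automated detection, automated rollback) can be sketched as a single loop. This is only an illustration under stated assumptions: the stage fractions, the `observe_error_rate` callback, and the error-rate threshold are hypothetical, and a real rollout system would observe each stage over time rather than sample it once.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class RolloutResult:
    completed: bool                       # True if every stage stayed healthy
    exposed_fraction: float               # traffic left on the new version
    failed_stage: Optional[float] = None  # stage at which a problem was detected


def progressive_rollout(
    stages: Sequence[float],
    observe_error_rate: Callable[[float], float],
    max_error_rate: float,
) -> RolloutResult:
    """Expose the change to growing traffic fractions, rolling back on trouble.

    `observe_error_rate(fraction)` is a hypothetical hook that returns the
    error rate seen while `fraction` of traffic runs the new version.
    """
    exposed = 0.0
    for fraction in stages:
        if observe_error_rate(fraction) > max_error_rate:
            # Automated rollback: no traffic stays on the bad change.
            return RolloutResult(False, 0.0, fraction)
        exposed = fraction
    return RolloutResult(True, exposed)


# Illustrative run: the change misbehaves once it reaches 10% of traffic.
def flaky(fraction: float) -> float:
    return 0.05 if fraction >= 0.10 else 0.001


print(progressive_rollout([0.01, 0.10, 0.50, 1.0], flaky, max_error_rate=0.01))
```

Because the bad change is caught at the 10% stage, most users never see it, which is the sense in which these practices limit the aggregate exposure to bad changes.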

Demand Forecasting and Capacity Planning

Demand forecasting and capacity planning can be viewed as ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability. There’s nothing particularly special about these concepts, except that a surprising number of services and teams don’t take the steps necessary to ensure that the required capacity is in place by the time it is needed. Capacity planning should take both organic growth (which stems from natural product adoption and usage by customers) and inorganic growth (which results from events like feature launches, marketing campaigns, or other business-driven changes) into account.

Several steps are mandatory in capacity planning:

- An accurate organic demand forecast, which extends beyond the lead time required for acquiring capacity
- An accurate incorporation of inorganic demand sources into the demand forecast
- Regular load testing of the system to correlate raw capacity (servers, disks, and so on) to service capacity

Because capacity is critical to availability, it naturally follows that the SRE team must be in charge of capacity planning, which means they also must be in charge of provisioning.

Provisioning

Provisioning combines both change management and capacity planning. In our experience, provisioning must be conducted quickly and only when necessary, as capacity is expensive. This exercise must also be done correctly or capacity doesn’t work when needed. Adding new capacity often involves spinning up a new instance or location, making significant modification to existing systems (configuration files, load balancers, networking), and validating that the new capacity performs and delivers correct results. Thus, it is a riskier operation than load shifting, which is often done multiple times per hour, and must be treated with a corresponding degree of extra caution.
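
To make the three capacity planning steps above concrete, here is a minimal sketch that combines an organic forecast, an inorganic addition, and a load-tested per-server figure into a server count. The compound-growth model, the function names, and every number are assumptions for illustration; the text does not prescribe a forecasting method.

```python
import math


def forecast_demand(current_qps: float, monthly_growth: float,
                    months_ahead: int, inorganic_qps: float = 0.0) -> float:
    """Projected peak demand in queries per second.

    Organic growth is modeled as simple compounding (an assumption);
    `inorganic_qps` is extra load expected from launches or campaigns.
    """
    organic = current_qps * (1.0 + monthly_growth) ** months_ahead
    return organic + inorganic_qps


def servers_needed(demand_qps: float, qps_per_server: float,
                   spare_servers: int = 1) -> int:
    """Servers required for the demand, plus spares for redundancy.

    `qps_per_server` should come from load testing, which ties raw
    capacity (servers) to service capacity (queries served).
    """
    if qps_per_server <= 0:
        raise ValueError("qps_per_server must be positive")
    return math.ceil(demand_qps / qps_per_server) + spare_servers


# Illustrative numbers only: forecast past a two-month capacity lead time.
demand = forecast_demand(current_qps=1000, monthly_growth=0.10,
                         months_ahead=2, inorganic_qps=500)
print(f"forecast: {demand:.0f} qps")
print("servers:", servers_needed(demand, qps_per_server=100, spare_servers=2))
```

The forecast horizon here is set to the lead time for acquiring capacity, since the text requires the organic forecast to extend beyond that lead time.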

Efficiency and Performance

Efficient use of resources is important any time a service cares about money. Because SRE ultimately controls provisioning, it must also be involved in any work on utilization, as utilization is a function of how a given service works and how it is provisioned. It follows that paying close attention to the provisioning strategy for a service, and therefore its utilization, provides a very, very big lever on the service’s total costs.

Resource use is a function of demand (load), capacity, and software efficiency. SREs predict demand, provision capacity, and can modify the software. These three factors are a large part (though not the entirety) of a service’s efficiency.

Software systems become slower as load is added to them. A slowdown in a service equates to a loss of capacity. At some point, a slowing system stops serving, which corresponds to infinite slowness. SREs provision to meet a capacity target at a specific response speed, and thus are keenly interested in a service’s performance. SREs and product developers will (and should) monitor and modify a service to improve its performance, thus adding capacity and improving efficiency. [8]

The End of the Beginning

Site Reliability Engineering represents a significant break from existing industry best practices for managing large, complicated services. Motivated originally by familiarity ("as a software engineer, this is how I would want to invest my time to accomplish a set of repetitive tasks"), it has become much more: a set of principles, a set of practices, a set of incentives, and a field of endeavor within the larger software engineering discipline. The rest of the book explores the SRE Way in detail.

[6] Vice President, Google Engineering, founder of Google SRE.
[7] See Disaster Role Playing.
[8] For further discussion of how this collaboration can work in practice, see Communications: Production Meetings.


