Google - Site Reliability Engineering

2024-09-21

Introduction

Written by Benjamin Treynor Sloss [6]
Edited by Betsy Beyer

Hope is not a strategy.
(Traditional SRE saying)

It is a truth universally acknowledged that systems do not run themselves. How, then, should a system (particularly a complex computing system that operates at a large scale) be run?

The Sysadmin Approach to Service Management

Historically, companies have employed systems administrators to run complex computing systems.
This systems administrator, or sysadmin, approach involves assembling existing software components and deploying them to work together to produce a service. Sysadmins are then tasked with running the service and responding to events and updates as they occur. As the system grows in complexity and traffic volume, generating a corresponding increase in events and updates, the sysadmin team grows to absorb the additional work. Because the sysadmin role requires a markedly different skill set than that required of a product’s developers, developers and sysadmins are divided into discrete teams: "development" and "operations" or "ops."

The sysadmin model of service management has several advantages. For companies deciding how to run and staff a service, this approach is relatively easy to implement: as a familiar industry paradigm, there are many examples from which to learn and emulate. A relevant talent pool is already widely available. An array of existing tools, software components (off the shelf or otherwise), and integration companies are available to help run those assembled systems, so a novice sysadmin team doesn’t have to reinvent the wheel and design a system from scratch.

The sysadmin approach and the accompanying development/ops split has a number of disadvantages and pitfalls. These fall broadly into two categories: direct costs and indirect costs.

Direct costs are neither subtle nor ambiguous. Running a service with a team that relies on manual intervention for both change management and event handling becomes expensive as the service and/or traffic to the service grows, because the size of the team necessarily scales with the load generated by the system.

The indirect costs of the development/ops split can be subtle, but are often more expensive to the organization than the direct costs. These costs arise from the fact that the two teams are quite different in background, skill set, and incentives. They use different vocabulary to describe situations; they carry different assumptions about both risk and possibilities for technical solutions; they have different assumptions about the target level of product stability. The split between the groups can easily become one of not just incentives, but also communication, goals, and eventually, trust and respect. This outcome is a pathology.
Traditional operations teams and their counterparts in product development thus often end up in conflict, most visibly over how quickly software can be released to production. At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change (a new configuration, a new feature launch, or a new type of user traffic), the two teams’ goals are fundamentally in tension.

Both groups understand that it is unacceptable to state their interests in the baldest possible terms ("We want to launch anything, any time, without hindrance" versus "We won’t want to ever change anything in the system once it works"). And because their vocabulary and risk assumptions differ, both groups often resort to a familiar form of trench warfare to advance their interests. The ops team attempts to safeguard the running system against the risk of change by introducing launch and change gates. For example, launch reviews may contain an explicit check for every problem that has ever caused an outage in the past; that could be an arbitrarily long list, with not all elements providing equal value. The dev team quickly learns how to respond. They have fewer "launches" and more "flag flips," "incremental updates," or "cherrypicks." They adopt tactics such as sharding the product so that fewer features are subject to the launch review.

Google’s Approach to Service Management: Site Reliability Engineering

Conflict isn’t an inevitable part of offering a software service. Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins.
What exactly is Site Reliability Engineering, as it has come to be defined at Google? My explanation is simple: SRE is what happens when you ask a software engineer to design an operations team. When I joined Google in 2003 and was tasked with running a "Production Team" of seven engineers, my entire life up to that point had been software engineering. So I designed and managed the group the way I would want it to work if I worked as an SRE myself. That group has since matured to become Google’s present-day SRE team, which remains true to its origins as envisioned by a lifelong software engineer.

A primary building block of Google’s approach to service management is the composition of each SRE team. As a whole, SREs can be broken down into two main categories.

50–60% are Google Software Engineers, or more precisely, people who have been hired via the standard procedure for Google Software Engineers. The other 40–50% are candidates who were very close to the Google Software Engineering qualifications (i.e., 85–99% of the skill set required), and who in addition had a set of technical skills that is useful to SRE but is rare for most software engineers. By far, UNIX system internals and networking (Layer 1 to Layer 3) expertise are the two most common types of alternate technical skills we seek.

Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems. Within SRE, we track the career progress of both groups closely, and have to date found no practical difference in performance between engineers from the two tracks. In fact, the somewhat diverse background of the SRE team frequently results in clever, high-quality systems that are clearly the product of the synthesis of several skill sets.
The result of our approach to hiring for SRE is that we end up with a team of people who (a) will quickly become bored by performing tasks by hand, and (b) have the skill set necessary to write software to replace their previously manual work, even when the solution is complicated. SREs also end up sharing academic and intellectual background with the rest of the development organization. Therefore, SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.

By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload. Eventually, a traditional ops-focused group scales linearly with service size: if the products supported by the service succeed, the operational load will grow with traffic. That means hiring more people to do the same tasks over and over again.

To avoid this fate, the team tasked with managing a service needs to code or it will drown. Therefore, Google places a 50% cap on the aggregate "ops" work for all SREs: tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable. This cap is an upper bound; over time, left to their own devices, the SRE team should end up with very little operational load and almost entirely engage in development tasks, because the service basically runs and repairs itself: we want systems that are automatic, not just automated. In practice, scale and new features keep SREs on their toes.
Google’s rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development. So how do we enforce that threshold? In the first place, we have to measure how SRE time is spent. With that measurement in hand, we ensure that the teams consistently spending less than 50% of their time on development work change their practices. Often this means shifting some of the operations burden back to the development team, or adding staff to the team without assigning that team additional operational responsibilities. Consciously maintaining this balance between ops and development work allows us to ensure that SREs have the bandwidth to engage in creative, autonomous engineering, while still retaining the wisdom gleaned from the operations side of running a service.

We’ve found that Google SRE’s approach to running large-scale systems has many advantages. Because SREs are directly modifying code in their pursuit of making Google’s systems run themselves, SRE teams are characterized by both rapid innovation and a large acceptance of change. Such teams are relatively inexpensive: supporting the same service with an ops-oriented team would require a significantly larger number of people. Instead, the number of SREs needed to run, maintain, and improve a system scales sublinearly with the size of the system. Finally, not only does SRE circumvent the dysfunctionality of the dev/ops split, but this structure also improves our product development teams: easy transfers between product development and SRE teams cross-train the entire group, and improve skills of developers who otherwise may have difficulty learning how to build a million-core distributed system.
Despite these net gains, the SRE model is characterized by its own distinct set of challenges. One continual challenge Google faces is hiring SREs: not only does SRE compete for the same candidates as the product development hiring pipeline, but the fact that we set the hiring bar so high in terms of both coding and system engineering skills means that our hiring pool is necessarily small. As our discipline is relatively new and unique, not much industry information exists on how to build and manage an SRE team (although hopefully this book will make strides in that direction!). And once an SRE team is in place, their potentially unorthodox approaches to service management require strong management support. For example, the decision to stop releases for the remainder of the quarter once an error budget is depleted might not be embraced by a product development team unless mandated by their management.

DevOps or SRE?

The term "DevOps" emerged in industry in late 2008 and as of this writing (early 2016) is still in a state of flux. Its core principles (involvement of the IT function in each phase of a system’s design and development, heavy reliance on automation versus human effort, the application of engineering practices and tools to operations tasks) are consistent with many of SRE’s principles and practices. One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.

Tenets of SRE

While the nuances of workflows, priorities, and day-to-day operations vary from SRE team to SRE team, all share a set of basic responsibilities for the service(s) they support, and adhere to the same core tenets. In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s). We have codified rules of engagement and principles for how SRE teams interact with their environment: not only the production environment, but also the product development teams, the testing teams, the users, and so on. Those rules and work practices help us to maintain our focus on engineering work, as opposed to operations work.
The following section discusses each of the core tenets of Google SRE.

Ensuring a Durable Focus on Engineering

As already discussed, Google caps operational work for SREs at 50% of their time. Their remaining time should be spent using their coding skills on project work. In practice, this is accomplished by monitoring the amount of operational work being done by SREs, and redirecting excess operational work to the product development teams: reassigning bugs and tickets to development managers, [re]integrating developers into on-call pager rotations, and so on. The redirection ends when the operational load drops back to 50% or lower. This also provides an effective feedback mechanism, guiding developers to build systems that don’t need manual intervention. This approach works well when the entire organization (SRE and development alike) understands why the safety valve mechanism exists, and supports the goal of having no overflow events because the product doesn’t generate enough operational load to require it.

When they are focused on operations work, on average, SREs should receive a maximum of two events per 8–12-hour on-call shift. This target volume gives the on-call engineer enough time to handle the event accurately and quickly, clean up and restore normal service, and then conduct a postmortem. If more than two events occur regularly per on-call shift, problems can’t be investigated thoroughly and engineers are sufficiently overwhelmed to prevent them from learning from these events. A scenario of pager fatigue also won’t improve with scale. Conversely, if on-call SREs consistently receive fewer than one event per shift, keeping them on point is a waste of their time.

Postmortems should be written for all significant incidents, regardless of whether or not they paged; postmortems that did not trigger a page are even more valuable, as they likely point to clear monitoring gaps. This investigation should establish what happened in detail, find all root causes of the event, and assign actions to correct the problem or improve how it is addressed next time. Google operates under a blame-free postmortem culture, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.
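The two numeric targets in this section (the 50% cap on operational work and the one-to-two events per on-call shift) lend themselves to a simple periodic check. The following is a minimal sketch, not Google's tooling: the `TeamPeriod` record, the field names, and the wording of the findings are all hypothetical, and only the thresholds come from the text.

```python
from dataclasses import dataclass
from typing import List

OPS_CAP = 0.50             # cap on operational work, as stated in the text
MAX_EVENTS_PER_SHIFT = 2   # upper target per on-call shift, as stated in the text
MIN_EVENTS_PER_SHIFT = 1   # below this, on-call time is considered wasted


@dataclass
class TeamPeriod:
    """Hypothetical record of how one SRE team spent a reporting period."""
    ops_hours: float             # tickets, on-call, manual tasks
    project_hours: float         # engineering / development work
    events_per_shift: List[int]  # events handled in each on-call shift


def ops_fraction(period: TeamPeriod) -> float:
    """Share of tracked time that went to operational work."""
    total = period.ops_hours + period.project_hours
    return period.ops_hours / total if total > 0 else 0.0


def review(period: TeamPeriod) -> List[str]:
    """Return findings; an empty list means the team is within both targets."""
    findings = []
    if ops_fraction(period) > OPS_CAP:
        findings.append("ops load above 50%: redirect excess work to development")
    if period.events_per_shift:
        mean_events = sum(period.events_per_shift) / len(period.events_per_shift)
        if mean_events > MAX_EVENTS_PER_SHIFT:
            findings.append("more than two events per shift on average: pager load too high")
        elif mean_events < MIN_EVENTS_PER_SHIFT:
            findings.append("fewer than one event per shift on average: on-call time underused")
    return findings
```

A team that logged 600 ops hours against 400 project hours, with shifts of 3, 4, and 2 events, would get both the ops-load and the pager-load finding. How hours are measured in practice is left open here, as it is in the text.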
Pursuing Maximum Change Velocity Without Violating a Service’s SLO

Product development and SRE teams can enjoy a productive working relationship by eliminating the structural conflict in their respective goals. The structural conflict is between pace of innovation and product stability, and as described earlier, this conflict often is expressed indirectly. In SRE we bring this conflict to the fore, and then resolve it with the introduction of an error budget.

The error budget stems from the observation that 100% is the wrong reliability target for basically everything (pacemakers and anti-lock brakes being notable exceptions). In general, for any software service or system, 100% is not the right reliability target because no user can tell the difference between a system being 100% available and 99.999% available. There are many other systems in the path between user and service (their laptop, their home WiFi, their ISP, the power grid…) and those systems collectively are far less than 99.999% available. Thus, the marginal difference between 99.999% and 100% gets lost in the noise of other unavailability, and the user receives no benefit from the enormous effort required to add that last 0.001% of availability.

If 100% is the wrong reliability target for a system, what, then, is the right reliability target for the system? This actually isn’t a technical question at all; it’s a product question, which should take the following considerations into account:

- What level of availability will the users be happy with, given how they use the product?
- What alternatives are available to users who are dissatisfied with the product’s availability?
- What happens to users’ usage of the product at different availability levels?

The business or the product must establish the system’s availability target. Once that target is established, the error budget is one minus the availability target. A service that’s 99.99% available is 0.01% unavailable. That permitted 0.01% unavailability is the service’s error budget. We can spend the budget on anything we want, as long as we don’t overspend it.
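The arithmetic behind "one minus the availability target" can be made concrete. The sketch below is illustrative only: the function names, the 30-day window, and the request-based accounting are assumptions introduced here, not definitions from the text.

```python
def error_budget(availability_target: float) -> float:
    """Allowed unavailability as a fraction: one minus the availability target."""
    if not 0.0 < availability_target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return 1.0 - availability_target


def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Time-based view: minutes of full outage the budget permits in the window.

    The 30-day default window is an assumption made for illustration.
    """
    return error_budget(availability_target) * window_days * 24 * 60


def budget_remaining(availability_target: float,
                     total_requests: int,
                     failed_requests: int) -> float:
    """Request-based view: fraction of the budget still unspent.

    A negative result means the budget has been overspent.
    """
    allowed_failures = error_budget(availability_target) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

For the 99.99% example in the text, `allowed_downtime_minutes(0.9999)` comes to roughly 4.32 minutes per 30 days, and a service that failed 40 of 1,000,000 requests would have about 60% of its budget left. Results are floating-point approximations, so comparisons should allow a small tolerance.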
So how do we want to spend the error budget? The development team wants to launch features and attract new users. Ideally, we would spend all of our error budget taking risks with things we launch in order to launch them quickly. This basic premise describes the whole model of error budgets. As soon as SRE activities are conceptualized in this framework, freeing up the error budget through tactics such as phased rollouts and 1% experiments can optimize for quicker launches.

The use of an error budget resolves the structural conflict of incentives between development and SRE. SRE’s goal is no longer "zero outages"; rather, SREs and product developers aim to spend the error budget getting maximum feature velocity. This change makes all the difference. An outage is no longer a "bad" thing; it is an expected part of the process of innovation, and an occurrence that both development and SRE teams manage rather than fear.

Monitoring

Monitoring is one of the primary means by which service owners keep track of a system’s health and availability. As such, monitoring strategy should be constructed thoughtfully. A classic and common approach to monitoring is to watch for a specific value or condition, and then to trigger an email alert when that value is exceeded or that condition occurs. However, this type of email alerting is not an effective solution: a system that requires a human to read an email and decide whether or not some type of action needs to be taken in response is fundamentally flawed. Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.

There are three kinds of valid monitoring output:

Alerts
    Signify that a human needs to take action immediately in response to something that is either happening or about to happen, in order to improve the situation.

Tickets
    Signify that a human needs to take action, but not immediately. The system cannot automatically handle the situation, but if a human takes action in a few days, no damage will result.

Logging
    No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
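The three output kinds above amount to a routing decision that software, not a human, should make. Here is a minimal sketch of that decision; the `Signal` record, its fields, and the one-hour urgency threshold are hypothetical choices made for illustration and do not come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # recorded for diagnostics; nobody is notified


URGENT_WINDOW_HOURS = 1.0  # assumed cutoff between "immediately" and "in a few days"


@dataclass
class Signal:
    """Hypothetical monitoring signal, already interpreted by software."""
    name: str
    needs_human: bool          # False if the system handles the situation itself
    hours_until_damage: float  # time left before harm results if nobody acts


def route(signal: Signal) -> Output:
    """Pick exactly one of the three valid monitoring outputs."""
    if not signal.needs_human:
        return Output.LOG
    if signal.hours_until_damage <= URGENT_WINDOW_HOURS:
        return Output.ALERT
    return Output.TICKET
```

Under these assumptions, a replica that restarted itself is logged, a disk projected to fill in three days becomes a ticket, and a frontend failing requests right now pages someone. Note that there is no "send an email and let someone decide" branch, which is the point the text makes.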
Emergency Response

Reliability is a function of mean time to failure (MTTF) and mean time to repair (MTTR) [Sch15]. The most relevant metric in evaluating the effectiveness of emergency response is how quickly the response team can bring the system back to health, that is, the MTTR.

Humans add latency. Even if a given system experiences more actual failures, a system that can avoid emergencies that require human intervention will have higher availability than a system that requires hands-on intervention. When humans are necessary, we have found that thinking through and recording the best practices ahead of time in a "playbook" produces roughly a 3x improvement in MTTR as compared to the strategy of "winging it." The hero jack-of-all-trades on-call engineer does work, but the practiced on-call engineer armed with a playbook works much better. While no playbook, no matter how comprehensive it may be, is a substitute for smart engineers able to think on the fly, clear and thorough troubleshooting steps and tips are valuable when responding to a high-stakes or time-sensitive page. Thus, Google SRE relies on on-call playbooks, in addition to exercises such as the "Wheel of Misfortune," [7] to prepare engineers to react to on-call events.

Change Management

SRE has found that roughly 70% of outages are due to changes in a live system. Best practices in this domain use automation to accomplish the following:

- Implementing progressive rollouts
- Quickly and accurately detecting problems
- Rolling back changes safely when problems arise

This trio of practices effectively minimizes the aggregate number of users and operations exposed to bad changes. By removing humans from the loop, these practices avoid the normal problems of fatigue, familiarity/contempt, and inattention to highly repetitive tasks. As a result, both release velocity and safety increase.
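The trio of change-management practices (progressive rollout, fast detection, safe rollback) can be sketched as a single loop. This is a simulation under stated assumptions rather than a real deployment system: the stage fractions, the `error_rate_at` probe, and the `RolloutResult` record are all hypothetical stand-ins for whatever a production pipeline would actually use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RolloutResult:
    completed: bool       # True if every stage passed its health check
    max_exposure: float   # largest fraction of users that ever saw the change
    final_exposure: float # fraction on the new version when the loop ends


def progressive_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float,
) -> RolloutResult:
    """Widen exposure stage by stage; roll back at the first unhealthy stage.

    `error_rate_at(fraction)` is a hypothetical probe standing in for real
    monitoring: it reports the observed error rate at that exposure level.
    """
    max_exposure = 0.0
    for fraction in stages:
        max_exposure = max(max_exposure, fraction)  # change is live for this slice
        if error_rate_at(fraction) > max_error_rate:
            # Problem detected: roll back, so nobody stays on the bad version.
            return RolloutResult(False, max_exposure, 0.0)
    final = stages[-1] if stages else 0.0
    return RolloutResult(True, max_exposure, final)
```

With stages of 1%, 10%, 50%, and 100%, a change that only misbehaves at 10% exposure or more is caught at the second stage: at most 10% of users ever see it, and the rollback leaves none on the new version. No human sits in the loop, which matches the motivation given in the text.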
Demand Forecasting and Capacity Planning

Demand forecasting and capacity planning can be viewed as ensuring that there is sufficient capacity and redundancy to serve projected future demand with the required availability. There’s nothing particularly special about these concepts, except that a surprising number of services and teams don’t take the steps necessary to ensure that the required capacity is in place by the time it is needed. Capacity planning should take both organic growth (which stems from natural product adoption and usage by customers) and inorganic growth (which results from events like feature launches, marketing campaigns, or other business-driven changes) into account.

Several steps are mandatory in capacity planning:

- An accurate organic demand forecast, which extends beyond the lead time required for acquiring capacity
- An accurate incorporation of inorganic demand sources into the demand forecast
- Regular load testing of the system to correlate raw capacity (servers, disks, and so on) to service capacity

Because capacity is critical to availability, it naturally follows that the SRE team must be in charge of capacity planning, which means they also must be in charge of provisioning.

Provisioning

Provisioning combines both change management and capacity planning. In our experience, provisioning must be conducted quickly and only when necessary, as capacity is expensive. This exercise must also be done correctly or capacity doesn’t work when needed. Adding new capacity often involves spinning up a new instance or location, making significant modification to existing systems (configuration files, load balancers, networking), and validating that the new capacity performs and delivers correct results. Thus, it is a riskier operation than load shifting, which is often done multiple times per hour, and must be treated with a corresponding degree of extra caution.
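The three mandatory capacity-planning steps listed earlier in this section (an organic forecast, inorganic additions, and load-tested capacity per unit) combine into a rough sizing calculation. The sketch below assumes compound monthly growth and a fixed number of spare servers for redundancy; both are modeling choices made here for illustration, and all names are hypothetical.

```python
import math


def forecast_demand(current_qps: float,
                    monthly_organic_growth: float,
                    months_ahead: int,
                    inorganic_qps: float = 0.0) -> float:
    """Projected peak demand in queries per second.

    Organic growth is modeled as compounding monthly (an assumption);
    inorganic demand (launches, campaigns) is added on top as a lump sum.
    `months_ahead` should exceed the lead time for acquiring capacity.
    """
    organic = current_qps * (1.0 + monthly_organic_growth) ** months_ahead
    return organic + inorganic_qps


def servers_needed(demand_qps: float,
                   qps_per_server: float,
                   spare_servers: int = 1) -> int:
    """Servers required to serve the demand, plus spares for redundancy.

    `qps_per_server` is meant to come from regular load testing, which is
    what ties raw capacity (servers) to service capacity (QPS).
    """
    if qps_per_server <= 0:
        raise ValueError("qps_per_server must be positive")
    return math.ceil(demand_qps / qps_per_server) + spare_servers
```

For example, 1,000 QPS growing 10% per month for two months, plus 200 QPS expected from a launch, gives about 1,410 QPS; at a load-tested 100 QPS per server with two spares, that works out to 17 servers. A real forecast would carry uncertainty bounds, which this sketch omits.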
Efficiency and Performance

Efficient use of resources is important any time a service cares about money. Because SRE ultimately controls provisioning, it must also be involved in any work on utilization, as utilization is a function of how a given service works and how it is provisioned. It follows that paying close attention to the provisioning strategy for a service, and therefore its utilization, provides a very, very big lever on the service’s total costs.

Resource use is a function of demand (load), capacity, and software efficiency. SREs predict demand, provision capacity, and can modify the software. These three factors are a large part (though not the entirety) of a service’s efficiency.

Software systems become slower as load is added to them. A slowdown in a service equates to a loss of capacity. At some point, a slowing system stops serving, which corresponds to infinite slowness. SREs provision to meet a capacity target at a specific response speed, and thus are keenly interested in a service’s performance. SREs and product developers will (and should) monitor and modify a service to improve its performance, thus adding capacity and improving efficiency. [8]

The End of the Beginning

Site Reliability Engineering represents a significant break from existing industry best practices for managing large, complicated services. Motivated originally by familiarity ("as a software engineer, this is how I would want to invest my time to accomplish a set of repetitive tasks"), it has become much more: a set of principles, a set of practices, a set of incentives, and a field of endeavor within the larger software engineering discipline. The rest of the book explores the SRE Way in detail.

[6] Vice President, Google Engineering, founder of Google SRE
[7] See Disaster Role Playing.
[8] For further discussion of how this collaboration can work in practice, see Communications: Production Meetings.
