The Essential Guide to SRE - Blameless

Why Site Reliability Engineering

In the world of technology, the stakes have never been higher. The move to the cloud and microservices to maximize agility has given way to digital disruptors and unprecedented competitive threats. As distributed systems become increasingly complex, the scale of ‘unknown unknowns’ increases. On top of this, customer expectations are sky-high. The cost of downtime is catastrophic, with customers willing to churn if their needs are not promptly met.

According to Gartner, the average cost of downtime is $300,000 per hour. For some companies, this number is considerably higher; for example, Amazon lost approximately $90 million during their Prime Day outage in 2018, and the outage only lasted 75 minutes.

Organizations need to prioritize reliability so they can innovate as quickly as possible on top of a strong foundation that won’t compromise customer experience. This will become even more critical as more businesses move toward distributed systems with high reliability requirements. That’s where site reliability engineering (SRE) comes in.

The SRE function is growing quickly (30-70% YoY growth in job listings), but there is not enough skilled talent in the market to compensate. In other words, it will be important to understand how you can not just hire SREs, but grow your existing organization to adopt the practices and mindsets required for production excellence.

With the shortage of SREs for hire, what can you do to ensure your service’s reliability? To answer this question, you’ll need a deeper understanding of what SRE actually is.

What is SRE?

SRE is a practice first coined by Google in 2003 that seeks to create systems and services that are reliable enough to satisfy customer expectations. Since then, many large organizations such as LinkedIn and Netflix have adopted SRE best practices. In recent years, SRE has become more widely adopted by many organizations globally, with the goal of reliability and resilience in mind in light of exponentially growing customer expectations as well as systems complexity.

SRE is based on a customer-first mentality. This means that SRE efforts are all tied to customer satisfaction, even if the customers using the service are actually internal users. Each decision should result in protecting or improving customer satisfaction. Teams work together to determine which factors and experiences affect customer happiness, measure them, set goals, and balance reliability requirements with the innovation velocity required to stay viable in an increasingly competitive digital landscape.

To achieve this goal, SREs and teams that have adopted SRE best practices refer to several key tenets of SRE. According to Google, these include:

- Ensuring a durable focus on engineering
- Pursuing maximum change velocity without violating a service level objective (SLO)
- Monitoring, including alerts, ticketing, and logging
- Emergency response
- Change management
- Demand forecasting and capacity planning
- Provisioning, and
- Efficiency and performance

According to Forrester, 46% of the tenets can be applied out-of-the-box for most software teams in the enterprise, but the rest require customizations or won’t make sense for the vast majority of organizations. The important question to ask yourself is how these tenets fit in with what you’re already doing, and how your teams can improve. We’ve got more answers below.

Understanding How SRE Fits Into Your Operations Model

A common early mistake in adopting SRE best practices is assuming that following them means you’ll need to rip and replace your current procedures, which simply isn’t true. In fact, SRE can work as a complement to both DevOps and ITIL methodologies. The trick is to ensure that regardless of your organization’s different operating models or toolchains, there is shared visibility, communication, and collaboration across teams. This will allow your disparate teams to stay aligned while using the best practices from each methodology.

How SRE works with DevOps

Think of SRE as the practice that brings life to the DevOps philosophy. The core principles of DevOps and SRE are nearly identical. According to Google’s course on SRE, “class SRE implements DevOps,” the 5 DevOps principles are as follows:

- Reduce organizational silos: SRE helps by sharing ownership across developers and production teams, and unifying tooling.
- Accept failure as normal: Blameless postmortems are an SRE best practice that ensures that all incidents are used as learning opportunities. SRE also creates a safe space and guardrails for failure through SLOs and error budgets.
- Implement gradual change: This is done by canarying rollouts to a small subset of customers before allowing all users to interact with new features. Smaller changes are easier and safer to dissect and iterate on.
- Leverage tooling and automation: SREs work to eliminate toil by measuring it and creating automation to do repetitive tasks without needing human intervention. This way, humans can focus on higher-value work.
- Measure everything: SRE specifically focuses on measuring toil and reliability to make sure that both customers and software teams are happy with the service.

With these common principles defined, it’s easy to see how SRE and DevOps fit really well together, with SRE codifying practices that make it easier to achieve the promises of DevOps.

How SRE works with ITIL

In practice, ITIL and SRE can also make for a great combination. The first reason why is simple: every organization wants happy customers, and ITIL and SRE can help different functions work together to make that a reality. Embedding reliability throughout the software lifecycle can ensure a higher rate of customer happiness. With the newest revision of ITIL (ITIL 4), which introduces seven guiding principles, SRE and ITIL align even more closely.

- Start Where You Are: Adopting SRE best practices is not one-size-fits-all, and everyone starts somewhere. Taking the first steps and implementing and iterating as you go is what matters most.
- Keep it Simple and Practical: In the Google SRE book’s chapter on simplicity, it states “Unlike just about everything else in life, ‘boring’ is actually a positive attribute when it comes to software! We don’t want our programs to be spontaneous and interesting; we want them to stick to the script and predictably accomplish their business goals.” Simplicity in both software and business operations streamlines communication, increases velocity, and helps ensure that reliability isn’t compromised. Less is more.
- Optimize and Automate: One of the goals of SRE is to automate toil-heavy processes, and free up developer time to focus on innovation instead of unplanned work. This optimizes workflows and allows new features to ship faster.
- Progress Iteratively with Feedback: SREs set alerts for the most important and user-centric metrics. The metrics, alerts, and SLOs they’re tied to are all iterated upon to better satisfy customer needs.
- Collaborate and Promote Visibility: SRE is culturally collaborative. It focuses on a blameless work culture that values learning from failure, and trusting that each team member is doing what he or she thinks is best for the organization.
- Focus on Value: Without customers, there is no value in software. Business value is created when customers want, and get, what they need from a product. SRE best practices ensure that the product is reliable enough to provide value to the customers, and also protect the most important customer journeys. Thus, they provide significant value to the organization in helping to drive shared focus.
- Think and Work Holistically: By breaking down silos and focusing on scalability and reliability on a holistic level, SREs are able to provide significant benefits in maturing the organization. Business-wide success is in the hands of every team member, and SREs work to make sure that the company’s product, systems, and procedures are resilient enough to not just meet but exceed customer standards.

For a visual on how SRE, DevOps, and ITIL’s best practices can be used in conjunction with each other, here is a handy graph.

[Figure: how SRE, DevOps, and ITIL best practices can be used together]

Whether you identify as a DevOps or ITIL shop, your organization has something to gain by following the principles of SRE. Let’s dive into what exactly these principles entail.

Principle #1: Create a Mindset of Resiliency

Resiliency isn’t something that just happens; it’s a result of dedication and hard work. To reach your optimal state of resilience, there are some crucial SRE best practices you should adopt to strengthen your processes.

Incident Playbooks

As you know, failure is not an option… because actually, it’s inevitable. Things will go wrong, especially with growing systems complexity and reliance on third-party service providers. You’ll need to be prepared to make the right decisions fast. There’s nothing worse than being called in the wee hours of a Sunday morning to handle a situation where thousands of dollars are going down the drain every second. Your brain is foggy, and you’ll likely need time to adjust to the extreme pressure of a critical incident. In these cases (and really, all cases where an incident is involved), incident playbooks can help guide you through the process and maximize the use of time.

According to Chris Taylor at Taksati Consulting, good incident playbooks help you cover all your bases. They typically include flowcharts and checklists to depict both the big picture and the minute details, a RACI (responsible, accountable, consulted, informed) chart for each step, and a list of environmental influences that are unique to your system.

To create your incident playbook, Chris recommends aggregating the following information:

- An inventory of relevant tools
- The right personnel/subject matter experts to engage in response
- Knowing the problem to solve, or the workflow you’re trying to document
- Current state (whether this is a new process, or updating an old one)

By developing incident playbooks and practicing running through them, you’ll be more prepared for the inevitable.

Change Management

Change management is often done haphazardly, if at all. This means that organizations are unable to manage the risk of pushing new code, possibly leading to more incidents. Rather than employ ITIL’s arduous CAB method, SRE seeks to empower teams to push code according to their own schedule while still managing risk. To do this, SRE uses SLOs and error budgets.

SLOs, or service level objectives, are internal goals for service availability and speed which are set according to customer needs. These SLOs serve as a benchmark for safety. Each month, you have a certain allowable amount of downtime determined by your SLO. You can use this downtime to push new features. If a feature is at risk of exceeding your error budget, it cannot be pushed until the next window. If the feature is low to no risk to your SLO, then you can push it. Each month teams should aspire to use the entirety of, but not exceed, their error budgets.

This way, your organization can optimize for innovation, but do so safely without risking unacceptable levels of customer impact.

Capacity Planning

Black Friday outages, scaling, moving to cloud. All of these big events require heightened capacity planning. If you don’t have enough load balancers on Black Friday or Cyber Monday, you might be sunk. Or, if your company is simply growing quickly, you’ll need to adopt best practices to make sure that your team has everything it needs to be successful.

There are two types of demand that require additional capacity: the first is organic demand (this is your organization’s natural growth) and the second is inorganic demand (this is the growth that happens due to a marketing campaign or holiday). To prepare for these events, you’ll need to forecast the demand and plan time for acquisition.

Important facets of capacity planning include regular load testing and accurate provisioning. Regular load testing allows you to see how your system is operating under the average strain of daily users. As Google SRE Stephen Thorne writes, “It’s important to know that when you reach boundary conditions (such as CPU starvation or memory limits) things can go catastrophic, so sometimes it’s important to know where those limits are.” If your service is struggling to load balance, or the CPU usage is through the roof, you know that you’ll need to add capacity in the event of increased demand.

That’s where provisioning comes in. Adding capacity in any form can be expensive, so knowing where you need additional resources is key. It’s important to routinely plan for inorganic demand so you have time to provision correctly. The process of adding capacity can sometimes be a lengthy effort, especially if it’s the case of moving to cloud. You’ll also need to know how many hands you’ll need on deck for these momentous occasions.

Capacity planning is an important part of having a resilient system because in thinking about the allocation of resources, your team members matter. They need time off for holidays, personal vacations, and the obligatory annual cold. When you fail to plan for time off, you won’t have enough hands on deck to handle incidents as they occur. Denying people time off is obviously not the answer, as that will only lead to burnout and churn. So it’s important to develop a capacity plan that can accommodate people being, well, people.

Johann Strasser shares four steps you can take to develop a capacity plan that will eliminate staffing insecurity:

1. Establish all necessary processes with the appropriate staff, from top management to team leaders. Decide how often you will need to revise/revisit this process and make sure that everyone is in agreement on this.
2. Provide for complete and up-to-date project data and prioritize your projects. What projects are the most important, and which can be put on the back burner for now? Additionally, how long will each project take? You’ll need this data to be able to move forward with accurate plans.
3. Identify the capacities across your existing team, as well as your infrastructure and services. Is the team equipped and system architected in a way that minimizes performance regressions, to protect efficiency and capacity?
4. Consolidate the requirements (step 2) and the capacities (step 3). Identify underload as well as overload and try to balance them.

So, now you’ve got the people and the process, but how can you learn and improve on your resilience? For that, you’ll need great postmortem practices in place that facilitate real introspection, psychological safety, and forward-looking accountability.

Postmortem (or Incident Retrospective) Best Practices

When something goes wrong, it’s important to learn from it to prevent the same mistake from happening again. To do this, it’s important to craft and analyze postmortems (or post-incident reviews, RCA reports, or whatever you like to call them). To have postmortems worthy of analysis, applying SRE best practices will be key.

In fact, postmortems are a great place to begin your SRE adoption journey. As Steve McGhee, SRE Leader at Google shares, “Conducting blameless postmortems will enable you to see gaps in your current monitoring as well as operational processes.” Armed with better monitoring, you will find it easier and faster to detect, triage, and resolve incidents. More effective incident resolution will then free up time and mental bandwidth for more in-depth learning during postmortems, leading to even better monitoring.

Building a postmortem practice will eventually enable you to identify and tackle classes of issues, including fixing deeply rooted technical debt. With time, you’ll be able to directly improve systems continuously.

One of the most important elements of a postmortem, and of SRE as a whole, is the notion of blamelessness. To learn from postmortems, there needs to be total transparency. Opening up about mistakes can often be frightening, and requires a psychologically safe space to do so. Positive intent should always be assumed in order to foster the trust that allows for true openness. Blaming team members or defining people as the root cause for failure will only lead to more insecurity, covering up the important truths that postmortems are meant to uncover.

To craft great postmortems, there are four other best practices that will ensure your incidents are being used to their full
advantage:

- Use visuals in your postmortems: As Steve McGhee says, “A ‘what happened’ narrative with graphs is the best textbook-let for teaching other engineers how to get better at progressing through future incidents.” Graphs provide an engineer with a quickly readable yet in-depth explanation for what was happening during the incident days, weeks, or even years later.
- Be a historian: Timelines can be invaluable for parsing through a particularly dense incident. Chat logs can be cluttered, and it’s difficult to quickly find what you’re looking for. Thus, it’s important to have a centralized timeline that gives a clean, clear summary of the events. This also provides the context that helps relevant team members analyze what happened.
- Tell a story: An incident is a story. To tell a story well, many components must work together. Without sufficient background knowledge, this story loses depth and context. Without a timeline dictating what happened during an incident, the story loses its plot. Without a plan to rectify outstanding action items, the story loses a resolution.
- Publish promptly: Promptness has two main benefits: first, it allows the authors of the postmortem to report on the incident with a clear mind, and second, it soothes affected customers. Best-in-class companies like Google, Uber, and others have internal SLOs around publishing their postmortems within 48 hours.

Creating incident playbooks, utilizing change management and capacity planning, and following postmortem best practices will all contribute to your system’s resilience, but that’s not all that SRE seeks to do.

Principle #2: Reduce Engineering Problems/Innovation Blockers

Focusing on the customer has been a key business strategy since the beginning of time. But how do you really know what your customers want, and how can you guarantee you’re providing it? SRE’s concept of SLIs (service level indicators), SLOs (service level objectives), and error budgets will keep your organization aligned on what customer success looks like.

Service Level Indicators

When you look at your product through the eyes of your user, you aren’t just finding the right SLIs, but creating key information for constructing a user journey. A user journey is a powerful tool for many aspects of product design as it helps designers focus on users’ priorities. The lessons you learn from developing and analyzing user journeys can be insightful in the most fundamental areas of product design, but for these insights to be accurate, the underlying data must be carefully selected.

The touchpoints between the user and your service all involve requests and responses, the building blocks of SLIs. For each touchpoint you identify, you should be able to break down the specific SLIs measuring that interaction. From there, you can follow each branch that the user could take, gathering the SLIs for the following requests into a bundle for that journey.

To understand user intent, you must identify potential pain points for the chosen journey. Your bundle of SLIs can be instrumental in finding pains that might otherwise be invisible. Let’s say that a user’s channel involves making a dozen requests to the same service component, like clicking through many pages of search results. Separately, these requests return quickly enough that users won’t be bothered, maybe under a second, and a user looking at just one or two pages will be satisfied with this speed. However, if your user journey involves looking through twenty pages, the annoyance of nearly a second wait, repeated twenty times, could be intolerable. Only through looking at relevant monitoring data as well as understanding the broader context could you discover this point of user frustration.

Finding these pain points along the user journey could lead to a radical redesign of the service as a whole. Additionally, it opens up a path to solutions deep in the backend and helps determine priorities for development. In our example above, you could either redesign the catalog to avoid the need to look through twenty pages, or you could optimize the components serving those pages until the total delay across the twenty pages is still acceptable.

Once you identify what makes your customer happy, it’s important to set goals to reach them.

Service Level Objectives

Service Level Objectives, or SLOs, are an internal goal for the essential metrics of a service, such as uptime or response speed, and correlate to customer happiness. As SLOs are always set to be more stringent than any external-facing agreements you have with your clients (SLAs), they provide a safety net to ensure that issues are addressed before the user experience becomes unacceptable. For example, you may have an agreement with your client
that the service will be available 99% of the time each month. You could then set an internal SLO where alerts activate when availability dips below 99.9%. This provides you a significant time buffer to resolve the issue before violating the agreement:

- Service Level Agreement with Clients: 99% availability, 7.31 hours acceptable downtime per month
- Service Level Objective Internally: 99.9% availability, 43.83 minutes acceptable downtime per month
- Safety Buffer: 6.58 hours

Knowing that you’ll have over six and a half hours between your internal objective and an agreement breach can provide some peace of mind as you deploy. However, it can be difficult to determine a buffer that provides sufficient time to respond when disruptions occur. Garrett Plasky, who previously led Evernote’s SRE team, describes this challenge: “Setting an appropriate SLO is an art in and of itself, but ultimately you should endeavor to set a target that is above the point at which your users feel pain and also one that you can realistically meet (i.e. SLOs should not be aspirational).”

It may be tempting from a management perspective to set an SLO of 100%, but it just isn’t realistic. Development would be paralyzed by fear that the smallest change could trigger an SLO breach. Moreover, such a high target isn’t helpful. As Garrett points out, the SLO should still be set above the point where the users of the service are pained, as any refinement beyond that quickly gives diminishing returns for additional user satisfaction.

Setting SLOs can also positively impact development velocity by giving developers the opportunity to use small amounts of downtime to improve the service. This amount of time allowed is called an error budget.

Error Budgets

Error budgets are the amount of downtime that can be spared per window before violating an SLO. Setting error budgets can positively impact your organization in many ways.

First, it can increase the rate of innovation. Developers no longer need to spend time consulting with other teams before doing a code push, as long as the push won’t endanger the SLO and falls within the error budget. They can spend down the error budget on new features, or choose to allocate time instead to fixing technical debt or infrastructure. This also ensures that pushes don’t threaten the reliability of your system or customer satisfaction.

Beyond increasing
innovation, error budgets also align different parts of the organization on incentives and consequences. With an error budget in place, developers can push code as fast as they need to without compromising reliability. Thus, developers, product, and production teams are all happy. If error budgets are overextended for a certain period of time, there are also consequences predetermined by the error budget policy, such as a code freeze.

Improve Engineering Efficiency and Morale

SRE not only helps customers stay happy, it also boosts morale within the organization. Happy engineers mean happy customers, as engineers won’t build the best products possible without support from the organization. There are two major ways that SRE can help brighten engineering’s day.

- Minimizing toil: One of the main focuses of SRE is automation. Toil is a waste of precious engineering time, and when SREs create frameworks, processes, and internal tooling to eliminate it, engineers can get back to innovating.
- Reducing tech debt: SREs create accountability around postmortem follow-up action items to make sure that old issues aren’t buried under new code. SREs also put together frameworks to help developers deliver more performant code, prioritizing what matters most to the customer experience. Pinpointing the tech debt build-up that hurts customer experience is important to guide refactoring initiatives and other practices to help teams spend less time on reactive, unplanned work and more time on the things that matter for the business. This establishes a baseline for healthy engineering practices to help minimize future accrual of tech debt.

Additionally, SREs invest in cultural change that prevents more tech debt from accruing in the future, while still making way for innovation. Jean Hsu, Co-Founder of Co Leadership, wrote about refactoring Medium's codebase, and realized that the most important thing she could do for her team wasn’t just to fix spaghetti code; it was to create a culture that fixes technical debt as it goes along, deleting old code as needed.

Jean wrote, “I realized that if I always did this type of work myself, I would be constantly refactoring, and the rest of the team would take away the lesson that I'd clean up after them. Though I did enjoy it myself, I really wanted to foster a long-term culture where engineers felt pride and ownership over this type of work.”

SREs are often the cultural drivers for this sort of work, improving the way engineering teams function as a whole rather than simply going from project to project fixing bugs. These changes are long-term initiatives that spark growth and adoption of best practices for the entire organization.

As you can see, SRE could positively impact each engineer’s day-to-day productivity. In fact, SRE is not about tooling or job titles, and is rather a more human-centric approach to systems as a whole. With this context in mind, adoption brings positive business benefits for everyone in the organization.

Principle #3: Approach Systems from a Human Perspective

Resiliency engineering as a practice looks at systems holistically, considering not only infrastructure but also human, process, and cultural factors. Without adopting the culture and mindset behind SRE, you’ll simply have new processes with no uniting value at the center to keep the initiative in place.

Focusing on the human approach to systems requires reevaluating your organization’s attitude towards the following: on-call and full service ownership practices, keeping burnout at bay, and celebrating failure.

Any organization can adopt SRE best practices, and it can begin in small increments. The most important change you will make will be the cultural one. As organizations are made of people, any organization can foster continuous learning, blameless culture, and psychological safety so long as its people are committed to a growth mindset. Once these cultural factors are in place, it becomes much easier to implement the practices, processes, and tools that scale that culture of excellence.

To dive deeper and get more bonus reading material on the above topics, download your copy of The Essential Guide to SRE.
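
As a worked illustration of the SLA, SLO, and safety buffer figures in the Service Level Objectives section, here is a minimal Python sketch. It assumes an "average month" of 365.25 / 12 days, which appears to be how the 7.31 hour and 43.83 minute figures were derived; the function names are hypothetical and not part of any Blameless or Google tooling.

```python
# Assumption: one "average month" is 365.25 / 12 days (730.5 hours).
HOURS_PER_MONTH = 365.25 / 12 * 24


def allowed_downtime_hours(availability_pct: float) -> float:
    """Downtime allowed per month, in hours, for an availability target."""
    return HOURS_PER_MONTH * (100.0 - availability_pct) / 100.0


def budget_remaining_hours(slo_pct: float, downtime_so_far_hours: float) -> float:
    """Error budget left this month; a negative value means the SLO is breached."""
    return allowed_downtime_hours(slo_pct) - downtime_so_far_hours


if __name__ == "__main__":
    sla_hours = allowed_downtime_hours(99.0)   # external agreement
    slo_hours = allowed_downtime_hours(99.9)   # stricter internal objective
    print(f"SLA downtime per month: {sla_hours:.3f} hours")
    print(f"SLO downtime per month: {slo_hours * 60:.2f} minutes")
    print(f"Safety buffer:          {sla_hours - slo_hours:.3f} hours")
    # Hypothetical month with 30 minutes of downtime already recorded.
    print(f"Budget left after 0.5 h: {budget_remaining_hours(99.9, 0.5) * 60:.2f} minutes")
```

A push that is expected to cost more downtime than `budget_remaining_hours` returns would, under the policy described above, wait for the next window.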

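
The search-results example in the Service Level Indicators section, where each request is fast enough on its own but the whole journey is not, can be sketched the same way. The `Step` type, the 0.9 second latency, and both thresholds below are illustrative assumptions, not measured values or recommended targets.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    latency_s: float  # latency SLI for one request in the journey, in seconds


def journey_latency(steps: list[Step]) -> float:
    """Total wait a user experiences across every request in the journey."""
    return sum(step.latency_s for step in steps)


def journey_check(steps: list[Step], per_request_limit_s: float,
                  journey_limit_s: float) -> tuple[bool, bool]:
    """Return (any single request too slow, whole journey too slow)."""
    any_step_slow = any(s.latency_s > per_request_limit_s for s in steps)
    journey_slow = journey_latency(steps) > journey_limit_s
    return any_step_slow, journey_slow


if __name__ == "__main__":
    # Hypothetical journey: twenty result pages at 0.9 seconds each.
    pages = [Step(f"results-page-{i}", 0.9) for i in range(1, 21)]
    step_slow, total_slow = journey_check(pages, per_request_limit_s=1.0,
                                          journey_limit_s=10.0)
    print(f"any single request too slow: {step_slow}")
    print(f"whole journey too slow:      {total_slow}")
```

Checking the bundle as well as each request is what surfaces the pain point: no individual SLI crosses its limit, yet the journey total does.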

