Do you have an SRE team yet? How to start and assess your journey


Gustavo Franco, Customer Reliability Engineer
January 25, 2019

We're pleased to announce that The Site Reliability Workbook is available in HTML now! Site Reliability Engineering (SRE), as it has come to be generally defined at Google, is what happens when you ask a software engineer to solve an operational problem. SRE is an essential part of engineering at Google. It's a mindset, and a set of practices, metrics, and prescriptive ways to ensure systems reliability.

The new workbook is designed to give you actionable tips on getting started with SRE and maturing your SRE practice. We've included links to specific chapters of the workbook that align with our tips throughout this post.

We're often asked what implementing SRE means in practice, since our customers face challenges quantifying their success when setting up their own SRE practices. In this post, we're sharing a couple of checklists to be used by members of an organization responsible for any high-reliability services. These will be useful when you're trying to move your team toward an SRE model. Implementing this model at your organization can benefit both your services and teams due to higher service reliability, lower operational cost, and higher-value work for the humans.

But how can you tell how far you have progressed along this journey? While there is no simple or canonical answer, you can see below a non-exhaustive list to check your progress, organized as checklists by ascending order of maturity of a team. Within every checklist, the items are roughly in chronological order, but we do recognize that any given team's actual needs and priorities may vary.

If you're part of a mature SRE team, these checklists can be useful as a form of industry benchmark, and we'd love to encourage others to publish theirs as well. Of course, SRE isn't an exact science, and challenges arise along the way. You may not get to 100% completion of the items here, but we've learned at Google that SRE is an ongoing journey.
SRE: Just getting started

The following three practices are key principles of SRE, but can largely be adopted by any team responsible for production systems, regardless of its name, before and in parallel with staffing an SRE team.

- Some service-level objectives (SLOs) have been defined (jointly with developers and business owners, if you aren't part of one of these groups) and are met most months.
- There's a culture of authoring blameless postmortems.
- There's a process to manage production incidents. It may be company-wide.

Beginner SRE teams

Most, if not all, SRE teams at Google have established the following practices and characteristics. We generally view these as fundamental to an effective SRE team, unless there are good reasons why they aren't feasible for a specific team's circumstances.

- A staffing and hiring plan is in place and funding has been approved.
- Once staffed, the team may be on-call for some services while taking at least part of the operational load (toil).
- There is documentation for the release process, service setup, teardown (and failover, if applicable).
- A canary process for releases has been evaluated as a function of the SLO.
- A rollback mechanism is in place where it's applicable (though it's understood that this is a nontrivial exercise when mobile applications are involved, for example).
- An operational playbook/runbook should exist, even if not complete.
- Theoretical (role-playing) disaster recovery testing takes place, at least annually.
- SRE plans and executes project work, which may not be immediately visible to their developer counterparts, such as operational load reduction efforts that may not need developer buy-in.

The following practices are also common for SRE teams starting out. If they don't exist, that can be a sign of poor team health and sustainability issues:

- Enough on-call load to exercise incident response procedures on a regular (e.g., weekly) basis.
- An SRE team charter that's been reviewed by the appropriate leadership beyond SRE (e.g., the CTO).
- Periodic meetings between SRE and developer leadership to discuss issues and goals and share information.
- Project planning and execution are done jointly by developers and SRE.
- SRE work and positive impact are visible to developer leadership.

Intermediate SRE teams

These characteristics are common in mature teams and generally indicate that the team is taking a proactive approach to efficient management of its services.

- There are periodic reviews of SRE project work and impact with business leaders.
- There are periodic reviews of SLIs and SLOs with business leaders.
- There's a low volume of toil overall (<=50%), measured beyond "just" low on-call load.
- The team establishes an approach regarding configuration changes that takes reliability into account.
- SREs have established a plan to scale impact beyond adding scope or services to their on-call load.
- There's a rollback mechanism in case of canary failures. It may be automated.
- There is periodic testing of incident management, using a combination of role-playing with some automation in place.
- There's an escalation policy tied to SLO violations; this might be a release process freeze/unfreeze, or something else. Check out our previous post on the possible consequences of SLO violations.
- There are periodic reviews of postmortems and action items that are shared between developers and SRE.
- Disaster recovery is periodically tested against non-production environments.
- Teams measure demand vs. capacity and use active forecasting to determine when demand might exceed capacity.
- The SRE team may produce long-term plans (e.g., a yearly roadmap) jointly with devs.

Advanced SRE teams

These practices are common in more senior teams, or sometimes can be achieved when an organization or set of SRE teams share a broader charter.

- At least some individuals on the team can claim major positive impact on some aspect of the business beyond firefighting or ops.
- Project work can be and is often executed horizontally, positively impacting many services at once as opposed to linearly or worse per service.
- Most service alerts are based on SLO burn rate.
- Automated disaster recovery testing is in place and positive impact can be measured.

Another set of SRE "features" that may be desirable, but are unlikely to be implemented by most companies:

- SREs are not on-call 24x7. SRE teams are geographically distributed in two locations, such as U.S. and Europe. It's worth pointing out that neither half is treated as secondary.
- SRE and developer organizations share common goals and may have separate reporting chains up to SVP level or higher. This arrangement helps to avoid conflicts of interest.

What should I do next?

Once you've looked through these checklists, your next step is to think about whether they match your company's needs.

For those without an SRE team where most of the beginner list is unfilled, we'd highly recommend reading the associated SRE Workbook chapters in the order they have been presented. If you happen to be a Google Cloud Platform (GCP) customer and would like to request CRE involvement, contact your account manager to apply for this program. But to be clear, SRE is a methodology that will work on a huge variety of infrastructures, and using Google Cloud is not a prerequisite for pursuing this set of engineering practices.

We'd also recommend attending existing conferences and organizing summits with other companies in order to share best practices on how to solve some of the blockers, such as recruiting.

We have also seen teams struggling to fill out the advanced list because of churn. The rate of systems and personnel changes may be a deterrent to getting there. In order to avoid teams reverting to the beginner stage and other problems, our SRE leadership reviews key metrics per team every six months. The scope is narrower than the checklists above because several of the items have now become standard.

As you may have guessed by now, answering the central question in this article involves addressing and attempting to assess a given team's impact, health, and most importantly, how the actual work is done. After all, as we wrote in our first book on SRE: "If we are engineering processes and solutions that are not automatable, we continue having to staff humans to maintain the system. If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings."

So yes, you might have an SRE team already. Is it effective? Is it scalable? Are people happy? Wherever you are in your SRE journey, you can likely continue to evolve, grow and hone your team's work and your company's services.

Learn more here about getting started building an SRE team.

Thanks to Adrian Hilton, Alec Warner, David Ferguson, Eric Harvieux, Matt Brown, Myk Taylor, Stephen Thorne, Todd Underwood and Vivek Rau among others for their contributions to this post.
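One item in the advanced checklist is that most service alerts are based on SLO burn rate. The post itself does not define a formula, so the following is only a minimal sketch under stated assumptions: burn rate is taken here as the observed error rate divided by the error rate the SLO allows (the error budget). The function names, the 99.9% target, and the 14.4 paging threshold are illustrative choices, not values from this post; 14.4 happens to be the rate at which 2% of a 30-day budget is spent in one hour (0.02 x 720 hours).

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    1.0 means the budget would last exactly the SLO window;
    higher values mean it would run out sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic observed, so no budget was consumed
    error_budget = 1.0 - slo_target  # allowed fraction of bad events
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget


def should_page(bad_events: int, total_events: int,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Hypothetical paging rule: alert when the burn rate reaches the threshold."""
    return burn_rate(bad_events, total_events, slo_target) >= threshold


if __name__ == "__main__":
    # 20 failed requests out of 1,000 against a 99.9% SLO.
    print(round(burn_rate(20, 1000, 0.999), 2))
    print(should_page(20, 1000))
```

In practice the event counts would come from a monitoring system over a fixed lookback window, and teams often combine more than one window and threshold; this sketch covers only the core ratio.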


