SRE vs DevOps - Gremlin


SRE vs DevOps: Can they coexist or do they compete?

DevOps. Site Reliability Engineering (SRE). Are they different or just different names for the same thing? This article explores that question in depth by delving into each and then comparing them.

DevOps is an important paradigm shift to bridge the gap between the typically siloed development teams and operations teams. Traditionally these two teams rarely communicate, much less collaborate on work. Development writes code and throws it over the metaphoric wall to operations, whose job it is to deploy that code, with all its dependencies and configurations, and keep it running.

Site Reliability Engineering is the next stage implementation of DevOps. DevOps is a philosophy with a wide range of implementation styles available. SRE is more prescriptive about how things are to be done and what the priorities of the team explicitly are. Specifically, the job is to keep the site reliable and available, and only things that contribute to this goal are prioritized.

What is DevOps?

The shortest definition of DevOps is combining development and operations teams for the purpose of moving code into production as quickly and smoothly as possible. The philosophy behind DevOps is that teams who share the responsibilities for both writing code and keeping it running well once in production are more efficient.

What is the Role of DevOps in an Organization?

According to Google, the primary role of DevOps in an organization is to “increase software delivery velocity, improve service reliability, and build shared ownership among software stakeholders.” This is done via a cultural and organizational movement, one that requires focus and buy-in from stakeholders, because it really is a new way of thinking about software development.

What are the Benefits of DevOps?

DevOps benefits an organization by improving the speed of software delivery with more frequent releases comprised of smaller changes. This is a competitive advantage, allowing companies to bring products to market faster, whether feature additions or stability and bug fixes. We split large software into services or microservices, making updates and replacements easier and faster. Since we have trained teams overseeing them and implementing good practices like failover schemes and Chaos Engineering to enhance reliability, we minimize the opportunity for failure due to networking and messaging problems.

DevOps also improves software stability, because even though changes are pushed to production frequently, those changes are small and therefore have far less potential to cause disruption. Further, small changes are easy to roll back quickly in the event of an unforeseen problem, making it safer to push those frequent changes.

Another benefit is in the availability and security of teams’ software delivery capability. When we use a toolchain and build process frequently, we work out the problems and the process gets smoother and easier over time. Of course, we also automate it, which itself has great benefits. All this leads to reduced opportunity for errors, bugs, and security holes.

What is Site Reliability Engineering?

Site Reliability Engineering (SRE) is the outcome of combining system operations responsibilities with software development and software engineering. SREs accept a broad range of responsibility relating to software code. If they write it, they build it, they ship it, and they own it in production.

One interesting metaphor in common use is that the class SRE implements the DevOps interface. In object-oriented programming, classes often include more specific behaviors than the interfaces they implement define, and a class sometimes implements multiple interfaces. In that sense, SRE includes practices and recommendations that are sometimes more precise than, or additional to, what DevOps describes.

We define Site Reliability Engineering in detail in What is Site Reliability Engineering? A Primer for Engineering Leaders.

What are the Benefits of Site Reliability Engineering?

Where DevOps brings greater collaboration and velocity to companies, the main benefit of Site Reliability Engineering is greatly enhanced uptime. The strong focus on keeping a software platform or service running is the foundation of SRE. The goal is to keep things operational “no matter what,” meaning that significant effort and emphasis is placed on things like redundancy, disaster mitigation and prevention, and ultimately, reliability.

For an SRE, uptime is key. Even beyond what is promised, the goal is always to find better and better ways to prevent problems that can cause downtime and to keep things up and running. The unexpected happens, and we all know it, so perfection is not the focus. Instead, the focus is on learning from past problems, preventing recurrence, and anticipating as many potential problems as possible. Top-notch SREs do all of these well and are paid accordingly.

It is not a coincidence that companies such as Evernote and Home Depot with solid SRE teams can demonstrate significantly improved uptime, as shown in these case studies from Google.

What is the role of Site Reliability Engineering in an Organization?

The role of Site Reliability Engineering in an organization is to keep the organization focused on what ultimately matters to customers: the platforms and services customers want must be available when customers want to use them. Team members use a variety of tools, programming languages, and a broad skill set, making the job one that is constantly stimulating and interesting. See our sample SRE job description and interview questions article for more.

How does DevOps Work?

DevOps works by building a culture of collaboration from the beginning. Teams must work to establish trust between members. By sharing responsibility for all the stages of software development, team members can make more informed decisions about the code that they write, test, deploy, and maintain.

This flies in the face of past software development methodologies that relied on an assembly line of multi-stage testing deployments, review committees comprised of people across the business, and careful, often tedious, checklists. It has always been a challenge in a waterfall setting to get code from idea to implementation to production efficiently. Even a major bug fix from a quality software engineer would require navigating organizational silos, setting up meetings, and getting a sign-off from multiple departments, many of whom might have only a passing interest in the system or service involved. It is not uncommon for a feature update to take six to nine months to make it into production and provide value to customers. This is untenable in today’s marketplace.

Instead, DevOps teams are entrusted by the business to remember the big picture while writing code, because those same people must work together to deploy that code to production and maintain it. The very same team is responsible for bugs, outages, or anything else related to the code they have written.

Teams are empowered to experiment and innovate. They own the code. They own the process. They own the deployment. They also hold the power to make improvements and try out new ideas without approval from anyone outside the team. The team is accountable for the reliability of its code and deployment and is otherwise given wide leeway to determine its own processes, change approvals, management, and needs. This requires a cultural shift and a great degree of trust, including trust among team members and also trust from management.

Four Metrics for the Success of DevOps

Google’s Jez Humble defined four metrics for the success of DevOps:

Lead time for changes measures how much time you must plan in advance for a proposed software change to make it to production. Decreasing that is vital for increased deployment cadence. Low performers take a week or even a month. High performers only need a day or less.

Deployment frequency has a direct impact on how rapidly it is possible for software users to benefit from bug fixes and new or enhanced features. Ultimately, elite companies deploy multiple times per day!

Time to restore service is the amount of time required to bring services back up when a problem occurs. Getting your number down under one hour is ideal. Eliminating the need entirely is an unreasonable expectation in an era of increased deployment velocity that sometimes introduces breaking changes. Note that this and the next entry do not mean failure of the overall system, but only failure of an individual service. If you are using canary deployments, the failure of a new service instance should have no impact on the numerous instances of the previous stable release, and therefore there should not be a customer impact even though you encounter problems.

Change failure rate measures how frequently a deployed release has to be rolled back due to it not working properly. The best teams have a rate between zero and 15%. Things like code reviews, testing, and good design help, but our systems are so complex and under such constant change that we should expect some service failures.

How Do We Make This Happen?

How do we accomplish any of this? First, we need good measurement. Observability. We monitor our systems and use what we learn to inform our business decisions. Bottlenecks and squeaky wheels get attention, sooner rather than later.

Failure notifications are sent proactively based on data thresholds set in monitoring tools. We actively work to automate failure mitigation, and we try to trigger it at monitoring thresholds set well below actual failure levels, so that even if a node fails or networking falters, end user requests are already routed to other paths and our customers never notice there was a problem.

DevOps requires establishing a cultural norm that accidents are normal and that failures happen, and that neither should be a lightning rod for blame. Eliminating blame enhances a team’s ability to focus on fixing problems and on experimentation rather than worrying about reputations and battling anxieties. Increasing the rate of change will also increase potential failures, so DevOps cultures need to be comfortable with failure while also focusing on recovery and backups.

DevOps Technology and Tools

Some of the technical solutions that effective teams use in their DevOps workflows include:

Version control for all code, including configuration management and secrets management, using tools such as git, with GitHub, GitLab, or similar for centralized master repository access

A trunk-based development model where developers and engineers pull from a master branch frequently and push changes that are as atomic and small as possible, as frequently as possible, in separate pull or merge requests

Continuous integration using tools like Jenkins, Spinnaker, Travis CI, or similar

Deployment automation, typically using the same CI tools

Test automation, including security tests starting as early in the pipeline as possible, using tools like Selenium, Postman, mabl, or similar

Incident management, for when things go wrong, using tools such as Opsgenie, PagerDuty, VictorOps, or similar

How do DevOps and Site Reliability Engineering Compare: SRE vs DevOps

Systems fail, sometimes publicly and at great cost. Airlines have been hit with system-wide ticketing outages causing significant inconvenience, and the company responsible said, “No downtime is acceptable” as they apologized for the downtime. Costco’s website crashed for several hours on Thanksgiving Day, costing them an estimated $11 million. CenturyLink had an outage lasting over 24 hours that included disruption to the vital 911 emergency service. These are just highlights from 2019.

Can we prevent outages in an era of such great velocity? We have gone from annual software releases to daily releases, from running software as a monolith to running hundreds of microservices, from on-prem hosting on hundreds of physical hosts to Kubernetes, containers, and cloud hosts numbering sometimes into the hundreds of thousands. This is where it is vital to join DevOps with Site Reliability Engineering perspectives and implementation.

Site Reliability Engineering may be thought of as a specific implementation of DevOps, even though they were developed separately. There are many similarities in intent and foundational perspectives. Differences mainly result from a narrowing of team focus.

Both DevOps Engineers and Site Reliability Engineers begin with a belief that change is necessary to improve. No software remains stagnant. No system idles unchanged forever. Whether it is fixing bugs or evolving and adding features, things change. Capacity needs wax and wane and infrastructure cannot remain static. Everything must and will eventually change or die out.

Both have a strong focus on working together as a team with shared responsibilities and an assumption of collaboration. No one works in a silo. Ownership is shared from initial code creation to software builds to deployment to production and maintenance. Keeping everything working is everyone’s responsibility. Even if there is some role-based focus for individual team members, the responsibility remains everyone’s.

While both consider atomic changes a shared value, with reliability as the main focus, managing change is vital for SRE. Both promote making software changes as small as possible, because small deltas usually merge more smoothly and are easier to roll back when a problem arises. However, the R in SRE is “reliability,” and that focus promotes this value to a higher standing.

How these small changes are merged and then integrated into a build and deployed may differ from a tooling perspective across DevOps and SRE, but both share a strong preference for automation where possible. SRE tends to take this to the logical extreme where it can, seeking to automate the CI/CD pipeline, testing, chaos experiments, and more. SRE teams work to automate nearly every action that is performed more than once or twice by a human, removing any possible toil from the daily routine in favor of using human intellectual capacity to find and enact improvements. This may happen in a DevOps team, but it is rarely a focus.

The tools used by each type of team are generally similar and may be nearly identical, with the exception of team-written tools specific to that team’s responsibilities. The main similarity is a perspective that is focused on APIs and abstracted interactions rather than direct entanglements between systems or for administration and management tasks. Some tools are created in-house, some are adapted open source tools, and some are purchased proprietary tools.

A huge similarity is the requirement for good measurement and observability. Data, especially good data, is vital to both DevOps and SRE. One big difference is that SRE teams always focus on service level objectives (SLOs), keeping them and improving systems to maximize effectiveness based on them. DevOps teams tend to think about what the data tells them about the system, how it is running, where it is weak or failing, and so on. SREs tend to be more specifically practical, thinking about how to use the same data to improve performance on one or more SLOs, even using machine learning techniques to have systems adapt themselves to changing circumstances.

Both DevOps and SRE teams share the expectation that bad things happen. System components fail. Humans accidentally input the wrong instructions. Networks get overloaded, become latent, or fail. With this expectation, focus is put on how to prevent failures and then how to fix them quickly when prevention fails. There is no blame placed on anyone. Looking at failures after they are repaired in a blameless way, with a blameless retrospective or postmortem, permits teams to focus on how to prevent a recurrence of the same problem rather than keeping silent out of fear of repercussion. Better systems result.

The biggest difference between DevOps and SRE is not in perspective or wider philosophy. The cultures are also very similar. The biggest difference is that SRE has an intentionally narrowed focus on keeping services and platforms available to customers, while DevOps tends to focus on overall processes, which is much broader. The two have different foundational guiding principles at the lowest layer: DevOps simply believes it has found a better way to meet the needs of the company and its customers, while SRE believes it exists to keep a site reliable. It is interesting that both perspectives, developed separately, have come to embrace such astoundingly similar practices.

Does Every Engineering Organization Need Site Reliability Engineers, or Does DevOps Suffice?

To answer that question, answer this one first. Does your organization produce and maintain anything that is vital to customer success? How complex is your system?

If downtime is okay and uptime is not your main focus, perhaps DevOps will suffice. It is undoubtedly an improvement on past methods of software development, deployment, and operations.

If, however, your application or services are expected to be reliable to the level of two or more nines of uptime and availability, then the laser-like focus of SRE with its error budgets and SLOs will help remove the politics and guesses from the process. This enables you to see clearly how to most directly and effectively impact the availability and reliability of your system.

This is a bit of a trick question, because customer expectations continue to rise and downtime becomes less and less acceptable. If you don’t believe us, ask yourself if you’d tolerate an hours-long maintenance window, something that was common a few years ago.

This is especially true if we bring system architecture into the conversation. With the growing complexity of containerized microservices running on cloud services, orchestrating everything and keeping everything working together, even when components or services fail, is a major undertaking. Planning for site reliability is vital.

Is the Position of SysAdmin No Longer Relevant?

The answer is a qualified, “Yes, but.” Systems administration is still a vital part of the operations side of DevOps and Site Reliability Engineering. However, specializing in just that without learning how to work in a wider, collaborative context is a bad idea.

Specialized systems administration roles in a classic operations silo are dying out. It is simply not possible to create web application systems at scale, with the velocity of change needed to be relevant today, using the traditional siloed processes and technology.

Good SysAdmin-trained engineers are a valuable part of the new world. In fact, having some reasonable SysAdmin capabilities on your team is a must in both DevOps and SRE. Someone who knows Linux system calls, for example, may save the day when a node that the team can’t afford to destroy and replace can be brought back into service more elegantly than by killing it and spinning up a new one.

For sure, there are many classic SysAdmin jobs still out there. The legitimate worry is that the landscape is changing rapidly. Working in this capacity without also finding ways to develop the skills and experience needed to stay relevant in the wider world has the potential to bring stagnation and end careers. Today, there’s a high demand for SREs and there is a natural evolution from SysAdmin to SRE.

Can a Site Reliability Engineering Team Prevent Production Incidents?

Yes. That is their goal. They can prevent many incidents. No team can prevent all production incidents. However, look at the companies that use SRE teams and think about how long it has been since they had an incident that impacted customers. Think about the nature of the incident and how quickly it was taken care of. The data says that SRE is the way to go when uptime and a minimization of incident-related downtime and costs are key.

Summary

The ultimate answer to the question asked in this article’s title is yes, SRE and DevOps can coexist. While the two share some foundational values, the focus of their work is different. They share similar tooling and development practices. The big differentiator is that SREs have a strong and deliberate focus on keeping a site up and running; anything that does not directly contribute to that goal in a measurable way is excluded from their priorities. Sometimes, companies create wider DevOps teams with an SRE team working alongside them or as a subset of the team.
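
As a rough illustration of the four DevOps metrics described in this article (lead time for changes, deployment frequency, time to restore service, and change failure rate), the following minimal Python sketch computes them from a list of deployment records. The `Deployment` record, its field names, and the `four_metrics` function are hypothetical and exist only for this example; a real team would pull this data from its own CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # True if the release had to be rolled back
    restored_at: Optional[datetime] = None  # when service was restored after a failure


def four_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four metrics over an observation period of `period_days`."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    # Lead time for changes: commit to production, averaged.
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    mean_lead = sum(lead_times, timedelta()) / len(lead_times)

    # Time to restore service: failed deployment to restoration, averaged.
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at
                for d in failures if d.restored_at is not None]
    mean_restore = sum(restores, timedelta()) / len(restores) if restores else None

    return {
        "lead_time_hours": mean_lead.total_seconds() / 3600,
        "deploys_per_day": len(deployments) / period_days,
        "time_to_restore_minutes": (
            mean_restore.total_seconds() / 60 if mean_restore is not None else None
        ),
        "change_failure_rate": len(failures) / len(deployments),
    }
```

Averaging is only one possible choice here; a median or a percentile may describe lead time and restore time better when a few deployments are outliers.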
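
The article mentions error budgets, SLOs, and "nines" of availability without showing the arithmetic. A minimal sketch follows, assuming the usual definition that the error budget is the fraction of time not covered by the availability SLO (1 minus the target). The function names and the 30-day window are assumptions made for this example, not something the article specifies.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.99 for "two nines".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget
```

Under these assumptions, a team would compare the remaining budget against its planned release risk: a mostly unspent budget leaves room for faster change, while an overspent one argues for prioritizing reliability work.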


