Who owns on-call? - Increment


When something goes wrong with a software application, service, or system, someone needs to be responsible for figuring out what went wrong and fixing it. In the tech industry, this set of tasks is usually referred to as “being on-call” for that software. Similarly to the practice of doctors being on-call at a hospital, a set of engineers is placed on an on-call rotation (meaning that they share the on-call responsibility with a team, and everyone on the team takes turns being on the rotation). During their on-call shifts they are paged any time something breaks (usually via an automated push notification on their smartphone, a text, or a call), and they are responsible for quickly responding to the page, fixing what broke, and making sure that the same problem never happens again. On-call engineers are the “first responders” of software engineering.

Historically, the responsibilities required to run large software applications and systems have been divvied up between two kinds of teams: so-called “development” teams, who are responsible for all tasks associated with building and adding new features to applications and systems, and so-called “operational” teams, who are responsible for running and maintaining them. On-call responsibilities have been viewed for a long time as being part of the operational workload, and developers have rarely been on-call for the software they build.

In the past several years, everything in the industry changed. It’s difficult to pinpoint exactly when the industry changed its mind about on-call responsibilities, but the “who”, the “where”, and the “why” are relatively straightforward to uncover and understand. To determine the state of the industry, Increment spoke with over thirty industry leaders about the “who” and the “why”; what follows is what we learned from those conversations about the industry-wide movement to put developers on-call for their software.

The majority of the companies we surveyed used to divide engineering tasks between their technical teams in the old way: their development teams wrote the code (and sometimes did the testing), and then threw the debugging, the testing, the running, and the maintenance of the code over to an operational team. Over the past few years, most of these companies discovered that this approach to running software simply didn’t scale, and that developers felt a lack of ownership when they weren’t on-call for the code they wrote. Most importantly, this lack of ownership translated into unreliable systems being built and run. To fix these scalability and reliability problems, they moved the operational workload onto the development teams, who quickly (though not painlessly) learned to build better, more resilient systems.

Google famously was one of the first companies in the tech world to realize that the old way of doing operational work wouldn’t and couldn’t scale at the level their systems required, so they created a new role for “Site Reliability Engineers” (SREs). These new SREs approached the operational tasks with a software engineering mindset: they automated away all of the operational grunt-work, and made the systems run more reliably. Nowadays, SREs at Google run, maintain, and are on-call only for the most important and stable services (like Ads, Gmail, and Search), while development teams carry the operational workload for other non-stable, non-critical services (which aren’t staffed by SREs). The SRE approach to operations is now credited with the success of Google’s systems, a success that much of the industry has tried to emulate by adopting the SRE role and practices. However, many industry adopters have taken the SRE title without also adopting the SRE mindset or Google’s requirement that SREs only run and maintain stable systems: Google requires development teams to run their own services if those systems aren’t stable.

Spotify was one of the companies that adopted the SRE role early on, and treated SREs as typical operations engineers. In Spotify’s early days, their small SRE team was responsible for all operational work, including being on-call for all Spotify systems. As the company grew, and the operational workload grew alongside it, Spotify’s leadership discovered that they couldn’t hire SREs quickly enough to meet the operational demands. The only scalable solution they found was moving the on-call responsibilities to the development teams.

Airbnb discovered that having a separate operations team “creates a divide and simply doesn’t scale,” says Airbnb SRE manager Joey Parsons, and “it puts the onus of responsibility for fixing an issue on the wrong team.” Airbnb decided to put developers on-call for their systems, taking the stance that if developers can deploy to production whenever they want, then they should be the ones fixing problems caused by their services and deployments. Though Airbnb has SREs that work closely with development teams, their SREs focus only on improving reliability across systems, and they are the only team that is not on-call for any of Airbnb’s services. Many other companies, like Pinterest and New Relic, have followed a similar approach to that of Airbnb: developers are on-call for their services, but have SREs working alongside them (usually “embedded” within the team) to make sure that the development teams are following industry best practices for on-call and general service reliability.

Some companies, like Datadog, DigitalOcean, and Dropbox, have focused on taking a shared, holistic approach to on-call responsibilities, and have put both development and operations teams on-call for services together. At Datadog, engineering leadership was determined to avoid an ops/dev split from the very start, and so they ensured that operational tasks were distributed between ops and dev teams. Importantly, SREs and developers at Datadog share the on-call rotations, ensuring that every on-call shift is staffed by both experts in the code (developers) and experts in reliability (the SREs). Dropbox takes a similar approach, viewing on-call responsibility as something that both development and SRE teams need to own. DigitalOcean has both development teams and operational teams on-call, but with a twist: development teams are on-call for their services, while operations teams are on-call for the interactions between the services.

PagerDuty, on the other hand, has what engineering manager Sweta Ackerman refers to as a “you build it you own it” and “end-to-end ownership” model: SREs are on-call and responsible for low-level infrastructure (like hardware, middleware, communication, databases, etc.), while developers are on-call and responsible for everything on top of that infrastructure (including development, deployment, monitoring, and the hardware they run their services on). Ackerman says that PagerDuty had to switch to the shared-responsibility model two years ago, in an effort to ship features more quickly, encourage teams to “control their own destinies,” and to “reduce [inter-team] dependencies.” The company has found the model wildly successful.

Amazon is famous (or, rather, infamous) for practically doing away with the operational role altogether, and was one of the first industry leaders to do so. Throughout all engineering organizations at Amazon (including AWS), developers are responsible for all development and operational tasks associated with their services. Putting the onus on developers to run, maintain, and be on-call for their services is part of Amazon’s cultural emphasis on “ownership”: you don’t “own” the code you write, Amazon says, unless you run and maintain it, too.

Netflix takes an approach similar to Amazon’s, with the motto “You own it, you run it.” Development teams at Netflix are on-call for their services 24/7, and there’s a Core SRE team that monitors services at a very high level and engages development teams only when large-scale outages occur. According to Netflix SRE Manager Blake Scrivener, “When something goes wrong [at Netflix], which our automation doesn’t handle correctly, we want the experts in the service to be immediately available to make the repair and [bring] stability to the customer experience… when things are broken, we want people with the best context trying to fix things.” In an engineering environment where services are being deployed multiple times a day, the people with the best context are almost always the development teams.

Out of all of the companies we surveyed, only Slack still had anything resembling an old-school operations team. Slack’s operations team, which is on-call for all of Slack’s services, is spread across the globe and uses a follow-the-sun rotation, with operations engineers located in Melbourne, Dublin, and San Francisco. “The decision to put operators on-call as the first responders is as old as the company itself,” says Richard Crowley, Director of Operations at Slack, because “historically, the things that broke tended to ultimately have contributing factors like hardware failures or network partitions.” Crowley says that they’ve recently started to see scalability problems with the old way of operations, however, which led Slack to create a secondary on-call rotation full of developers; software and performance bugs, which only the development teams know how to fix, are becoming much more common than low-level infrastructure problems, he says. Given the industry trend, we don’t think it’ll be long before Slack joins the rest of the industry and puts their development teams on-call for all of their services.

Artwork by Mark Conlan (markconlan.com)


