What's An SRE? Site Reliability Engineer Roles and... - Splunk
What’s An SRE? Site Reliability Engineer Roles and Responsibilities
By Stephen Watts, June 27, 2022

DevOps gained popularity in order to combat siloed workflows, decreased collaboration and a lack of visibility across the software development lifecycle. While establishing a culture of DevOps has helped teams collaborate better and deliver reliable software faster, DevOps teams don’t necessarily have someone specifically dedicated to developing systems that increase site reliability and performance.

That’s where a site reliability engineer (SRE) comes into the picture.

Site reliability engineers sit at the crossroads of traditional IT and software development. Basically, SRE teams are made up of software engineers who build and implement software to improve the reliability of their systems.

So, in this article, let’s…
- Define the basic roles and responsibilities of a site reliability engineer.
- Show how SRE can drastically improve the resilience of your people, processes and technology.

SRE overview

Site reliability engineering was originally developed by Google. In the words of Ben Treynor, SRE is “what happens when you ask a software engineer to design an operations function.”

In a traditional setup of siloed IT operations and software development teams, developers would throw their code over to IT professionals. Then, IT would be in charge of deployment, maintenance and any on-call responsibilities associated with the system in production. Luckily, DevOps came along and forced developers to share accountability for systems in production, own their code and take on-call responsibilities.

DevOps pushed shared responsibility for the reliability of your applications and infrastructure. And, while this is a great first step forward, it doesn’t proactively help teams add resilience to their system. Many DevOps teams, even with shortened feedback loops and improved collaboration, can still find themselves deploying new, unreliable services into production at a rapid pace.

Site reliability engineering is a way to bridge the gap between developers and IT operations, even in a DevOps culture. It isn’t SRE versus DevOps; it’s SRE with DevOps. SRE is kind of like a more proactive form of quality assurance (QA). Site reliability engineers will be dedicated full-time to creating software that improves the reliability of systems in production. Their work includes:
- Fixing issues
- Responding to incidents
- Usually taking on-call responsibilities

Aside from its growing role today, SRE’s biggest claim to fame might be the four golden signals of monitoring:
- Latency
- Traffic
- Errors
- Saturation

Common SRE roles and responsibilities

Implementing an SRE team will greatly benefit both IT operations and software development teams. Not only can SRE bring greater reliability to systems in production, but it will likely help IT, support and development teams spend less time working on support escalations, giving them focused time to build new features and services.

So, let’s look at common site reliability engineering roles and responsibilities you can expect to see.

Building software to help DevOps, ITOps & support teams

SRE teams are in charge of proactively building and implementing services to make IT and support better at their jobs. This can be anything from adjustments to monitoring and alerting to code changes in production. A site reliability engineer can be tasked with building a homegrown tool from scratch to help with weaknesses in software delivery or incident management.

Fixing support escalation issues

Similar to the point above, a site reliability engineer can expect to spend time fixing support escalation cases. But, as your SRE operations mature, your systems will become more reliable and you’ll see fewer critical incidents in production, leading to fewer support escalations.

Because an SRE team touches so many different parts of the engineering and IT organization, they can be a great source of knowledge and can be helpful for routing issues to the right people and teams.

Optimizing on-call rotations & processes

More often than not, site reliability engineers will need to take on-call responsibilities. At most organizations, the SRE role will have a lot of say in how the team can improve system reliability through the optimization of on-call processes.
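
One way to make on-call alerting concrete is to compute the four golden signals listed above for a window of traffic and flag any that cross a threshold. The following is a minimal Python sketch under stated assumptions: the `Request` record, the `golden_signals` and `breached` helpers, the threshold values, and the choice of a single utilization number as the saturation measure are all hypothetical illustrations, not part of any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float  # time taken to serve the request
    ok: bool           # False if the request failed


def golden_signals(requests: List[Request], window_s: float,
                   saturation: float) -> Dict[str, float]:
    """Summarize one time window into the four golden signals.

    `saturation` is supplied by the caller (for example, a CPU or
    queue utilization between 0 and 1); this sketch does not measure it.
    """
    if not requests:
        return {"latency_p95_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": saturation}
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math for ceil(0.95 * n).
    rank = (95 * len(latencies) + 99) // 100
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": latencies[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        "saturation": saturation,
    }


def breached(signals: Dict[str, float],
             thresholds: Dict[str, float]) -> List[str]:
    """Return the names of signals that exceed their threshold, sorted."""
    return sorted(name for name, limit in thresholds.items()
                  if signals[name] > limit)


if __name__ == "__main__":
    # Illustrative data only: 20 requests in a 10-second window, 2 failures.
    window = [Request(latency_ms=10.0 * i, ok=(i > 2)) for i in range(1, 21)]
    signals = golden_signals(window, window_s=10.0, saturation=0.5)
    limits = {"latency_p95_ms": 150.0, "error_rate": 0.05, "saturation": 0.8}
    print(signals)
    print("page on-call for:", breached(signals, limits))
```

Keeping the thresholds in a plain dictionary is a deliberate simplification: it lets an on-call team review and tune the limits in one place, which is the kind of process optimization this section describes.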

SRE teams will help add automation and context to alerts, leading to better real-time collaborative response from on-call responders. Additionally, site reliability engineers can update runbooks, tools and documentation to help prepare on-call teams for future incidents.

Documenting “tribal” knowledge

SRE teams gain exposure to systems in both staging and production, as well as all technical teams. They take part in work with software development, support, IT operations and on-call duties, meaning they build up a great amount of historical knowledge over time. Instead of siloing this knowledge into the mind of one team or one person, site reliability engineers can be tasked with documenting much of what they know. Constant upkeep of documentation and runbooks can ensure that teams get the information they need right when they need it.

Conducting post-incident reviews

Without thorough post-incident reviews, you have no way to identify what’s working and what’s not. SRE teams need to keep teams honest and ensure that everyone, software developers and IT professionals alike, is conducting post-incident reviews, documenting their findings and taking action on their learnings.

Then, site reliability engineers are often tasked with action items for building or optimizing some part of the SDLC or incident lifecycle to bolster the reliability of their service.

Where does SRE fit on your team?

Site reliability engineering roles and responsibilities are crucial to the continuous improvement of people, processes and technology within any organization. Whether your team has already taken on a full-blown DevOps culture or you’re still attempting to make the transition, SRE offers numerous benefits to speed and reliability.

SRE fits right at the crossroads of IT operations, support and software engineering. SRE serves as the perfect blend of skills to tighten the relationship between IT and developers, leading to shorter feedback loops, better collaboration and more reliable software.

Ready for an SRE approach? Learn how to hire the best candidates with these SRE interview questions.

Pros & cons of being a Site Reliability Engineer

The survey in Catchpoint’s 2021 SRE Report indicated that site reliability engineers were some of the happiest employees in software development and IT. While SREs can’t spend all their time building new features for customers, they’re constantly making an impact on customer experience. In fact, if you’re looking for a role designed to help customers the most, then SRE is it.

Site reliability engineering not only improves the lives of customers but, when done right, improves the lives of:
- On-call teams
- IT professionals
- Software developers

SRE can be one of the most fulfilling roles for a software engineer. It can help you better understand the struggles of IT and support, making you a better developer going forward. For more support, explore these DevOps & SRE conferences.

This posting does not necessarily represent Splunk's position, strategies or opinion.

Posted by Stephen Watts

Stephen Watts works in growth marketing at Splunk. Stephen holds a degree in Philosophy from Auburn University and is an MSIS candidate at UC Denver. He contributes to a variety of publications including CIO.com, Search Engine Journal, ITSM.Tools, IT Chronicles, DZone, and CompTIA.

Further reading
- 1. Site Reliability Engineering: Google's Way of Systems Management (Site Reliability ...
Site Reliability Engineering: Google's Way of Systems Management (Site Reliability Engineering: How Google Runs Production Systems) (SRE). Be...
- 2. Recommended: Site Reliability Engineering (SRE)
SRE stands for Site Reliability Engineering, a systems-management practice and guiding philosophy promoted by Google; the term also refers to a software engineer (Software ...
- 3. What is SRE? The relationship between operations management and SRE - Cloud Ace Tech Blog
What is SRE? SRE is short for Site Reliability Engineering. According to Benjamin Treynor Sloss, who introduced the SRE concept, SRE is asking a software engineer how to...
- 4. [Article translation] Are you looking for SRE or DevOps? - Medium
SRE is a DevOps (a banana is a kind of fruit) ... Site Reliability Engineering (SRE) and DevOps are currently very popular development and operations cultures, and they are highly similar.