Site reliability engineering - Wikipedia


From Wikipedia, the free encyclopedia

Site reliability engineering (SRE) is a set of principles and practices[1] that incorporates aspects of software engineering and applies them to infrastructure and operations problems.[2] The main goals are to create scalable and highly reliable software systems.[2] Site reliability engineering is closely related to DevOps, a set of practices that combines software development and IT operations, and SRE has also been described as a specific implementation of DevOps.[2][3]

History

The field of site reliability engineering originated at Google with Ben Treynor Sloss,[4][5] who founded a site reliability team after joining the company in 2003.[6] In 2016, Google employed more than 1,000 site reliability engineers.[7] The concept later spread into the broader software development industry, and other companies subsequently began to employ site reliability engineers.[8] The position is more common at larger web companies, as small companies often don't operate at a scale that would require dedicated SREs.[8] Organizations that have adopted the concept include Airbnb, Dropbox, IBM,[9] LinkedIn, Netflix,[7] and Wikimedia.[10] According to a 2021 report by the DevOps Institute, 22% of organizations in a survey of 2,000 respondents had adopted the SRE model.[11][12]

Definition
Site reliability engineering, as a job role, may be performed by solo practitioners or by teams. Within a broader engineering organization, they are usually responsible for a combination of system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.[13] Site reliability engineers often have backgrounds in software engineering, system engineering, or system administration.[14] Focuses of site reliability engineering include automation, system design, and improvements to system resilience.[14]

Site reliability engineering, as a set of principles and practices, can be performed by anyone. SRE is similar to security engineering in that everyone is expected to contribute to good practice, but a company may eventually decide to staff specialists for the job. Just as companies may hire security engineers to secure internet systems, they may hire SREs to define and ensure their reliability goals.

Site reliability engineering has also been described as a specific implementation of DevOps,[2][3] but it focuses specifically on building reliable systems, whereas DevOps is more broadly focused on infrastructure.[2]

Stephen Gossett wrote in Built In that some companies have rebranded their operations teams as SRE teams with little meaningful change.[8] The same is perceived to be true of operations teams rebranded as DevOps teams.

Principles and practices

There have been multiple attempts to define a canonical list of site reliability engineering principles.[15][16] Consensus is lacking, but most such definitions include the following characteristics:

- Automation or elimination of anything repetitive, where doing so is cost-effective.
- Avoidance of pursuing much more reliability than is strictly necessary. Defining what is necessary is a practice in itself (see the list of practices below).
- Systems design with a bias toward reduction of risks to availability, latency, and efficiency.
- Observability, meaning the ability to ask arbitrary questions about a system without having to know ahead of time what you wanted to ask.[17]

Site reliability engineering practices also vary widely, but the following are commonly implemented, at least in part:

- Toil management, as the implementation of the first principle outlined above.
- Defining and measuring reliability goals: service level indicators (SLIs), service level objectives (SLOs), and error budgets.
- Non-Abstract Large Scale Systems Design (NALSD) with a focus on reliability.
- Designing for and implementing observability.
- Defining, testing, and running an incident management process.
- Capacity planning.
- Change and release management, including CI/CD.
- Chaos engineering.

Implementations

Site reliability engineering teams engage with the other teams within their companies, and apply the SRE principles and practices, in various forms. The following is a high-level overview of common SRE team implementations:[18]

Kitchen Sink, a.k.a. “Everything SRE”

The scope of services or workflows covered is usually unbounded.

Infrastructure

Focuses on the reliability of behind-the-scenes systems that help make other teams' jobs more efficient. These teams are often confused with "Platform" teams or "Platform Operations" teams. Infrastructure SRE teams may pair up with one or more platform engineering teams, but they differ in that Infrastructure SRE teams focus on performing most, if not all, of the work described in the principles and practices list above. Platform teams tend to focus on building the platform, and while reliability is desirable, it is not their sole priority.

Tools

Focuses on tools to measure, maintain, and improve system reliability. For example, Nagios Core.

Product or application

An SRE team dedicated to a product or application. Some large companies tend to staff several of these.

Embedded

Usually SRE solo practitioners or pairs staffed within a software engineering team to apply most of the principles and practices described above.
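The practices list under Principles and practices names SLIs, SLOs, and error budgets without showing how they relate. As a rough illustration only, the Python sketch below assumes a request-based availability SLI and treats the error budget as the share of requests that may fail while still meeting the SLO. The names `Slo`, `sli`, and `error_budget_remaining`, and every figure used, are hypothetical and not taken from any particular SRE team or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical availability objective, e.g. 0.999 for 99.9% success."""

    target: float

    def __post_init__(self) -> None:
        # A target of exactly 1.0 would leave no error budget to divide by.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: fraction of events that were good."""
    if total_events <= 0 or not 0 <= good_events <= total_events:
        raise ValueError("need total_events > 0 and 0 <= good_events <= total_events")
    return good_events / total_events


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed_failure_rate = 1.0 - slo.target
    actual_failure_rate = 1.0 - sli(good_events, total_events)
    return 1.0 - actual_failure_rate / allowed_failure_rate


if __name__ == "__main__":
    # Assumed numbers: one million requests in the window, 400 of them failed.
    slo = Slo(target=0.999)
    print(f"SLI: {sli(999_600, 1_000_000):.4%}")
    print(f"Budget remaining: {error_budget_remaining(slo, 999_600, 1_000_000):.1%}")
```

Under these assumptions, a 99.9% target over one million requests allows 1,000 failures, so 400 observed failures leave roughly 60% of the budget. A negative result would indicate that the objective has been missed for the window.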
Consulting

Consults on how to implement SRE principles and practices. These are usually experienced SREs who have worked on teams in one or several of the implementations above. SREs on external-facing consulting SRE teams are often called "Customer Reliability Engineers". They rarely, if ever, change a customer's configuration or code.

Large companies that have adopted SRE tend to have a combination of the implementations described above, including multiple teams of the same implementation. For example, a company might run multiple product/application SRE teams to meet the specific demands of several products, alongside an Infrastructure SRE team that pairs up with a platform engineering group to meet the reliability goals of a platform common to those products and applications.

Industry

The USENIX organization has held an annual SREcon conference since 2014 for site reliability engineers in the industry, and also holds regional conferences with similar themes.[19]

See also

- Chaos engineering
- Cloud computing
- Data center
- Disaster recovery
- High availability software
- Infrastructure as code
- Operations, administration and management
- Operations management
- Reliability engineering
- System administration

References

1. "Evaluating where your team lies on the SRE spectrum". Google Cloud Blog. Retrieved June 26, 2021.
2. Beyer, Betsy; Jones, Chris; Petoff, Jennifer; Murphy, Niall, eds. (2016). Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-5118-7. OCLC 945577030.
3. Vargo, Seth; Fong-Jones, Liz (March 1, 2018). What's the Difference Between DevOps and SRE? (class SRE implements DevOps) (Video). Google.
4. Hill, Patrick. "Love DevOps? Wait until you meet SRE". Atlassian. Retrieved June 17, 2021.
5. "What is SRE?". Red Hat. Retrieved June 17, 2021.
6. Treynor, Ben (2014). "Keys to SRE". USENIX SREcon14. Retrieved June 17, 2021.
7. Fischer, Donald (March 2, 2016). "Are site reliability engineers the next data scientists?". TechCrunch. Retrieved June 17, 2021.
8. Gossett, Stephen (June 1, 2020). "What Is a Site Reliability Engineer? What Does an SRE Do?". Built In. Retrieved June 17, 2021.
9. "Site Reliability Engineering". IBM Cloud Education. IBM. November 12, 2020. Retrieved June 21, 2021.
10. "SRE - Wikitech". wikitech.wikimedia.org. Retrieved October 17, 2021.
11. Oehrlich, Eveline; Groll, Jayne; Garbani, Jean-Pierre (2021). Upskilling 2021 Enterprise DevOps Skills Report (PDF) (Report). DevOps Institute. Retrieved June 17, 2021.
12. Oehrlich, Eveline (May 4, 2021). "What it takes to be a site reliability engineer". TechBeacon. Micro Focus. Retrieved June 17, 2021.
13. Treynor, Ben. "In Conversation" (Interview). Interviewed by Niall Murphy. Google Site Reliability Engineering.
14. Jones, Chris; Underwood, Todd; Nukala, Shylaja (June 2015). "Hiring Site Reliability Engineers" (PDF). ;login:. Vol. 40, no. 3. pp. 35–39. Retrieved June 17, 2021.
15. "The 7 SRE Principles [And How to Put Them Into Practice]". www.blameless.com. Retrieved June 26, 2021.
16. "Evaluating where your team lies on the SRE spectrum". Google Cloud Blog. Retrieved June 26, 2021.
17. "Learn about observability | Honeycomb". docs.honeycomb.io. Retrieved June 26, 2021.
18. "SRE at Google: How to structure your SRE team". Google Cloud Blog. Retrieved June 26, 2021.
19. "Usenix SREcon". USENIX. 2021. Retrieved June 17, 2021.

Further reading

- Limoncelli, Tom; Chalup, Strata R.; Hogan, Christina J. (September 2014). The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services. Vol. 2. Upper Saddle River, NJ: Addison-Wesley. ISBN 978-0133478549. OCLC 891786231.
- Beyer, Betsy; Petoff, Jennifer; Murphy, Niall; Jones, Chris (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly. ISBN 978-1491929124.
- Blank-Edelman, David N., ed. (2018). Seeking SRE: Conversations About Running Production Systems at Scale (1 ed.). Sebastopol, CA: O'Reilly. ISBN 978-1491978863. OCLC 1052565720.
- Beyer, Betsy; Murphy, Niall; Kawahara, Kent; Rensin, David; Thorne, Stephen (2018). The Site Reliability Workbook: Practical Ways to Implement SRE. O'Reilly. ISBN 978-1492029502.
- Welch, Nat (2018). Real-World SRE: The Survival Guide for Responding to a System Outage and Maximizing Uptime. Packt. ISBN 978-1788628884.
- Adkins, Heather; Oprea, Ana; Blankinship, Paul; Lewandowski, Piotr; Stubblefield, Adam; Beyer, Betsy (2020). Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems. O'Reilly. ISBN 978-1492083122.
- Rosenthal, Casey; Jones, Nora (2020). Chaos Engineering: System Resiliency in Practice. O'Reilly. ISBN 978-1492043867.
External links

- Awesome Site Reliability Engineering: resources list
- How they SRE: resources list
- SRE Weekly: weekly newsletter devoted to SRE
- SRE at Google: landing page for learning more about SRE in Google
- Komodor K8s Reliability: learning center with resources for SREs working with Kubernetes

Retrieved from "https://en.wikipedia.org/w/index.php?title=Site_reliability_engineering&oldid=1096002463"