Site Reliability Engineering: How Google Runs ... - Goodreads


Site Reliability Engineering: How Google Runs Production Systems
by Betsy Beyer (Editor), Chris Jones (Editor), Jennifer Petoff (Editor), Niall Richard Murphy (Editor)

4.23 · 2,284 ratings · 229 reviews
The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient -- lessons directly applicable to your organization.

This book is divided into four sections:

Introduction -- Learn what site reliability engineering is and why it differs from conventional IT industry practices
Principles -- Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
Practices -- Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems
Management -- Explore Google's best practices for training, communication, and meetings that your organization can use

Paperback, 552 pages
Published April 26th 2016 by O'Reilly Media (first published April 16th 2016)
Original Title: Site Reliability Engineering: How Google Runs Production Systems
ISBN: 149192912X (ISBN13: 9781491929124)
Edition Language: English

Reader Q&A

Is it the same as "https://landing.google.com/sre/book/index.html"?
Hampus Wessman: Yes. The free version on the website was released later. It's the same book.

Lists with This Book

DevOps Reading List: 88 books, 135 voters
DevOps must-to-read list: 37 books, 24 voters

Community Reviews

Apr 04, 2016 · Simon Eskildsen · rated it: liked it · review of another edition
Much of the information on running production systems effectively from Google has been extremely important to how I have changed my thinking about the SRE role over the years. Finally, there's one piece that has all of what was previously something you'd had to look long and hard for in various talks, papers and abstracts: error budgets, the SRE role definition, scaling, etc. That said, this book suffers a classic problem from having too many authors write independent chapters. Much is repeated, and each chapter stands too much on its own, building from first principles each time instead of leveraging the rest of the book. This makes the book much longer than it needs to be. Furthermore, it tries to be both technical and non-technical; this confuses the narrative of the book, and it ends up not excelling at either of them. I would love to see two books: SRE the technical parts, and SRE the non-technical parts. Overall, this book is still a goldmine of information worth a 5/5, but it is exactly that: a goldmine that you'll have to put a fair amount of effort into dissecting to retrieve the most value from, because the book's structure doesn't hand it to you. That's why we land at a 3/5. When recommending this book to coworkers, which I will, it will be chapters from the book, not the book at large.

Apr 25, 2016 · Mircea · rated it: it was ok

Boring as F. The main message is: oh look at us, we have super hard problems and like saying 99.999% a lot. And oh yeah... SREs are developers. We don't spend more than 50% on "toil" work. Pleeeease. Book has some interesting stories and if you are good at reading between the lines you might learn something. Everything else is BS. Does every chapter need to start by telling us who edited the chapter? I don't give a f. The book also seems to be the product of multiple individuals (a lot of them actually) whose sole connection is that they wrote a chapter for this book. F the reader, F structure, F focusing on the core of the issue. Let's just dump a stream of consciousness kind of junk and after that tell everyone how hard it is and how we care about work life balance. Again, boring and in general you're gonna waste your time reading this (unless you want to know what Borg, Chubby and Bigtable are).

Apr 23, 2016 · Michael Scott · rated it: liked it · Shelves: compsci-tech
Site Reliability Engineering, or Google's claim to fame re: technology and concepts developed more than a decade ago by the grid computing community, is a collection of essays on the design and operation of large-scale datacenters, with the goal of making them simultaneously scalable, robust, and efficient. Overall, despite (willing?) ignorance of the history of distributed systems and in particular (grid) datacenter technology, this is an excellent book that teaches us how Google thinks (or used to think, a few years back) about its datacenters. If you're interested in this topic, you have to read this book. Period.

Structure

The book is divided into four main parts, each comprised of several essays. Each essay is authored by what I assume is a Google engineer, and edited by one of Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. (I just hope that what I didn't like about the book can be attributed to the editors, because I really didn't like some stuff in here.)

In Part I, Introduction, the authors introduce Google's Site Reliability Engineering (SRE) approach to managing global-scale IT services running in datacenters spread across the entire world. (Truly impressive achievement, no doubt about it!) After a discussion about how SRE is different from DevOps (another hot term of the day), this part introduces the core elements and requirements of SRE, which include the traditional Service Level Objectives (SLOs) and Service Level Agreements (SLAs), management of changing services and requirements, demand forecasting and capacity, provisioning and allocation, etc. Through a simple service, Shakespeare, the authors introduce the core concepts of running a workflow, which is essentially a collection of IT tasks that have inter-dependencies, in the datacenter.

In Part II, Principles, the book focuses on operational and reliability risks, SLO and SLA management, the notion of toil (mundane work that scales linearly (why not super-linearly as well?!?!) with services, yet can be automated) and the need to eliminate it (through automation), how to monitor the complex system that is a datacenter, a process for automation as seen at Google, the notion of engineering releases, and, last, an essay on the need for simplicity. This rather disparate collection of notions is very useful, explained for the laymen but still with enough technical content to be interesting even for the expert (practitioner or academic).

In Parts III and IV, Practices and Management, respectively, the book discusses a variety of topics, from time-series analysis for anomaly detection, to the practice and management of people on-call, to various ways to prevent and address incidents occurring in the datacenter, to postmortems and root-cause analysis that could help prevent future disasters, to testing for reliability (a notoriously difficult issue), to software engineering in the SRE team, to load-balancing and overload management (resource management and scheduling 101), communication between SRE engs, etc. etc. etc., until the predictable call for everyone to use SRE as early as possible and as often as possible. Overall, palatable material, but spread too thin and with too much overlap with prior related work of a decade ago, especially academic, and not much new insight.

What I liked

I especially liked Part II, which in my view is one of the best introductions to datacenter management available today to the students of this and related topics (e.g., applied distributed systems, cloud computing, grid computing, etc.)

Some of the topics addressed, such as risk and team practices, are rather new for many in the business. I liked the approach proposed in this book, which seemed to me above and beyond the current state-of-the-art.

Topics in reliability (correlated failures, root-cause analysis) and scheduling (overload management, load balancing, architectural issues, etc.) are currently open in both practice and academia, and this book emphasizes in my view the dearth of good solutions but for the simplest of problems.

Many of the issues related to automated monitoring and incident detection could lead in the future to better technology and much innovation, so I liked the prominence given to these topics in this book.

What I didn't like

I thoroughly disliked the statements claiming by omission that Google has invented most of the concepts presented in the book, which of course in the academic world would have been promptly sent to the reject pile. As an anecdote, consider the sentence: "Ben Treynor Sloss, Google's VP for 24/7 Operations, originator of the term SRE, claims that reliability is the most fundamental feature of any product: a system isn't very useful if nobody can use it!" I'll skip the discussion about who is the originator of the term SRE, and focus on the meat of this statement. By omission, it makes the reader think that Google, through its Ben Treynor Sloss, is the first to understand the importance of reliability for datacenter-related systems. In fact, this has been long-known in the grid computing community. I found in just a few minutes explicit references from Geoffrey Fox (in 2005, on page 317 of yet another grid computing anthology, "service considers reliable delivery to be more important than timely delivery"), Alexandru Iosup (in 2007, on page 5 of this presentation, and again in 2009, in this course, "In today's grids, reliability is more important than performance!"). Of course, this notion has been explored for the general case of services much earlier... anyone familiar with air and especially space flight? The list of concepts actually not invented at Google, but about which the book implies the contrary, goes on and on...

I also did not like some of the exaggerated claims of having found solutions for the general problems. Much remains to be done, as hiring at Google in these areas continues unabated. (There's also something called computer science, whose state-of-the-art indicates the same.)

Jul 26, 2017 · Dimitrios Zorbas · rated it: it was amazing · Shelves: devops
I have so many bookmarks in this book and consider it an invaluable read. While not every project/company needs to operate at Google scale, it helps streamline the process of defining SLOs/SLAs for the occasion and establishing communication channels and practices to achieve them. It helped me wrap my head around concepts for which I used to rely on intuition. I've shaped processes and created template documents (postmortem / launch coordination checklist) for work based on this book.

Apr 16, 2016 · Sebastian Gebski · rated it: really liked it

Very uneven. Exactly what you should expect of a book in which each chapter is a separate essay written by a separate group of people :)

Chapters can be grouped into the following categories:
a* solid knowledge, not really fascinating, but useful, some Google inside stories
b* fairly solid knowledge, boring due to massive repetitions or being too general
c* exciting stuff that is useless for you, because you're not Google (but still, it's exciting ;>)
d* exciting stuff that you actually may use outside of Google, sometimes with neat war stories

Sadly, it's more b than a & more c than d. But it doesn't change my opinion that this book is actually worth reading: it's one of the few books on the topic, it's based on the actual engineering perspective of a very interesting company that operates at a massive scale, and it's massively influenced by this organization's culture. Even typical Software Engineers (especially junior ones) should read it to learn that software delivery & maintenance is so much more than just simple development.

One last remark to conclude: sorry if I made a false impression, but this is NOT a technical book. It's far more about processes, communication, attitude & mindset than actual technology running under the hood.

Mar 03, 2017 · Michael Koltsov · rated it: it was amazing

I don't normally buy paper books, which means that in the course of the last few years I've bought only one paper book even though I've read hundreds of books during that period of time. This book is the second one I've bought so far, which means a lot to me. Not to mention that Google is providing it on the Internet free of charge.

For me, personally, this book is a basis on which a lot of my past assumptions could be argued as viable solutions at the scale of Google. This book is not revealing any of Google's secrets (do they really have any secrets?). But it's a great start even if you don't need the scale of Google but want to write robust and failure-resilient apps.

Technical solutions, dealing with user-facing issues, finding peers, on-call support, post-mortems, incident-tracking systems: this book has it all, though as chapters have been written by different people, some aspects are more emphasized than others. I wish some of the chapters had more gory production-based details than they do now.

My score is 5/5

Sep 04, 2016 · Alexander Yakushev · rated it: really liked it · Shelves: management, software-engineering
This book is great on multiple levels. First of all, it packs great content: the detailed explanation of how and why Google has internally established what we now call "the DevOps culture." Rationale coupled together with a hands-on implementation guide provides incredible insight into creating and running an SRE team in your own company. The text quality is top-notch, the book is written with clarity in mind and thoroughly edited. I'd rate the content itself at four stars. But the book deserves the fifth star because it is a superb example of a material that gives you the precise understanding of how some company (or its division) operates inside. Apparently, Google can afford to expose such secrets while not many other companies can, but we need more low-BS to-the-point books like this to share and exchange the experience of running the most complex systems (that is, human organizations) efficiently.

Dec 23, 2019 · Regis Hattori · rated it: really liked it · Shelves: infrastructure, devops, sre

This book is divided into five parts: Introduction, Principles, Practices, Management, and Conclusions. I see a lot of value in the first two parts for any people involved in software development. It convinces us about the importance of the subject with very good arguments, no matter if you are a software engineer, a product manager or even a user. This part deserves 5 stars.

After some chapters of the Practices part, the conclusion I made is that this part of the book may only be useful if you are facing a specific problem and are looking for some insights, but not to read end-to-end. Some examples are too specific for Google or similar companies that do not have the same budget, skills, and prerequisites.

In general, 3 stars is fair, but I will rate it as 4 because I really liked the first 2 parts.

Jun 12, 2016 · James Stewart · rated it: it was ok

Loads of interesting ideas and thoughts, but a bit of a slog to get through. The approach of having different members of the team write different sections probably worked really well for engaging everyone, but it made for quite a bit of repetition. It also ends up feeling like a few books rolled into one, with one on distributed systems design, another on SRE culture and practices, and maybe another on management.

Mar 18, 2017 · Alex Palcuie · rated it: it was amazing · Shelves: favorites

I think this is the best engineering book in the last decade.

Sep 03, 2021 · Vlad Romanenko · rated it: really liked it · Shelves: tech, available-to-read

Very useful and fundamental work for the SRE discipline. Unsurprisingly, a chunk of the book is quite Google specific.

Sep 25, 2017 · Tomas Varaneckas · rated it: it was ok

This was a really hard read, in a bad sense. The first couple of dozen pages were really promising, but the book turned out to be an unnecessarily long, incredibly boring, repetitive and inconsistent gangbang of random blog posts and often trivial information. It has roughly 10% of valuable content, and would greatly benefit from being reduced to a 50-pager. In its current state it seems that it was a corporate collaborative ego-trip, to show potential employees how cool Google SRE is, and how majestic their scale happens to be. After reading this book, I am absolutely sure I would never ever want to work for Google.

Oct 19, 2016 · Chris · rated it: really liked it

There's a ton of great information here, and we refer to it regularly as we're trying to change the culture at work. I gave it a 4 instead of a 5 because it does suffer a little from the style (think collection of essays rather than a unified arc), but it's really worth reading even if it requires some care to transfer to more usual environments.

Aug 27, 2019 · Bjoern Rochel · rated it: really liked it · Shelves: 2018, 2019
A little disclaimer: My review here is more about the concept and organizational parts than the pure technical aspects, mostly because I manage engineering teams nowadays and these areas are the more important ones for me. This book also contains a lot of technical information on how to implement SRE that I would highly recommend for interested software engineers.

One aspect I liked in particular about SRE is the Error Budget concept, Google's way to manage the age-old conflict between product and engineering on how to distribute development efforts between non-functional requirements and especially technical debt on one side and new features on the other side. The data-driven approach and consequently the depersonalization of this debate seems very sane and professional to me.

I also liked their emphasis on training, simulation and careful on-boarding for SREs. For me this is still an area where the majority of the industry has plenty of room for improvement. Looking at what Google does here makes the rest of us look like f***ing amateurs.

Another thing that I'm almost guaranteed to steal is the idea of establishing a Production Readiness Review to ensure reliability of new products and features from multiple angles (design, security, capacity, etc.).

What I'm still trying to wrap my head around is whether having dedicated SRE teams is a good idea (in contrast to a you-build-it-you-run-it approach where every delivery team effectively owns the responsibility to reach the defined SLA/Os). A principle that I like a lot is to give engineers a lot of freedom but to also make them accountable for their decisions and the software they produce. Separating out production-fitness into a separate group/team sounds like it goes in the opposite direction. I can imagine that several factors play into this (standardization, active tech/stack management, skill availability, etc.) and certainly Google has carefully evolved it to where it is now, but my initial reaction to this idea was negative.

Overall a very good resource that I will come back to.

Nov 16, 2019 · Liviu Costea · rated it: it was amazing · Shelves: devops
A lot of food for thought, a book that became a reference in the field. The only problem is the wide coverage: you might find some chapters very niche, as not everybody cares how to build a layer 4 load balancer. Highly recommended if you are following devops approaches.

Dec 25, 2018 · Vít Listík · rated it: it was amazing

I like the fact that it is written by multiple authors. Everything stated in the book seems so obvious, but it is so sad to read it because it is not yet an industry standard. A must read for every SRE.

Dec 24, 2019 · Amir Sarabadani · rated it: liked it · Shelves: best-software-engineering-books

It's basically a looong advertisement for Google with some useful information inside, while it should be the other way around.

Dec 21, 2021 · Jonas Minelga · rated it: really liked it

Very long and detailed book. The information in it is extremely valuable, but I think Google is one of like 2-3 companies in the world where all of that can be used. I think for a broader audience it is too detailed in some parts, duplicates info in others, and is slightly difficult to read. But overall, the book provides a lot of amazing insights and many ideas.

Apr 03, 2017 · Ahmad hosseini · rated it: liked it · Shelves: software-engineering, programming

What is SRE?

Site Reliability Engineering (SRE) is Google's approach to service management. An SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).

Typical SRE activities fall into the following approximate categories:
• Software engineering: Involves writing or modifying code, in addition to any associated design and documentation work.
• System engineering: Involves configuring production systems, modifying configuration, or documenting systems in a way that produces lasting improvements from a one-time effort.
• Toil: Work directly tied to running a service that is repetitive, manual, etc.
• Overhead: Administrative work not tied directly to running a service.

Quotes

"Be warned that being an expert is more than understanding how a system is supposed to work. Expertise is gained by investigating why a system doesn't work." - Brian Redman

"Ways in which things go right are special cases of the ways in which things go wrong." - John Allspaw

About book

This book is a series of essays written by members and alumni of Google's Site Reliability Engineering organization. It's much more like conference proceedings than it is like a standard book by an author or a small number of authors. Each chapter is intended to be read as a part of a coherent whole, but a good deal can be gained by reading on whatever subject particularly interests you.

"Essential reading for anyone running highly available web services at scale." - Adrian Cockcroft, Battery Ventures, former Netflix Cloud Architect

Jul 15, 2018 · David · rated it: really liked it

The book seems largely to be a collection of essays written by disparate people within Google's SRE organization. It's as well-organized and coherent as that can be (and I think it's a good format for this -- far better than if they'd tried to create something with a more unified narrative). But it's very uneven: some chapters are terrific while some seem rather empty. I found the chapters on risk, load balancing, overload, distributed consensus, and (surprisingly) launches to be among the most useful. On the other hand, the chapter on simplicity was indeed simplistic, and the chapter on data integrity was (surprisingly) disappointing.

The good: there's a lot of excellent information in this book. It's a comprehensive, thoughtful overview for anybody entering the world of distributed systems, cloud infrastructure, or network services. Despite a few misgivings, I'm pretty on board with Google's approach to SRE. It's a very thoughtful approach to the problems of operating production services, covering topics ranging from time management, prioritization, and onboarding to all the technical challenges in distributed systems.

The bad: The book gets religious (about Google) at times, and some of it's pretty smug. This isn't a big deal, but it's likely to turn off people who've seen from experience how frustrating and unproductive it can be when good ideas about building systems become religion.

Oct 16, 2017 · Luke Amdor · rated it: really liked it

Some really great chapters, especially towards the beginning and the end. However, I feel like it could have been edited better. It meanders a lot.
flag 2likes · Like  · seereview Mar08,2020 Amr ratedit likedit Shelves: paused Thebookisgreatintermsofgettingmoreunderstandingofgoogle’sSREculture.ButIgottoaplacewhereitbecameirrelevanttometocontinuethebooksoIdecidedtodropit. flag 2likes · Like  · seereview Oct04,2018 SundarrajKaushik ratedit itwasamazing Awonderfulbooktolearnhowtomanagewebsitessothattheyarereliable.Somegoodrandomextractsfromthebook.SiteReliabilityEngineering1.Operationspersonnelshouldspend50%oftheirtimeinwritingautomationscriptsandprograms.2.thedecisiontostopreleasesfortheremainderofthequarteronceanerrorbudgetisdepleted3.anSREteamisresponsiblefortheavailability,latency,performance,efficiency,changemanagement,monitoring,emergencyresponse,andcapacityplanningo Awonderfulbooktolearnhowtomanagewebsitessothattheyarereliable.Somegoodrandomextractsfromthebook.SiteReliabilityEngineering1.Operationspersonnelshouldspend50%oftheirtimeinwritingautomationscriptsandprograms.2.thedecisiontostopreleasesfortheremainderofthequarteronceanerrorbudgetisdepleted3.anSREteamisresponsiblefortheavailability,latency,performance,efficiency,changemanagement,monitoring,emergencyresponse,andcapacityplanningoftheirservice(s).4.codifiedrulesofengagementandprinciplesforhowSREteamsinteractwiththeirenvironment—notonlytheproductionenvironment,butalsotheproductdevelopmentteams,thetestingteams,theusers,andsoon5.operatesunderablame-freepostmortemculture,withthegoalofexposingfaultsandapplyingengineeringtofixthesefaults,ratherthanavoidingorminimizingthem.6.Therearethreekindsofvalidmonitoringoutput:Alerts:Signifythatahumanneedstotakeactionimmediatelyinresponsetosomethingthatiseitherhappeningorabouttohappen,inordertoimprovethesituation.Tickets:Signifythatahumanneedstotakeaction,butnotimmediately.Thesystemcannotautomaticallyhandlethesituation,butifahumantakesactioninafewdays,nodamagewillresult.Logging:Nooneneedstolookatthisinformation,butitisrecordedfordiagnosticorforensicpurposes.Theexpectationisthatnoonereadslogsunlesssomethingelsepromptsthe
mtodoso.7.Resourceuseisafunctionofdemand(load),capacity,andsoftwareefficiency.SREspredictdemand,provisioncapacity,andcanmodifythesoftware.Thesethreefactorsarealargepart(thoughnottheentirety)ofaservice’sefficiency.SLI-ServiceLevelIndicator-Indicatorsusedtomeasurethehealthofaservice.UsedtodeterminetheSLOandSLA.SLO-ServiceLevelObjective-Theobjectivethatmustbemetbytheservice.SLA-ServiceLevelAgreement-TheAgreementwiththeclientwithrespecttotheservicesrenderedtothem.Don’toverachieveUsersbuildontherealityofwhatyouoffer,ratherthanwhatyousayyou’llsupply,particularlyforinfrastructureservices.Ifyourservice’sactualperformanceismuchbetterthanitsstatedSLO,userswillcometorelyonitscurrentperformance.Youcanavoidover-dependencebydeliberatelytakingthesystemofflineoccasionally(Google’sChubbyserviceintroducedplannedoutagesinresponsetobeingoverlyavailable),18throttlingsomerequests,ordesigningthesystemsothatitisn’tfasterunderlightloads."Ifahumanoperatorneedstotouchyoursystemduringnormaloperations,youhaveabug.Thedefinitionofnormalchangesasyoursystemsgrow."FourGoldenSignalsofMonitoringThefourgoldensignalsofmonitoringarelatency,traffic,errors,andsaturation.Ifyoucanonlymeasurefourmetricsofyouruser-facingsystem,focusonthesefour.Latency:Thetimeittakestoservicearequest.It’simportanttodistinguishbetweenthelatencyofsuccessfulrequestsandthelatencyoffailedrequests.Forexample,anHTTP500errortriggeredduetolossofconnectiontoadatabaseorothercriticalbackendmightbeservedveryquickly;however,asanHTTP500errorindicatesafailedrequest,factoring500sintoyouroveralllatencymightresultinmisleadingcalculations.Ontheotherhand,aslowerrorisevenworsethanafasterror!Therefore,it’simportanttotrackerrorlatency,asopposedtojustfilteringouterrors.Traffic:Ameasureofhowmuchdemandisbeingplacedonyoursystem,measuredinahigh-levelsystem-specificmetric.Forawebservice,thismeasurementisusuallyHTTPrequestspersecond,perhapsbrokenoutbythenatureoftherequests(e.g.,staticversusdynamiccontent).Foranaudiostreamingsystem,thismeasurementmightfocuson
network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.

Errors: The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error"). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.

Saturation: How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential. In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., "Give me a nonce" or "I need a globally unique monotonic integer") that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation. Finally, saturation is also concerned with predictions of impending saturation, such as "It looks like your database will fill its hard drive in 4 hours."

If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring.

Why is it important to have control over the software that one is using? Why and when does it make sense to roll out one's own framework and/or platform?
Another argument in favor of automation, particularly in the case of Google, is our complicated yet surprisingly uniform production environment, described in The Production Environment at Google, from the Viewpoint of an SRE. While other organizations might have an important piece of equipment without a readily accessible API, software for which no source code is available, or another impediment to complete control over production operations, Google generally avoids such scenarios. We have built APIs for systems when no API was available from the vendor. Even though purchasing software for a particular task would have been much cheaper in the short term, we chose to write our own solutions, because doing so produced APIs with the potential for much greater long-term benefits. We spent a lot of time overcoming obstacles to automatic system management, and then resolutely developed that automatic system management itself. Given how Google manages its source code, the availability of that code for more or less any system that SRE touches also means that our mission to “own the product in production” is much easier because we control the entirety of the stack.

When developed in-house, the platform/framework can be designed to manage any failures automatically. No external observer is required to manage this.
One of the negatives of automation is that humans forget how to do a task when it is required. This may not always be good.

Google cherry-picks features for release. Should we do the same?
"All code is checked into the main branch of the source code tree (mainline). However, most major projects don’t release directly from the mainline. Instead, we branch from the mainline at a specific revision and never merge changes from the branch back into the mainline. Bug fixes are submitted to the mainline and then cherry picked into the branch for inclusion in the release. This practice avoids inadvertently picking up unrelated changes submitted to the mainline since the original build occurred. Using this branch and cherry pick method, we know the exact contents of each release."
Note that the cherry picking is into specific release branches, not of changes within a specific branch.

Surprises vs. boring
"Unlike
just about everything else in life, "boring" is actually a positive attribute when it comes to software! We don’t want our programs to be spontaneous and interesting; we want them to stick to the script and predictably accomplish their business goals. In the words of Google engineer Robert Muth, "Unlike a detective story, the lack of excitement, suspense, and puzzles is actually a desirable property of source code." Surprises in production are the nemeses of SRE."

Commenting or flagging code
"Because engineers are human beings who often form an emotional attachment to their creations, confrontations over large-scale purges of the source tree are not uncommon. Some might protest, "What if we need that code later?" "Why don’t we just comment the code out so we can easily add it again later?" or "Why don’t we gate the code with a flag instead of deleting it?" These are all terrible suggestions. Source control systems make it easy to reverse changes, whereas hundreds of lines of commented code create distractions and confusion (especially as the source files continue to evolve), and code that is never executed, gated by a flag that is always disabled, is a metaphorical time bomb waiting to explode, as painfully experienced by Knight Capital, for example (see "Order In the Matter of Knight Capital Americas LLC" [Sec13])."

Writing blameless RCA
Pointing fingers: "We need to rewrite the entire complicated backend system! It’s been breaking weekly for the last three quarters and I’m sure we’re all tired of fixing things onesy-twosy. Seriously, if I get paged one more time I’ll rewrite it myself…"
Blameless: "An action item to rewrite the entire backend system might actually prevent these annoying pages from continuing to happen, and the maintenance manual for this version is quite long and really difficult to be fully trained up on. I’m sure our future on-callers will thank us!"

Establishing a strong testing culture
One way to establish a strong testing culture is to start documenting all reported bugs as test cases. If every bug is converted into a test, each test is supposed to initially fail because the bug hasn’t yet been fixed. As engineers fix the bugs, the software passes testing and you’re on the road to developing a comprehensive regression test suite.

Project Vs. Support
Dedicated, noninterrupted, project work time is essential to any software development effort. Dedicated project time is necessary to enable progress on a project, because it’s nearly impossible to write code (much less to concentrate on larger, more impactful projects) when you’re thrashing between several tasks in the course of an hour. Therefore, the ability to work on a software project without interrupts is often an attractive reason for engineers to begin working on a development project. Such time must be aggressively defended.

Managing Loads
Round Robin Vs. Weighted Round Robin (Round Robin, but taking into consideration the number of tasks pending at the server)
Overload of the system has to be avoided through load testing. If the system is overloaded despite this, any retries have to be well controlled. A retry at a higher level can cascade into retries at the lower levels. Use jittered retries (retry at random intervals) and exponential retries (exponentially increase the time between the retries), and fail quickly to prevent further load on the already overloaded system.
If queuing is used to prevent overloading a server, FIFO may sometimes not be a good option, as the user waiting for the tasks at the head of the queue may have left the system, no longer expecting a response.
If a task is split into multiple pipelined tasks, it is good to check at each stage whether there is sufficient time left to perform the rest of the tasks, based on the expected time the remaining tasks in the pipeline will take. Implement deadline propagation.

Safeguarding the data
Three levels of guard against data loss
1. Soft Delete (Visible to user in the recycle bin)
2. Backup (incremental and full) before actual deletion, and test the ability to restore. Replicate live and backed-up data.
3. Purge data (Can be recovered only from backup now)
Out-of-band data validation to prevent surprising data loss.
Important to
1. Continuously test the recovery process as part of your normal operations
2. Set up alerts that fire when a recovery process fails to provide a heartbeat indication of its success

Launch Coordination Checklist
This is Google’s original Launch Coordination Checklist, circa 2005, slightly abridged for brevity:
1. Architecture: Architecture sketch, types of servers, types of requests from clients
2. Programmatic client requests
3. Machines and datacenters
4. Machines and bandwidth, datacenters, N+2 redundancy, network QoS
5. New domain names, DNS load balancing
6. Volume estimates, capacity, and performance
7. HTTP traffic and bandwidth estimates, launch “spike,” traffic mix, 6 months out
8. Load test, end-to-end test, capacity per datacenter at max latency
9. Impact on other services we care most about
10. Storage capacity
11. System reliability and failover
    What happens when:
    Machine dies, rack fails, or cluster goes offline
    Network fails between two datacenters
    For each type of server that talks to other servers (its backends):
    How to detect when backends die, and what to do when they die
    How to terminate or restart without affecting clients or users
    Load balancing, rate-limiting, timeout, retry and error handling behavior
    Data backup/restore, disaster recovery
12. Monitoring and server management
    Monitoring internal state, monitoring end-to-end behavior, managing alerts
    Monitoring the monitoring
    Financially important alerts and logs
    Tips for running servers within cluster environment
    Don’t crash mail servers by sending yourself email alerts in your own server code
13. Security
    Security design review, security code audit, spam risk, authentication, SSL
    Prelaunch visibility/access control, various types of blacklists
14. Automation and manual tasks
    Methods and change control to update servers, data, and configs
    Release process, repeatable builds, canaries under live traffic, staged rollouts
15. Growth issues
    Spare capacity, 10x growth, growth alerts
    Scalability bottlenecks, linear scaling, scaling with hardware, changes needed
    Caching, data sharding/resharding
16. External dependencies
    Third-party systems, monitoring, networking, traffic volume, launch spikes
    Graceful degradation, how to avoid accidentally overrunning third-party services
    Playing nice with syndicated partners, mail systems, services within Google
17. Schedule and rollout planning
    Hard deadlines, external events, Mondays or Fridays
    Standard operating procedures for this service, for other services

As mentioned, you might encounter responses such as "Why me?" This response is especially likely when a team believes that the postmortem process is retaliatory. This attitude comes from subscribing to the Bad Apple Theory: the system is working fine, and if we get rid of all the bad apples and their mistakes, the system will continue to be fine. The
Bad Apple Theory is demonstrably false, as shown by evidence [Dek14] from several disciplines, including airline safety. You should point out this falsity. The most effective phrasing for a postmortem is to say, "Mistakes are inevitable in any system with multiple subtle interactions. You were on-call, and I trust you to make the right decisions with the right information. I'd like you to write down what you were thinking at each point in time, so that we can find out where the system misled you, and where the cognitive demands were too high."

"The best designs and the best implementations result from the joint concerns of production and the product being met in an atmosphere of mutual respect."

Postmortem Culture
Corrective and preventative action (CAPA) is a well-known concept for improving reliability that focuses on the systematic investigation of root causes of identified issues or risks in order to prevent recurrence. This principle is embodied by SRE's strong culture of blameless postmortems. When something goes wrong (and given the scale, complexity, and rapid rate of change at Google, something inevitably will go wrong), it's important to evaluate all of the following:
What happened
The effectiveness of the response
What we would do differently next time
What actions will be taken to make sure a particular incident doesn't happen again
This exercise is undertaken without pointing fingers at any individual. Instead of assigning blame, it is far more important to figure out what went wrong, and how, as an organization, we will rally to ensure it doesn't happen again. Dwelling on who might have caused the outage is counterproductive. Postmortems are conducted after incidents and published across SRE teams so that all can benefit from the lessons learned.

Decisions should be informed rather than prescriptive, and are made without deference to personal opinions, even that of the most-senior person in the room, who Eric Schmidt and Jonathan Rosenberg dub the "HiPPO," for "Highest-Paid Person's Opinion"

Mar 13, 2021 Andrew Barchuk rated it "liked it"
While the book contains a lot of great advice, I have a very hard time recommending it. The chapters are all over the place, repeating each other, overlapping and varying in quality. Some are a definite 5 stars, and for some it’s hard to figure out what exactly the subject and the point are. Parts of the book are riddled with Google jargon; the only reason I was able to follow along was the fact that I worked at Google as a software engineer, which won’t be true for the majority of readers.

Jan 26, 2019 Scott Maclellan rated it "really liked it"
A fantastic and in-depth resource. Great for going deeper and maturing how a company builds and runs software at scale. Touches on the specific tactical actions your team can take to build more reliable products. The extended sections on culture slowed me down a lot, but have led to some very interesting conversations at work.

Jun 06, 2017 Tadas Talaikis rated it "liked it"
"Boring" (at least from the outside world perspective, ok with me), basically can be much shorter. Culture, automation of everything, load balancing, monitoring, like everywhere else, except maybe the Borg thing.

Apr 22, 2018 Luca rated it "liked it"
There’s interesting content for sure. But the writing isn’t engaging (the book is long so that becomes boring kinda fast) and some aspects of the Google culture are real creepy (best example: “humans are imperfect machines” while talking about people management...)

Jun 12, 2020 Mengyi rated it "it was amazing"
This is a complete collection of everything about building the SRE team, from their practices to how to onboard a new SRE to the team. I am personally really inspired by the concept of the error budget and the share-by-default culture fostered by practices such as blameless postmortems.

Feb 05, 2019 David Robillard rated it "it was amazing"
A must read for anyone involved with online services.

Jan 14, 2018 Gary Boland rated it "liked it"
A useful checklist for production engineering is tarnished by the undercurrent of marketing/recruiting. Still deserves its place on the shelf if you deliver software for a living.

About Betsy Beyer
Betsy Beyer is a Technical Writer for Google in New York City specializing in Site Reliability Engineering. She has previously written documentation for Google’s Data Center and Hardware Operations Teams in Mountain View and across its globally distributed datacenters. Before moving to New York, Betsy was a lecturer on technical writing at Stanford University. En route to her current career, Betsy studied International Relations and English Literature, and holds degrees from Stanford and Tulane.

Quotes from Site Reliability...
“When a team must allocate a disproportionate amount of time to resolving tickets at the cost of spending time improving the service, scalability and reliability suffer.”
“team size should not scale directly with service growth.”


