Writing Runbook Documentation When You're An SRE
Tips and tricks for writing effective runbook documentation when you aren’t a technical writer

Taylor Barnett · Jan 30th, 2020

The sad reality is, no one actually wants to read your runbook documentation. Engineers who get paged while on-call want to mitigate and resolve an incident as fast as possible, and move on. Nonetheless, runbooks, sometimes called playbooks, are necessary. As The Site Reliability Workbook says, playbooks “reduce stress, the mean time to repair (MTTR), and the risk of human error.”

Often I have found that engineers don’t want to write documentation for two main reasons: There isn’t an incentive structure for doing the work, and they are unsure of how to write good documentation. Focusing on the latter, while no runbook will be “a substitute for smart engineers able to think on the fly,” as the Site Reliability Engineering: How Google Runs Production Systems book says, “clear and thorough troubleshooting steps and tips are valuable when responding to a high-stakes or time-sensitive page.”

Unlike the tons of content for engineers on how to write good code, there’s a lot less for documentation, especially runbooks. Whether your team is practicing DevOps or traditional IT operations, this blog post is focused on helping Site Reliability Engineers (SREs) and other engineers who are involved in on-call engineering create clearer and more effective runbook documentation.

Runbook Templates

Blank pages are no fun. Using a template can be beneficial because starting from a blank document is incredibly hard. A template gives you an outline to start out with. It’s guidance on how to get started, which is the hardest part when writing.

What template you use for your runbooks heavily depends on your team. It’s essential to get buy-in from the team; otherwise, the team won’t use it. If there’s a section that a majority of the team doesn’t find useful, they are unlikely to fill it out.

I’m not going to recommend one template to rule them all, but I can recommend some templates that will hopefully inspire you. I chose these because I feel they have the right balance of information. It’s easy for a template to grow very large and become daunting for any engineer to fill out. Check out these examples:

- "Why SRE Documents Matter" by Shylaja Nukala and Vivek Rau, ACM Queue November/December 2019 issue (scroll to the bottom for the templates)
- Runbook Template from "Tackling Alert Fatigue" by Caitie McCaffrey, Monitorama 2016
- "Building a Better Ops Runbook" by Shawn Stafford

The one thing I do recommend is that alert names map to the runbook name. This can be very helpful for making your runbooks discoverable. It can also help you evaluate the runbook coverage you have for your on-call team.

The Curse of Knowledge

The Curse of Knowledge is a cognitive bias that occurs when someone is communicating with others and unknowingly assumes that the people they are communicating with have the same level of knowledge. As we progress in our fields, we gain more experience, and as this happens, it becomes harder to recreate a state of mind without this new knowledge. It is a significant barrier to expressing empathy in documentation.

The ramifications of the Curse of Knowledge can be pretty harmful. For example, it can cause us to leave out whole steps in step-by-step instructions, like needing to install a particular piece of software or script. It can also lead to oversimplifying things and using words such as “simply,” “easy,” “just,” and other words whose meaning varies with experience level.

So, what’s the solution? Remove those words from your documentation. At best, they don’t help anyone. At worst, they are demeaning when you are struggling in an incident.

Other solutions include making sure people at all levels who might be using the runbooks have a chance to review them and catch anything that might have been missed. To do this effectively, though, you need to have a collaborative environment where someone feels comfortable speaking up about something that feels left out or is confusing. The best part of doing this work is that you are working towards more empathetic documentation.

SRE Documentation Glossaries

Glossaries can be helpful for a few reasons:

- Glossaries help you repeat yourself less. When you can refer to a definition with a linked explanation, you save yourself time and words.
- Glossaries make descriptions more consistent. If something is explained in five different ways, it can get confusing.
- Glossaries allow a runbook to be more easily used by engineers with different levels of experience. By referencing a glossary in your runbook, you allow someone newer to the on-call rotation to get the explanation of concepts or terms they need. For more experienced on-call engineers, you remove extraneous information from the runbook.

Also, make sure to add unique acronyms to the glossary. Some teams and organizations use unique acronyms that might not be widely known. A glossary is a great place to explain them.

Prevent Runbook Search Failure

Users quickly glance over documentation to try to find what they are looking for. Commonly, they miss critical information because of the structure or format of the documentation, causing what I call “search failure.”

The Nielsen Norman Group has been researching how people read on the web through eye-tracking studies for years. They found that people often read in F-shape patterns. The two implications they pointed out from this pattern are that the “first lines of text on a page receive more gazes than subsequent lines of text on the same page” and the “first few words on the left of each line of text receive more fixations than subsequent words on the same line.”

So, what does this mean for your runbook documentation? Your runbook templates must include a section at the top to describe in one sentence the intent of the runbook. This can help an engineer quickly confirm if they are looking at the right information.

Also, you should only have one step, command, or instruction per paragraph or list item. It will make it easier for readers not to miss a step. Along with this, shorter sentences reduce the chances of search failure. Long, drawn-out paragraphs and sentences often get glanced over, so make sure to break up different information into new sentences, paragraphs, and list items.

Readable Runbook Steps

Often paragraphs in a runbook can become more readable if they are turned into a bulleted list. If order matters, number the list items, turning the list into numbered steps (e.g., 1, 2, 3). This makes the steps easier to follow and to reference. When treated as a checklist, a numbered list can keep readers from skipping steps during incidents.

Even a basic three-item list in a sentence can be improved. For example, quickly read the sentence below from another Transposit blog post:

This blog post is the second in a series of a few posts where I’ll cover how Transposit uses the OpenAPI Specification, AWS APIs, Boto, and why we had to support them differently with OpenAPI, and how we created OpenAPI extensions and what we learned from this process.

And now quickly read the bulleted list below:

This blog post is the second in a series of a few posts where I’ll cover:

- How Transposit uses the OpenAPI Specification
- AWS APIs, Boto, and why we had to support them differently with OpenAPI
- How we created OpenAPI extensions and what we learned from this process

Which did you get more information out of? (Most likely, the latter.) Anytime there is a list in a sentence, turn it into a bulleted list. It will help prevent search failure and help readers of your list absorb the information.

Lastly, start sentences in your lists with an imperative verb, also known as a command. For example, use words like “download,” “configure,” “restart,” and “open.” This helps readers since their eyes will likely only scan the first few words on the left of each line of text.

Code in Runbooks

If you have ever copied and pasted anything from some documentation to the command line, you’ve probably encountered some “command not found” problem. Whether the cause is documentation that includes the $ or a library that should have been installed first, it is vital to give the user context. For example, instead of including the $ to represent using a command on your command line, instruct the user where to use the command instead: “In your terminal…” Or, if there are some installation prerequisites, describe them in the runbook or add a link to them.

Lastly, if a script is longer than a single line, treat it like code, and check it into a repository so that it is source controlled and potentially tested. This will help ensure that quality is maintained and that incorrect or even dangerous scripts don’t get used during the response to an incident.

Writing Runbook Documentation Is Hard

Hopefully, these tips and tricks will help you when you are writing a new runbook or updating an existing one. Like learning any new skill, it can be hard and takes practice. Having on-call teammates of all skill levels review your runbooks can be very helpful for improving your skills. Now go improve some runbooks!

P.S. Check out the blog post that my teammate, Dan, wrote about what makes a good runbook.
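As a follow-up to the advice in The Curse of Knowledge about removing words such as “simply,” “easy,” and “just,” here is a minimal sketch of how a team might check a runbook for them automatically. It assumes the runbook is available as plain text. The function name `find_discouraged_words` and the word list are hypothetical, made up for illustration, so adjust both to whatever your team agrees on.

```python
import re

# Words the post suggests removing. This list is an assumption: extend it for your team.
DISCOURAGED_WORDS = ("simply", "easy", "just")


def find_discouraged_words(text, words=DISCOURAGED_WORDS):
    """Return (line_number, word) pairs for each discouraged word found in text."""
    # \b word boundaries keep "just" from matching inside words like "adjust".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in words) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in pattern.finditer(line):
            hits.append((line_number, match.group(1).lower()))
    return hits


if __name__ == "__main__":
    runbook = (
        "1. Simply restart the service.\n"
        "2. Open the dashboard.\n"
        "3. Just run the easy script."
    )
    for line_number, word in find_discouraged_words(runbook):
        print(f"line {line_number}: consider removing '{word}'")
```

A check like this could run in code review alongside other linters, but it is only a prompt for a human reviewer. It can't tell whether a step was left out, which is why review by teammates of all experience levels still matters.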