awesome-sre/README.md at master - GitHub
Article recommendation score: 80%
Repository: dastergon / awesome-sre (public; 1.2k forks, 8.6k stars). File: awesome-sre/README.md, branch master, 551 lines (524 sloc), 59.7 KB. Latest commit e1a6130 on Jun 29, 2022 by franciscoed ("Adding Observability engineering book"). 78 contributors.

Awesome Site Reliability Engineering
A curated list of awesome Site Reliability and Production Engineering resources.

What is Site Reliability Engineering?
"Fundamentally, it's what happens when you ask a software engineer to design an operations function." - Ben Treynor Sloss, VP Google Engineering, founder of Google SRE

Contributing
Please take a look at the contribution guidelines first. Contributions are always welcome!

Contents
- Culture
- Education
- Books
- Hiring
- Reliability
- Monitoring & Observability & Alerting
- On-Call
- Post-Mortem
- Capacity Planning
- Service Level Agreement
- Performance
- Programming
- Misc Articles
- Real-time Messaging
- Blogs
- Newsletters
- Conferences & Meetups
- Twitter
- SRE Tools
- SRE Podcasts

Culture
- What is Site Reliability Engineering?
- Keys To SRE by Ben Treynor
- Google SRE Resources
- Notes from Production Engineering by Pedro Canahuati
- PostOps: Recovery from Operations
- Love DevOps? Wait 'till you meet SRE [video]
- How Google Does Planet-Scale Engineering for Planet-Scale Infra
- Site Reliability Engineering at Facebook
- A History of Site Reliability Engineering at Uber
- Case Study: Adopting SRE Principles at StackOverflow
- Site Reliability Engineering at Dropbox
- Site Reliability Engineers — Keeping Google up and running 24/7
- Site Reliability Engineering at Salesforce
- From SysAdmin to Netflix SRE - video and slides
- SRE@Google: Thousands of DevOps Since 2004
- Transactional System Administration Is Killing Us and Must be Stopped
- A hierarchy of SRE needs
- PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability
- SRE: An incomplete guide to cultural Narnia - [Video]
- Putting Together Great SRE Teams
- Work at Google: Meet our Production Engineers for Site Reliability Hangout on Air
- Toil: A Word Every Engineer Should Know
- Engineering Reliability into Web Sites: Google SRE
- DEVOPS & SRE AMA - Building High Performance Organizations
- John Allspaw's AMA on Incident Analysis and Postmortems
- Site Reliability Engineering with Paul Newson - Part 1 & Part 2
- How SysAdmins Devalue Themselves
- The Softer Side of DevOps
- SRE, noun. See also: confidence, trust.
- Site Reliability Engineering with Stephen Weinberg
- We are the Google Site Reliability team. We make Google’s websites work. Ask us Anything!
- We are the Google Site Reliability Engineering team. Ask us Anything!
- The Ops Identity Crisis
- The Irreproducibility Of Bugs In Large-Scale Production Systems
- SE-Radio Episode 276: Björn Rabenstein on Site Reliability Engineering
- Microservices, DevOps and Production Complexity
- Introducing Google Customer Reliability Engineering
- Evolution or Rebellion? The rise of Site Reliability Engineers (SRE)
- The difference between Site Reliability Engineering, System Administration, and DevOps
- SRE in the Small and in the Large
- SBSRE Meetup: Different SRE roles and challenges (Netflix)
- Panel: Who/What Is SRE?
- Hope Is Not a Strategy
- Tenets of SRE
- Site Reliability Engineering Demystified
- Is Site Reliability Engineering the True ‘Ops’ in DevOps?
- SRE vs. DevOps vs. Cloud Native: The Server Cage Match
- SRE: What’s The Big Idea?
- Building the SRE Culture at LinkedIn
- Podcast #111 – SRE: Occasionally Maintaining Infrastructure That You Hate
- Splicing SRE DNA Sequences in the Biggest Software Company on the Planet
- Why should your app get SRE support? - CRE life lessons
- How SREs find the landmines in a service - CRE life lessons
- Making the most of an SRE service takeover - CRE life lessons
- The Cloudcast #301: SRE and Infrastructure Operations (Podcast)
- The SRE model
- Onboarding New Site Reliability Engineers
- Building Blocks for Site Reliability At Google
- Beyond Google SRE: What is Site Reliability Engineering like at Medium?
- Intelligent Site Reliability Engineering – A Machine Learning Perspective
- A crash course in LinkedIn's global site operations
- Google’s Site Reliability Engineering with Todd Underwood
- What is Site Reliability Engineering? (VMware)
- A Gentle Introduction to SRE
- Understanding Site Reliability Engineering through Movies and Books
- GOTO 2017 • Site Reliability Engineering at Google • Christof Leng
- The Makeup of Successful Geographically-Distributed SRE Teams - Part 1 & Part 2
- Tech Leadership in SRE
- The Azure Podcast: Episode 227 - Azure SRE
- The human scalability of "DevOps"
- Podcast: Site Reliability Management with Mike Hiraga
- How a cat inspired system reliability at Knowlarity
- Getting Started with Site Reliability Engineering
- "Practical Applications of the Dickerson Pyramid" by Nat Welch
- LinkedIn’s Kurt Andersen Uncovers Blindspots in SRE Implementations
- Interview with Betsy Beyer, Stephen Thorne of Google
- Less Risk Through Greater Humanity - Dave Rensin
- Getting Started with SRE - Stephen Thorne, Google
- Building Successful SRE in Large Enterprises
- Solving Reliability Fears with Site Reliability Engineering
- SRE vs. DevOps: competing standards or close friends?
- How to Avoid the 5 SRE Implementation Traps that Catch Even the Best Teams
- Reliability Engineering – The Essential Discipline for Complex Systems
- The Modern Site Reliability Workbench on Top of OCI
- SRE in the Third Age
- About SRE and how (not) to apply it
- Transitioning a typical engineering ops team into an SRE powerhouse
- Making a Lion Bulletproof: SRE in Banking
- Identifying and tracking toil using SRE principles
- From Ops to SRE: Evolution of the OpenShift Dedicated Team
- Meeting reliability challenges with SRE principles
- A quick introduction to SRE principles
- The SRE I Aspire to Be
- Taming Operational Load with VMware CRE
- SRE Cultural Values
- Are we there yet? Thoughts on assessing an SRE team’s maturity
- What SREs have to do with project-based services?
- Making operational work more visible
- SRE vs. DevOps: What’s the Difference Between Them?

Education
- Panel: Educating SRE
- From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams
- New to an SRE team?
- The Systems Engineering Side of Site Reliability Engineering
- Graduating from Bootcamp and interested in becoming a Site Reliability Engineer?
- So you want to be a Site Reliability Engineer?
- Spiraling Ops Debt & the SRE Coding Imperative
- So you want to be an SRE?
- Career Profiles/Site Reliability Engineer
- What is the role of a Site Reliability Engineer?
- Lynda.com: DevOps Foundations: Site Reliability Engineering
- Incident Management Training: Wheel of Misfortune
- Site Un-Reliability Engineering [Video Series]
- The Ultimate Guide to Structuring a 90-Day Onboarding Plan
- SRE fundamentals: SLIs, SLAs and SLOs
- How to Get Into SRE
- Do you have an SRE team yet? How to start and assess your journey
- How SRE teams are organized, and how to get started
- Why SRE Documents Matter
- How to get started with site reliability engineering (SRE)
- Duties of a Site Reliability Engineering Manager
- Designing distributed systems using NALSD flashcards
- Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program
- SRE Classroom: Distributed PubSub workshop
- School of SRE: Curriculum for onboarding non-traditional hires and new grads

Books
- Practical Linux Infrastructure
- Site Reliability Engineering: How Google Runs Production Systems
- The Site Reliability Workbook: Practical Ways to Implement SRE
- Observability Engineering: Achieving Production Excellence
- The Practice Of Cloud System Administration: Designing and Operating Large Distributed Systems
- Web Operations - Keeping the Data On Time
- The Checklist Manifesto: How to Get Things Right
- Microservices in Production - Standard Principles and Requirements
- Production-Ready Microservices - Building Standardized Systems Across an Engineering Organization
- Systems Performance: Enterprise and the Cloud [Sample chapter titled CPUs]
- Monitoring Distributed Systems: Case Studies from Google's SRE Teams
- The Human Side of Postmortems: Managing Stress and Cognitive Biases
- Chaos Engineering: Building Confidence in System Behavior through Experiment
- Post-Incident Reviews: Learning from Failure for Improved Incident Responses
- Antifragile Systems and Teams
- How to Monitoring the SRE Golden Signals (E-Book)
- Incident Management for Operations
- Real-World SRE
- Seeking SRE
- What is SRE?
- Engineering Reliable Mobile Applications: Strategies for Developing Resilient Native Mobile Applications
- Building Secure and Reliable Systems
- Chaos Engineering: Crash test your applications
- 97 Things Every SRE Should Know
- Four Steps to Creating Effective Game Day Tests
- The Linux Programming Interface

Hiring
- SRE Hiring
- Hiring SREs at LinkedIn
- Hiring Site Reliability Engineers
- Hiring your first SRE
- Growing the Site Reliability Team at LinkedIn: Hiring is Hard
- Engineering Manager - Site Reliability Engineering Interview Preparation

Reliability
- The Realities of the Job of Delivering Reliability
- Fail at Scale by Ben Maurer
- Embracing Failure: Fault-Injection and Service Reliability
- 10 Years of Crashing Google
- How we break things at Twitter: failure testing
- Reliable Cron across the Planet
- Push our limits - reliability testing at Twitter
- The Verification of a Distributed System by Caitie McCaffrey
- Weathering the Unexpected
- SRE Hour: Tech Talks by Box & Yelp
- Simplicity: A Prerequisite for Reliability
- The Two Sides to Google Infrastructure for Everyone Else
- How Embracing Continuous Release Reduced Change Complexity
- Making "Push On Green" a Reality
- BeyondCorp: A New Approach to Enterprise Security
- Brainstorming Failure by Jeff Smith
- The Ripple Effect Of Outages And Downtime Cannot Be Underestimated
- The infrastructure behind Twitter: efficiency and optimization
- Dickerson's Hierarchy of Reliability
- The Morning Paper on Operability
- Production is all that matters
- Using load shedding to survive a success disaster - CRE life lessons
- How to avoid a self-inflicted DDoS Attack - CRE life lessons
- Don't gamble when it comes to reliability
- Resilience Engineering: Learning to Embrace Failure
- The Infrastructure Behind Twitter: Scale
- Scaling Reliability at Twitter: So You Want to Add a 9
- Principles Of Chaos Engineering
- Chaos Engineering
- Available... or not? That is the question - CRE life lessons
- How Google Backs Up The Internet Along With Exabytes Of Other Data
- Performance, Scalability, And High Availability: 3 Key Infrastructure Adaptability Requirements
- The Production Environment at Google - Part 1 & Part 2
- Reliable releases and rollbacks - CRE life lessons
- How release canaries can save your bacon - CRE life lessons
- Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites
- Every Day Is Monday in Operations
- Under the Hood: Ensuring Site Reliability
- Designing reliable systems with cloud infrastructure (Google Cloud Next '17)
- A Google SRE explores GitHub reliability with BigQuery
- Know thy enemy: how to prioritize and communicate risks - CRE life lessons
- Chaos Engineering resources
- CRE life lessons: What is a dark launch, and what does it do for me?
- Why you should pick strong consistency, whenever possible
- The Network is Reliable
- Are You Load Balancing Wrong?
- How production engineers support global events on Facebook
- Google: A Collection Of Best Practices For Production Services
- Canary Analysis Service
- Tips for High Availability
- Progressive Service Architecture At Auth0
- Google Cloud Production Guideline
- production readiness
- Trust By Design: The Fusion of Operational Maturity and Risk Modeling
- Top Seven Myths of Robust Systems
- Taming chaos: Preparing for your next incident
- PID Loops and the Art of Keeping Systems Stable
- Are you ready for production? - Slides
- Production Checklist for Web Apps on Kubernetes
- Finding a problem at the bottom of the Google stack
- Rethinking Task Size in SRE
- How maintenance windows affect your error budget
- The Production Readiness Spectrum
- Generic mitigations
- How we’re building a production readiness review process at Grafana Labs
- Resiliency Planning for High-Traffic Events

Monitoring & Observability & Alerting
- A Working Theory-of-Monitoring
- The Evolution of Monitoring Systems at Google - Tony Rippy
- Monitoring without Infrastructure @ Airbnb
- Monitoring distributed systems
- Observability at Uber Engineering: Past, Present, Future
- The 4 Golden Signals of API Health and Performance in Cloud-Native Applications
- My Philosophy on Alerting by Rob Ewaschuk
- Time To Detect - Netflix
- Why Percentiles Don’t Work the Way you Think
- Building Twitter’s Next-Gen Alerting System
- Instrumentation: Worst case performance matters
- Instrumentation: What does 'uptime' mean?
- Incidents + Outages at CircleCI: Our Playbook and What We’ve Learned
- An introduction to monitoring and alerting with timeseries at scale, with Prometheus
- Detecting outliers and anomalies in realtime at Datadog
- How to Monitor the SRE Golden Signals
- Monitoring in a DevOps World
- Monitoring Your Monitoring’s Monitoring
- Observability: the new wave or buzzword?
- Monitoring Isn't Observability
- Monitoring in the time of Cloud Native
- Principles of Monitoring Microservices
- The Many Ways Your Monitoring Is Lying to You
- GitOps Part 3 - Observability
- Want to Debug Latency?
- Debugging Latency in Go 1.11
- Alerting on SLOs like Pros
- Applied Alerting Philosophy
- Observations on Observability
- Deploys: It's Not Actually About Fridays
- Site Reliability Engineering Best Practices for Data Pipelines
- Elastic Observability in SRE and Incident Response

On-Call
- Being an On-Call Engineer: A Google SRE Perspective
- Inside Atlassian: how our site reliability engineers do incident management
- Inside Atlassian: how IT & SRE use ChatOps to run incident management
- Incident Response at Heroku
- Who's On Call?
- SysAdvent - Day 6 - No More On-Call Martyrs
- On Being On Call
- The On-Call Handbook
- Incident management at Google — adventures in SRE-land
- RunBook / Operations Manual template
- Automating Your Oncall: Open Sourcing Fossor and Ascii Etch
- Project STAR*: Streamlining Our On-Call Process
- SRE @ Xero: Managing Incidents Part I
- SRE @ Xero: Managing Incidents Part II
- How To Establish a High Severity Incident Management Program
- How Your Systems Keep Running Day After Day - John Allspaw
- On-call doesn’t have to suck
- Why, as a Netflix infrastructure manager, am I on call?
- Oncall and Sustainable Software Development
- On Call Rotations: How Best to Wake Devs Up in the Middle of the Night
- Understanding The Role Of The Incident Manager On-Call (IMOC)
- 3 Ways to Minimize the Impact of High Severity Incidents
- Advice to Management Teams While Enrolling Changes to On-Call Systems
- Moving Past Shallow Incident Data
- Sustainable On-Call
- dotScale 2017 - Aish Raj Dahal - Chaos management during a major incident
- Incident Management at Netflix Velocity
- Incidents, fixes, and the day after
- 10 Steps to Develop an Incident Response Plan You’ll ACTUALLY Use
- Checklists: a stupidly simple but valuable operational gift
- How to write a status page update
- Atlassian Incident Handbook
- PagerDuty Incident Response Handbook
- Avoiding Burnout for SREs
- Better On-Call the SRE way
- Managing Incidents at Monzo
- Making On-Call Not Suck
- How we (Monzo) respond to incidents
- How we’ve evolved on-call at Monzo
- Code Yellow: When Operations Isn’t Perfect
- MTTR is dead, long live CIRT
- Extended Dreyfus Model for Incident Lifecycles
- Inhumanity of Root Cause Analysis
- Incident insights from NASA, NTSB, and the CDC
- How to avoid On-Call Burnout the SRE Way
- My week shadowing a GitLab Site Reliability Engineer
- How our production team runs the weekly on-call handover
- Writing Runbook Documentation When You’re An SRE
- Incident response, programs and you(r startup)
- An Incident Command Training Handbook
- Shrinking the time to mitigate production incidents
- Incident writeup as sociological storytelling
- Elephant in the Blameless War Room: Accountability
- Naming names in incident writeups
- Building On-Call Culture at GitHub

Post-Mortem
- A collection of post-mortems
- Collection of Kubernetes Failure Stories
- Blameless PostMortems and a Just Culture
- A Tale of Postmortems
- Building a Blameless Post-Mortem Culture with Jason Hand
- The infinite hows
- Failure is Always An Option: How a Blameless Culture Leads to Better Results
- SysAdvent - Day 1 - Why You Need a Postmortem Process
- Etsy’s Debriefing Facilitation Guide for Blameless Postmortems
- Writing Your First Postmortem
- How to Write Great Outage Post-Mortems
- A collection of postmortem templates
- Embracing Feedback
- Postmortem Action Items: Plan the Work and Work the Plan
- Social Issues In Postmortems
- Google Has an Official Process in Place for Learning From Failure--and It's Absolutely Brilliant
- Postmortem culture: how you can learn from failure
- re:Work - Postmortem discussion template
- Post-mortems to the rescue
- Postmortem Action Items: Plan the Work and Work the Plan
- Why Every Company Can Benefit from a Blameless Culture
- "It's dead, Jim": How we write an incident postmortem
- Our incident postmortem template
- Learn out of mistakes. Postmortems to the rescue.
- Improving Postmortem Practices with Veteran Google SRE, Steve McGhee
- Inhumanity of Root Cause Analysis

Capacity Planning
- Capacity Planning
- South Bay SRE: Cloud Capacity Planning
- Intent-based Capacity Planning and Autoscaling with Kubernetes
- How do you do Capacity Planning
- How Back Market SREs prepared for Black Friday

Service Level Agreement
- If It's in the Cloud, Get It on Paper: Cloud Computing Contract Issues
- Service Level Agreements in the Cloud: Who cares?
- SysAdvent - Day 20 - How to set and monitor SLAs
- SLOs, SLIs, SLAs, oh my - CRE life lessons
- Service Levels and Error Budgets
- (Un)Reliability Budgets - Finding Balance between Innovation and Reliability
- The Calculus of Service Availability
- Availability Calculator: Calculate how much downtime should be permitted in your SLA
- Standardize cloud SLA availability with numerical performance data
- Best practices to develop SLAs for cloud computing
- A Practical Guide to SLAs
- Building good SLOs - CRE life lessons
- No Grumpy Humans and Other Site Reliability Engineering Lessons from Google
- Consequences of SLO violations — CRE life lessons
- Service Level Objectives in Practice
- SRE Consensus Building
- An example escalation policy — CRE life lessons
- Error Budget Calculator
- Understanding error budget overspend - part one - CRE life lessons
- Good housekeeping for error budgets - part two - CRE life lessons
- SRE fundamentals: SLIs, SLAs and SLOs
- SLOs & You: A Guide To Service Level Objectives
- Earning Our Wings: Stories and Findings From Operating a Large-scale Concourse Deployment
- Nines are Not Enough: Meaningful Metrics for Clouds
- How many nines is my storage system?
- Don't follow the sun.
- The Tyranny of the SLA
- Backblaze Durability is 99.999999999% — And Why It Doesn’t Matter
- DevOpsDays Chicago 2019 - The Art of SLOs
- The Art of SLOs Workshop Materials
- How to Include Latency in SLO-Based Alerting
- Succeeding With Service Level Objectives
- Putting customers first with SLIs and SLOs
- SRE Leadership: Have Tiered SLAs
- How SLOs Enable Fast, Reliable Application Delivery
- The Tail at Scale
- The Tail at Scale Revisited
- Defining SLOs for services with dependencies
- Service Level Disagreements
- How We Use Sloth to do SLO Monitoring and Alerting with Prometheus
- SLI Deep Dive
- Measuring Reliability in GCP: Step By Step SLO creation guide using Cloud Operation Sandbox
- SLO tracker
- SLO Alerting for Mortals
- SRE methods and climate change
- What made SLOs so messy (and what we can do about it)
- SLICK: Adopting SLOs for improved reliability
- Calculating composite SLA
- Best practices for setting SLOs and SLIs for modern, complex systems

Performance
- Performance Checklists for SREs
- South Bay SRE Meetup - Netflix Cloud Performance Team
- Software Performance Analysis Guided By SLOs
- A framework for pragmatic performance engineering

Programming
- Go Language for Ops and Site Reliability Engineering
- Go for SREs using Python
- Operability in Go
- Go Reliability and Durability at Dropbox

Misc Articles
- What is SRE (Site Reliability Engineering)?
- Here’s How Google Makes Sure It (Almost) Never Goes Down
- Are site reliability engineers the next data scientists?
- Site Reliability Engineers: "solving the most interesting problems"
- Site Reliability Engineers: the "world’s most intense pit crew"
- Site reliability engineering kicks rote tasks out of IT ops
- Notes on Site Reliability Engineering
- Adventures in SRE-land: Welcome to Google Mission Control
- Book Review: Site Reliability Engineering - How Google Runs Production Systems
- Site Reliability Engineers: “We solve cooler problems”
- SREcon17: Brave new world of site reliability engineering
- Open AWS guide
- 20 SRE/Devops/System Engineer Tricks
- Commentary on Site Reliability Engineering
- Site Reliability Engineering: 4 Things to Know
- Looking for SRE Success? Then Find the Intrapreneurs!
- What Team Structure is Right for DevOps to Flourish?
- Injured on Vacation? Applying Principles from Site Reliability Engineering to a Travel Emergency
- Building blameless working environment
- SRE Adoption Report
- SREs: The Happiest – and Highest Paid – in the Industry
- The Role of Site Reliability Engineering, Today and Tomorrow
- SRE as a Lifestyle Choice
- SRECon EMEA 2019 Recap
- Life of an SRE at Google - JC van Winkel
- Site Reliability Engineering for Native Mobile Apps - Abhijith Krishnappa - Case study: Halodoc adaptation of SRE principles for Native Mobile Apps
- SRE Best Practices by InfraCloud

Real-time Messaging
- #sre channel at Hangops Slack - Discussion of Site Reliability Engineering generally.
- #incident_response channel at Hangops Slack - Discussion about Incident Response.
- USENIX SREcon Slack

Blogs
- Brendan Gregg's Blog - Highly Technical Blog Posts About Systems Internals, Performance and SRE.
- Everything Sysadmin - Blog Posts About SysAdmin/DevOps/SRE by Tom Limoncelli.
- High Scalability - Technical Blog Posts About Systems Architecture.
- rachelbythebay - Technical Blog Posts.
- Susan J. Fowler - Various blog posts about SRE, Software Engineering and Microservices.
- SysAdvent - One article for each day of December, ending on the 25th article.
- Stephen Thorne's Blog - Blog Posts About SRE
- Increment - A digital magazine about how teams build and operate software systems at scale.
- GopherSRE - Blog Posts about Go and SRE.
- Cindy Sridharan - Blog posts about distributed systems and their management.
- Blameless Blog - Blog posts about SRE culture and practices.
- Resilience Roundup - Weekly analysis of Resilience Engineering and Human Factors research designed for software systems
- Squadcast Blog - Blog posts about SRE best practices, reliability, on-call and incident management.
- FireHydrant Blog - Posts about complex systems, incident response, and SRE best practices.
- Rootly Blog - Incident management best practices and guides.
- incident.io Blog - Guides, advice and resources on incident management and response.
- Logit.io Blog - Resources on log management, SRE and DevOps.

Newsletters
- DevOpsLinks - A weekly newsletter about SRE, SysAdmin and DevOps news, tools, tutorials and opinions.
- KubeWeekly - The weekly newsletters for all things Kubernetes. KubeWeekly is curated by Bob Killen, Chris Short, Craig Box, Kim McMahon and Michael Hausenblas
- SRE Weekly - Weekly Site Reliability Newsletter.
- O’Reilly Systems Engineering and Operations Newsletter - Weekly systems engineering and operations news and insights from industry insiders.
- ChaosEngineering.news - Chaos Engineering newsletter. All things Chaos Engineering, directly to your inbox!

Conferences & Meetups
- SRECon Conferences - The Official SRE Conference.
- LISA Conferences - Prominent Conference About SysAdmin/DevOps/SRE.
- SRE Tech Talks - SRE Talks Hosted by Google.
- South Bay Site Reliability Engineering (Sunnyvale, CA) Meetup - A Group For Individuals Who Tackle Reliability Challenges For Web-Scale Systems.
- San Francisco Reliability Engineering - A Group Of People Who Are Passionate About Reliable, Performant Software Systems.
- Site Reliability Engineering Munich, Germany - SRE Meetup in the greater area of Oktoberfest city.
- ADDO - All Day DevOps - A 24 hour conference that is completely online and free.
- Site Reliability Engineering Paris, France - SRE Meetup in the city of light.
- Site Reliability Engineering India - SRE Meetup India

Twitter
- Google SRE Twitter Account - Google's SRE Twitter Account.
- SRE Book - The Official Twitter Account of Site Reliability Engineering Book.
- SREcon - SRECon's Official Twitter Account.
- SRE Workbook - The Official Twitter Account of Site Reliability Workbook.
- The SRE Dev - SRE-related Posts from dev.to.
- Twitter SRE - The Official Twitter Account of Twitter's SRE team.
- Twitter SRE Weekly - The Official Twitter Account of SRE Weekly Newsletter.
- USENIX Association - The Official USENIX Twitter Account.

SRE Tools
- Awesome SRE Tools - A curated list of Site Reliability and Production Engineering tools
- List of Continuous Integration services
- SRE cheat sheet - A cheat sheet for Site Reliability Engineering principles and numbers
- SRE Capability Map - Overview of all things SRE

Podcasts
- Blameless/Resilience in Action
- Google SRE Prodcast
- o11y Observability Podcast
- On Call Nightmares (retired)
- Making of the SRE Omelette
Further reading
- 1. Chapter 8 - On-Call - Google - Site Reliability Engineering
In SRE, whenever an alert is created, a corresponding playbook entry is usually created. These gu...
- 2. Do you have an SRE team yet? How to start and assess your ...
SRE is an essential part of engineering at Google. ... An operational playbook/runbook should exi...
- 3. Writing Runbook Documentation When You're An SRE
As The Site Reliability Workbook says, playbooks “reduce stress, ... as the Site Reliability Engi...
- 4. Google's Site Reliability Engineering Playbook - Karma Advisory
Google's Site Reliability Engineering Playbook. by Krishan Patel | Apr 6, ... Read on landing.goo...
- 5. Google - Site Reliability Engineering
Thus, Google SRE relies on on-call playbooks, in addition to exercises such as the "Wheel of Misf...