Site Reliability Engineering [Book] - O'Reilly


Site Reliability Engineering
by Betsy Beyer, Chris Jones, Niall Richard Murphy, Jennifer Petoff
Released April 2016
Publisher(s): O'Reilly Media, Inc.
ISBN: 9781491929124

Book description

The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient, lessons directly applicable to your organization.

This book is divided into four sections:

Introduction: Learn what site reliability engineering is and why it differs from conventional IT industry practices
Principles: Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
Practices: Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems
Management: Explore Google's best practices for training, communication, and meetings that your organization can use

Table of contents

Foreword
Preface
  Conventions Used in This Book
  Using Code Examples
  O’Reilly Safari
  How to Contact Us
  Acknowledgments
I. Introduction
1. Introduction
  The Sysadmin Approach to Service Management
  Google’s Approach to Service Management: Site Reliability Engineering
  Tenets of SRE
  Ensuring a Durable Focus on Engineering
  Pursuing Maximum Change Velocity Without Violating a Service’s SLO
  Monitoring
  Emergency Response
  Change Management
  Demand Forecasting and Capacity Planning
  Provisioning
  Efficiency and Performance
  The End of the Beginning
2. The Production Environment at Google, from the Viewpoint of an SRE
  Hardware
  System Software That “Organizes” the Hardware
  Managing Machines
  Storage
  Networking
  Other System Software
  Lock Service
  Monitoring and Alerting
  Our Software Infrastructure
  Our Development Environment
  Shakespeare: A Sample Service
  Life of a Request
  Job and Data Organization
II. Principles
3. Embracing Risk
  Managing Risk
  Measuring Service Risk
  Risk Tolerance of Services
  Identifying the Risk Tolerance of Consumer Services
  Identifying the Risk Tolerance of Infrastructure Services
  Motivation for Error Budgets
  Forming Your Error Budget
  Benefits
4. Service Level Objectives
  Service Level Terminology
  Indicators
  Objectives
  Agreements
  Indicators in Practice
  What Do You and Your Users Care About?
  Collecting Indicators
  Aggregation
  Standardize Indicators
  Objectives in Practice
  Defining Objectives
  Choosing Targets
  Control Measures
  SLOs Set Expectations
  Agreements in Practice
5. Eliminating Toil
  Toil Defined
  Why Less Toil Is Better
  What Qualifies as Engineering?
  Is Toil Always Bad?
  Conclusion
6. Monitoring Distributed Systems
  Definitions
  Why Monitor?
  Setting Reasonable Expectations for Monitoring
  Symptoms Versus Causes
  Black-Box Versus White-Box
  The Four Golden Signals
  Worrying About Your Tail (or, Instrumentation and Performance)
  Choosing an Appropriate Resolution for Measurements
  As Simple as Possible, No Simpler
  Tying These Principles Together
  Monitoring for the Long Term
  Bigtable SRE: A Tale of Over-Alerting
  Gmail: Predictable, Scriptable Responses from Humans
  The Long Run
  Conclusion
7. The Evolution of Automation at Google
  The Value of Automation
  Consistency
  A Platform
  Faster Repairs
  Faster Action
  Time Saving
  The Value for Google SRE
  The Use Cases for Automation
  Google SRE’s Use Cases for Automation
  A Hierarchy of Automation Classes
  Automate Yourself Out of a Job: Automate ALL the Things!
  Soothing the Pain: Applying Automation to Cluster Turnups
  Detecting Inconsistencies with Prodtest
  Resolving Inconsistencies Idempotently
  The Inclination to Specialize
  Service-Oriented Cluster-Turnup
  Borg: Birth of the Warehouse-Scale Computer
  Reliability Is the Fundamental Feature
  Recommendations
8. Release Engineering
  The Role of a Release Engineer
  Philosophy
  Self-Service Model
  High Velocity
  Hermetic Builds
  Enforcement of Policies and Procedures
  Continuous Build and Deployment
  Building
  Branching
  Testing
  Packaging
  Rapid
  Deployment
  Configuration Management
  Conclusions
  It’s Not Just for Googlers
  Start Release Engineering at the Beginning
9. Simplicity
  System Stability Versus Agility
  The Virtue of Boring
  I Won’t Give Up My Code!
  The “Negative Lines of Code” Metric
  Minimal APIs
  Modularity
  Release Simplicity
  A Simple Conclusion
III. Practices
10. Practical Alerting from Time-Series Data
  The Rise of Borgmon
  Instrumentation of Applications
  Collection of Exported Data
  Storage in the Time-Series Arena
  Labels and Vectors
  Rule Evaluation
  Alerting
  Sharding the Monitoring Topology
  Black-Box Monitoring
  Maintaining the Configuration
  Ten Years On…
11. Being On-Call
  Introduction
  Life of an On-Call Engineer
  Balanced On-Call
  Balance in Quantity
  Balance in Quality
  Compensation
  Feeling Safe
  Avoiding Inappropriate Operational Load
  Operational Overload
  A Treacherous Enemy: Operational Underload
  Conclusions
12. Effective Troubleshooting
  Theory
  In Practice
  Problem Report
  Triage
  Examine
  Diagnose
  Test and Treat
  Negative Results Are Magic
  Cure
  Case Study
  Making Troubleshooting Easier
  Conclusion
13. Emergency Response
  What to Do When Systems Break
  Test-Induced Emergency
  Details
  Response
  Findings
  Change-Induced Emergency
  Details
  Response
  Findings
  Process-Induced Emergency
  Details
  Response
  Findings
  All Problems Have Solutions
  Learn from the Past. Don’t Repeat It.
  Keep a History of Outages
  Ask the Big, Even Improbable, Questions: What If…?
  Encourage Proactive Testing
  Conclusion
14. Managing Incidents
  Unmanaged Incidents
  The Anatomy of an Unmanaged Incident
  Sharp Focus on the Technical Problem
  Poor Communication
  Freelancing
  Elements of Incident Management Process
  Recursive Separation of Responsibilities
  A Recognized Command Post
  Live Incident State Document
  Clear, Live Handoff
  A Managed Incident
  When to Declare an Incident
  In Summary
15. Postmortem Culture: Learning from Failure
  Google’s Postmortem Philosophy
  Collaborate and Share Knowledge
  Introducing a Postmortem Culture
  Conclusion and Ongoing Improvements
16. Tracking Outages
  Escalator
  Outalator
  Aggregation
  Tagging
  Analysis
  Unexpected Benefits
17. Testing for Reliability
  Types of Software Testing
  Traditional Tests
  Production Tests
  Creating a Test and Build Environment
  Testing at Scale
  Testing Scalable Tools
  Testing Disaster
  The Need for Speed
  Pushing to Production
  Expect Testing Fail
  Integration
  Production Probes
  Conclusion
18. Software Engineering in SRE
  Why Is Software Engineering Within SRE Important?
  Auxon Case Study: Project Background and Problem Space
  Traditional Capacity Planning
  Our Solution: Intent-Based Capacity Planning
  Intent-Based Capacity Planning
  Precursors to Intent
  Introduction to Auxon
  Requirements and Implementation: Successes and Lessons Learned
  Raising Awareness and Driving Adoption
  Team Dynamics
  Fostering Software Engineering in SRE
  Successfully Building a Software Engineering Culture in SRE: Staffing and Development Time
  Getting There
  Conclusions
19. Load Balancing at the Frontend
  Power Isn’t the Answer
  Load Balancing Using DNS
  Load Balancing at the Virtual IP Address
20. Load Balancing in the Datacenter
  The Ideal Case
  Identifying Bad Tasks: Flow Control and Lame Ducks
  A Simple Approach to Unhealthy Tasks: Flow Control
  A Robust Approach to Unhealthy Tasks: Lame Duck State
  Limiting the Connections Pool with Subsetting
  Picking the Right Subset
  A Subset Selection Algorithm: Random Subsetting
  A Subset Selection Algorithm: Deterministic Subsetting
  Load Balancing Policies
  Simple Round Robin
  Least-Loaded Round Robin
  Weighted Round Robin
21. Handling Overload
  The Pitfalls of “Queries per Second”
  Per-Customer Limits
  Client-Side Throttling
  Criticality
  Utilization Signals
  Handling Overload Errors
  Deciding to Retry
  Load from Connections
  Conclusions
22. Addressing Cascading Failures
  Causes of Cascading Failures and Designing to Avoid Them
  Server Overload
  Resource Exhaustion
  Service Unavailability
  Preventing Server Overload
  Queue Management
  Load Shedding and Graceful Degradation
  Retries
  Latency and Deadlines
  Slow Startup and Cold Caching
  Always Go Downward in the Stack
  Triggering Conditions for Cascading Failures
  Process Death
  Process Updates
  New Rollouts
  Organic Growth
  Planned Changes, Drains, or Turndowns
  Testing for Cascading Failures
  Test Until Failure and Beyond
  Test Popular Clients
  Test Noncritical Backends
  Immediate Steps to Address Cascading Failures
  Increase Resources
  Stop Health Check Failures/Deaths
  Restart Servers
  Drop Traffic
  Enter Degraded Modes
  Eliminate Batch Load
  Eliminate Bad Traffic
  Closing Remarks
23. Managing Critical State: Distributed Consensus for Reliability
  Motivating the Use of Consensus: Distributed Systems Coordination Failure
  Case Study 1: The Split-Brain Problem
  Case Study 2: Failover Requires Human Intervention
  Case Study 3: Faulty Group-Membership Algorithms
  How Distributed Consensus Works
  Paxos Overview: An Example Protocol
  System Architecture Patterns for Distributed Consensus
  Reliable Replicated State Machines
  Reliable Replicated Datastores and Configuration Stores
  Highly Available Processing Using Leader Election
  Distributed Coordination and Locking Services
  Reliable Distributed Queuing and Messaging
  Distributed Consensus Performance
  Multi-Paxos: Detailed Message Flow
  Scaling Read-Heavy Workloads
  Quorum Leases
  Distributed Consensus Performance and Network Latency
  Reasoning About Performance: Fast Paxos
  Stable Leaders
  Batching
  Disk Access
  Deploying Distributed Consensus-Based Systems
  Number of Replicas
  Location of Replicas
  Capacity and Load Balancing
  Monitoring Distributed Consensus Systems
  Conclusion
24. Distributed Periodic Scheduling with Cron
  Cron
  Introduction
  Reliability Perspective
  Cron Jobs and Idempotency
  Cron at Large Scale
  Extended Infrastructure
  Extended Requirements
  Building Cron at Google
  Tracking the State of Cron Jobs
  The Use of Paxos
  The Roles of the Leader and the Follower
  Storing the State
  Running Large Cron
  Summary
25. Data Processing Pipelines
  Origin of the Pipeline Design Pattern
  Initial Effect of Big Data on the Simple Pipeline Pattern
  Challenges with the Periodic Pipeline Pattern
  Trouble Caused By Uneven Work Distribution
  Drawbacks of Periodic Pipelines in Distributed Environments
  Monitoring Problems in Periodic Pipelines
  “Thundering Herd” Problems
  Moiré Load Pattern
  Introduction to Google Workflow
  Workflow as Model-View-Controller Pattern
  Stages of Execution in Workflow
  Workflow Correctness Guarantees
  Ensuring Business Continuity
  Summary and Concluding Remarks
26. Data Integrity: What You Read Is What You Wrote
  Data Integrity’s Strict Requirements
  Choosing a Strategy for Superior Data Integrity
  Backups Versus Archives
  Requirements of the Cloud Environment in Perspective
  Google SRE Objectives in Maintaining Data Integrity and Availability
  Data Integrity Is the Means; Data Availability Is the Goal
  Delivering a Recovery System, Rather Than a Backup System
  Types of Failures That Lead to Data Loss
  Challenges of Maintaining Data Integrity Deep and Wide
  How Google SRE Faces the Challenges of Data Integrity
  The 24 Combinations of Data Integrity Failure Modes
  First Layer: Soft Deletion
  Second Layer: Backups and Their Related Recovery Methods
  Overarching Layer: Replication
  1T Versus 1E: Not “Just” a Bigger Backup
  Third Layer: Early Detection
  Knowing That Data Recovery Will Work
  Case Studies
  Gmail—February, 2011: Restore from GTape
  Google Music—March 2012: Runaway Deletion Detection
  General Principles of SRE as Applied to Data Integrity
  Beginner’s Mind
  Trust but Verify
  Hope Is Not a Strategy
  Defense in Depth
  Conclusion
27. Reliable Product Launches at Scale
  Launch Coordination Engineering
  The Role of the Launch Coordination Engineer
  Setting Up a Launch Process
  The Launch Checklist
  Driving Convergence and Simplification
  Launching the Unexpected
  Developing a Launch Checklist
  Architecture and Dependencies
  Integration
  Capacity Planning
  Failure Modes
  Client Behavior
  Processes and Automation
  Development Process
  External Dependencies
  Rollout Planning
  Selected Techniques for Reliable Launches
  Gradual and Staged Rollouts
  Feature Flag Frameworks
  Dealing with Abusive Client Behavior
  Overload Behavior and Load Tests
  Development of LCE
  Evolution of the LCE Checklist
  Problems LCE Didn’t Solve
  Conclusion
IV. Management
28. Accelerating SREs to On-Call and Beyond
  You’ve Hired Your Next SRE(s), Now What?
  Initial Learning Experiences: The Case for Structure Over Chaos
  Learning Paths That Are Cumulative and Orderly
  Targeted Project Work, Not Menial Work
  Creating Stellar Reverse Engineers and Improvisational Thinkers
  Reverse Engineers: Figuring Out How Things Work
  Statistical and Comparative Thinkers: Stewards of the Scientific Method Under Pressure
  Improv Artists: When the Unexpected Happens
  Tying This Together: Reverse Engineering a Production Service
  Five Practices for Aspiring On-Callers
  A Hunger for Failure: Reading and Sharing Postmortems
  Disaster Role Playing
  Break Real Things, Fix Real Things
  Documentation as Apprenticeship
  Shadow On-Call Early and Often
  On-Call and Beyond: Rites of Passage, and Practicing Continuing Education
  Closing Thoughts
29. Dealing with Interrupts
  Managing Operational Load
  Factors in Determining How Interrupts Are Handled
  Imperfect Machines
  Cognitive Flow State
  Do One Thing Well
  Seriously, Tell Me What to Do
  Reducing Interrupts
30. Embedding an SRE to Recover from Operational Overload
  Phase 1: Learn the Service and Get Context
  Identify the Largest Sources of Stress
  Identify Kindling
  Phase 2: Sharing Context
  Write a Good Postmortem for the Team
  Sort Fires According to Type
  Phase 3: Driving Change
  Start with the Basics
  Get Help Clearing Kindling
  Explain Your Reasoning
  Ask Leading Questions
  Conclusion
31. Communication and Collaboration in SRE
  Communications: Production Meetings
  Agenda
  Attendance
  Collaboration within SRE
  Team Composition
  Techniques for Working Effectively
  Case Study of Collaboration in SRE: Viceroy
  The Coming of the Viceroy
  Challenges
  Recommendations
  Collaboration Outside SRE
  Case Study: Migrating DFP to F1
  Conclusion
32. The Evolving SRE Engagement Model
  SRE Engagement: What, How, and Why
  The PRR Model
  The SRE Engagement Model
  Alternative Support
  Production Readiness Reviews: Simple PRR Model
  Engagement
  Analysis
  Improvements and Refactoring
  Training
  Onboarding
  Continuous Improvement
  Evolving the Simple PRR Model: Early Engagement
  Candidates for Early Engagement
  Benefits of the Early Engagement Model
  Evolving Services Development: Frameworks and SRE Platform
  Lessons Learned
  External Factors Affecting SRE
  Toward a Structural Solution: Frameworks
  New Service and Management Benefits
  Conclusion
V. Conclusions
33. Lessons Learned from Other Industries
  Meet Our Industry Veterans
  Preparedness and Disaster Testing
  Relentless Organizational Focus on Safety
  Attention to Detail
  Swing Capacity
  Simulations and Live Drills
  Training and Certification
  Focus on Detailed Requirements Gathering and Design
  Defense in Depth and Breadth
  Postmortem Culture
  Automating Away Repetitive Work and Operational Overhead
  Structured and Rational Decision Making
  Conclusions
34. Conclusion
A. Availability Table
B. A Collection of Best Practices for Production Services
  Fail Sanely
  Progressive Rollouts
  Define SLOs Like a User
  Error Budgets
  Monitoring
  Postmortems
  Capacity Planning
  Overloads and Failure
  SRE Teams
C. Example Incident State Document
D. Example Postmortem
  Lessons Learned
  Timeline
  Supporting information:
E. Launch Coordination Checklist
F. Example Production Meeting Minutes
Bibliography
Index


