Site Reliability Engineering at Facebook

By Mark Schonbach

I recently returned from visiting family and friends in Delaware, and I was asked by everyone, “What do you do at Facebook?” The best answer that I could give them without launching into a 45-minute technical discussion is that: “I’m responsible for making sure that Facebook is up at all times and performing at its peak.” Obviously, I don’t run one of the largest sites in the world by myself; I’m part of a small team of Site Reliability Engineers (SRE) that works day and night to ensure that you and the other 400+ million users around the world are able to access Facebook, that the site loads quickly, and all of the features are working.

Our Site Reliability Engineering team currently consists of teams in Palo Alto, London and our brand new Dublin, Ireland office. At Facebook, we are very proud of our level of engineering impact with over 1.2 million users per engineer, but we are even more proud of the fact that we keep Facebook up and running with one SRE for every 18 million users. That level of impact is unrivaled compared to other technology companies.

The work that the Site Reliability Engineering team does can best be summarized this way:

Site: Does it work? Facebook’s SRE team is tasked with making sure the site is up and running 24 hours per day, 365 days per year. To support a global user base, we keep a watchful eye on our various internal and external monitoring tools and systems so you’re able to connect and share with friends and family regardless of whether it is noon in New York or midnight in Manila. We manage our high traffic load by balancing the user experience against our available worldwide capacity. The SRE team is empowered with the knowledge and responsibility to fix just about any operational issue we may encounter, problem-solve with other Technical Operations and Engineering teams as appropriate, and follow any issue through to its completion.

Reliability: Does it work well? Facebook’s dirty little secret is that behind the scenes, our infrastructure is extraordinarily complex. While it is extremely rare that the entire site is offline, it is more common that one feature is temporarily unavailable. The SRE team works tirelessly to ensure that not only is the core of Facebook up and running, but also that you can use all of the features of the site, e.g. photo uploads, chat, and Facebook Connect. Even though we work directly with key partners and developers to make sure their applications are working well, we don’t get any special perks for our farms and mafias. We also work with the Release Engineering team to coordinate scheduled and emergency code updates and understand what is being changed and how it could affect the site.

Engineering: Could it work better? We always have one eye looking towards the future. We regularly hack tools on the fly that help us manage and perform complex maintenance procedures on one of the largest, if not the largest, memcached footprints in the world. We develop automated tools to provision new servers, reallocate existing ones, and detect and repair applications or servers that are misbehaving. We are only able to maintain such a high user-to-server ratio due to a knowledgeable and experienced set of engineers. We also track performance issues and look at long-term trends to correct issues and look for ways to make Facebook run even faster and more efficiently.

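To make the idea of detecting misbehaving servers a little more concrete, here is a minimal sketch in Python. It is not Facebook's tooling, which this post does not describe; the function name `find_misbehaving`, the host names, and the threshold values are all hypothetical. It assumes latency samples have already been collected, with `None` standing for a host that did not answer.

```python
from statistics import median


def find_misbehaving(latencies_ms, factor=3.0, timeout_ms=1000.0):
    """Return the sorted host names that look unhealthy.

    A host is flagged when it gave no response (None), exceeded the
    timeout, or was slower than `factor` times the fleet median.
    All thresholds here are illustrative assumptions.
    """
    responded = [v for v in latencies_ms.values() if v is not None]
    if not responded:
        # Nothing answered at all, so every host is suspect.
        return sorted(latencies_ms)

    baseline = median(responded)
    flagged = []
    for host, latency in latencies_ms.items():
        if latency is None or latency > timeout_ms or latency > factor * baseline:
            flagged.append(host)
    return sorted(flagged)


# Hypothetical samples: response time in milliseconds per host.
samples = {
    "web001": 42.0,
    "web002": 45.0,
    "web003": 400.0,  # far slower than its peers
    "web004": None,   # did not respond
    "web005": 39.0,
}
print(find_misbehaving(samples))  # ['web003', 'web004']
```

Comparing each host against the fleet median rather than a fixed number means a single slow machine stands out even when the whole fleet is running a bit slower than usual. A real tool would also need to collect the samples and decide what repair action to take.
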
After I attempt to explain what I do, the next question I am usually asked is, “What do you like most about your job?” Aside from the awesome food every day and the amazingly talented people I work with, the thing I like most about being an SRE is that I never know what I am going to encounter when I arrive at work in the morning. One day could involve spending hours troubleshooting a complicated networking issue, and the next could be spent writing a tool to verify that all of our servers are responding efficiently. It brings a smile to my face every time I get a friend request from an old friend I’d previously lost touch with, because I know that my hard work is worth something meaningful to millions and millions of people around the world. Facebook is truly a fast-paced, dynamic environment, yet offers the freedom to operate and do what is necessary to make things better. This is best summed up by example: at the end of my first week as an SRE, I had already investigated and corrected a troublesome issue that had been plaguing the team. It was gratifying to see myself having an impact in such a short span of time.

As Facebook continues to grow, we are always looking to expand our team with talented, motivated people who believe in what we do and who are eager to jump in and help us face our future challenges. If this sounds like you, take a look at our SRE job description; we would love for you to join our team!

Mark Schonbach is balancing traffic between data centers while sitting in traffic on Interstate 280 on the Facebook shuttle.