Site Reliability Engineering at Facebook
By Mark Schonbach

I recently returned from visiting family and friends in Delaware, and I was asked by everyone, “What do you do at Facebook?” The best answer that I could give them without launching into a 45-minute technical discussion is that: “I’m responsible for making sure that Facebook is up at all times and performing at its peak.” Obviously, I don’t run one of the largest sites in the world by myself; I’m part of a small team of Site Reliability Engineers (SRE) that works day and night to ensure that you and the other 400+ million users around the world are able to access Facebook, that the site loads quickly, and all of the features are working.

Our Site Reliability Engineering team currently consists of teams in Palo Alto, London and our brand new Dublin, Ireland office. At Facebook, we are very proud of our level of engineering impact with over 1.2 million users per engineer, but we are even more proud of the fact that we keep Facebook up and running with one SRE for every 18 million users. That level of impact is unrivaled compared to other technology companies.
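For a sense of scale, the two ratios quoted here can be turned into rough headcounts. This is only a back-of-the-envelope sketch: it assumes "400+ million" means exactly 400 million users, and the names below are illustrative, so the results are approximations implied by the quoted figures rather than reported team sizes.

```python
# Back-of-the-envelope estimate from the ratios quoted in the post.
# Assumption: "400+ million" is treated as exactly 400 million users.
USERS = 400_000_000
USERS_PER_ENGINEER = 1_200_000  # "over 1.2 million users per engineer"
USERS_PER_SRE = 18_000_000      # "one SRE for every 18 million users"


def implied_headcount(users: int, users_per_person: int) -> int:
    """Return the implied number of people, rounded to the nearest whole person."""
    return round(users / users_per_person)


implied_engineers = implied_headcount(USERS, USERS_PER_ENGINEER)
implied_sres = implied_headcount(USERS, USERS_PER_SRE)

print(f"Implied engineers: about {implied_engineers}")
print(f"Implied SREs: about {implied_sres}")
```

Under that assumption the figures work out to roughly 333 engineers and 22 SREs. Since the post says "over" and "400+", these are order-of-magnitude estimates only.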
The work that the Site Reliability Engineering team does can best be summarized this way:

Site: Does it work? Facebook’s SRE team is tasked with making sure the site is up and running 24 hours per day, 365 days per year. To support a global user base, we keep a watchful eye on our various internal and external monitoring tools and systems so you’re able to connect and share with friends and family regardless of whether it is noon in New York or midnight in Manila. We manage our high traffic load by balancing the user experience against our available worldwide capacity. The SRE team is empowered with the knowledge and responsibility to fix just about any operational issue we may encounter, problem solve with other Technical Operations and Engineering teams as appropriate, and follow any issue through to its completion.

Reliability: Does it work well? Facebook’s dirty little secret is that behind the scenes, our infrastructure is extraordinarily complex. While it is extremely rare that the entire site is offline, it is more common that one feature is temporarily unavailable. The SRE team works tirelessly to ensure that not only is the core of Facebook up and running, but also that you can use all of the features of the site, e.g. photo uploads, chat, and Facebook Connect. Even though we work directly with key partners and developers to make sure their applications are working well, we don’t get any special perks for our farms and mafias. We also work with the Release Engineering team to coordinate scheduled and emergency code updates and understand what is being changed and how it could affect the site.

Engineering: Could it work better? We always have one eye looking towards the future. We regularly hack tools on the fly that help us manage and perform complex maintenance procedures on one of the largest, if not the largest memcached footprints in the world. We develop automated tools to provision new servers, reallocate existing ones, and detect and repair applications or servers that are misbehaving. We are only able to maintain such a high user to server ratio due to a knowledgeable and experienced set of engineers. We also track performance issues and look at long-term trends to correct issues and look for ways to make Facebook run even faster and more efficiently.
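The post does not describe how its detection tools work, so the following is only a minimal sketch of the general idea of flagging misbehaving servers. It assumes each server can be checked by a caller-supplied probe function; the host names, the probe functions, and the 0.05-second threshold are all hypothetical and stand in for whatever a real fleet would use.

```python
import time
from typing import Callable, Dict, List

# Hypothetical threshold: a probe slower than this is reported as "slow".
SLOW_AFTER_SECONDS = 0.05


def check_server(probe: Callable[[], None], slow_after: float = SLOW_AFTER_SECONDS) -> str:
    """Run one probe and classify the server as 'ok', 'slow', or 'down'."""
    start = time.monotonic()
    try:
        probe()
    except Exception:
        # Any failure to answer the probe is treated as the server being down.
        return "down"
    elapsed = time.monotonic() - start
    return "slow" if elapsed > slow_after else "ok"


def check_fleet(probes: Dict[str, Callable[[], None]]) -> Dict[str, str]:
    """Probe every host once and return a host -> status report."""
    return {host: check_server(probe) for host, probe in probes.items()}


def needs_attention(report: Dict[str, str]) -> List[str]:
    """Return the hosts whose status is anything other than 'ok', sorted by name."""
    return sorted(host for host, status in report.items() if status != "ok")


# Stand-in probes (hypothetical); a real tool would issue a network request here.
def healthy_probe() -> None:
    pass


def slow_probe() -> None:
    time.sleep(0.2)


def broken_probe() -> None:
    raise ConnectionError("no response")


report = check_fleet({
    "web001": healthy_probe,
    "web002": slow_probe,
    "cache001": broken_probe,
})
print(report)
print("Needs attention:", needs_attention(report))
```

Passing the probe in as a function keeps the classification logic separate from how a server is actually contacted, so the same check could wrap an HTTP request, a memcached command, or any other test.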
After I attempt to explain what I do, the next question I am usually asked is, “What do you like most about your job?” Aside from the awesome food every day and the amazingly talented people I work with, the thing I like most about being an SRE is that I never know what I am going to encounter when I arrive to work in the morning. One day could involve spending hours troubleshooting a complicated networking issue, and the next could be spent writing a tool to verify that all of our servers are responding efficiently. It brings a smile to my face every time I get a friend request from an old friend I’d previously lost touch with, because I know that my hard work is worth something meaningful to millions and millions of people around the world. Facebook is truly a fast-paced, dynamic environment, yet offers the freedom to operate and do what is necessary to make things better. This is best summed up by example: at the end of my first week as an SRE, I had already investigated and corrected a troublesome issue that had been plaguing the team. It was gratifying to see myself having an impact in such a short span of time.

As Facebook continues to grow, we are always looking to expand our team with talented, motivated people who believe in what we do and who are eager to jump in and help us face our future challenges. If this sounds like you, take a look at our SRE job description; we would love for you to join our team!

Mark Schonbach is balancing traffic between data centers while sitting in traffic on Interstate 280 on the Facebook shuttle.