Site Reliability Engineering at Facebook
By Mark Schonbach
I recently returned from visiting family and friends in Delaware, and I was asked by everyone, “What do you do at Facebook?” The best answer that I could give them without launching into a 45-minute technical discussion is that: “I’m responsible for making sure that Facebook is up at all times and performing at its peak.” Obviously, I don’t run one of the largest sites in the world by myself; I’m part of a small team of Site Reliability Engineers (SRE) that works day and night to ensure that you and the other 400+ million users around the world are able to access Facebook, that the site loads quickly, and all of the features are working.
Our Site Reliability Engineering team currently consists of teams in Palo Alto, London and our brand new Dublin, Ireland office. At Facebook, we are very proud of our level of engineering impact with over 1.2 million users per engineer, but we are even more proud of the fact that we keep Facebook up and running with one SRE for every 18 million users. That level of impact is unrivaled compared to other technology companies.
The work that the Site Reliability Engineering team does can best be summarized this way:
Site: Does it work? Facebook’s SRE team is tasked with making sure the site is up and running 24 hours per day, 365 days per year. To support a global user base, we keep a watchful eye on our various internal and external monitoring tools and systems so you’re able to connect and share with friends and family regardless of whether it is noon in New York or midnight in Manila. We manage our high traffic load by balancing the user experience against our available worldwide capacity. The SRE team is empowered with the knowledge and responsibility to fix just about any operational issue we may encounter, problem solve with other Technical Operations and Engineering teams as appropriate, and follow any issue through to its completion.
Reliability: Does it work well? Facebook’s dirty little secret is that behind the scenes, our infrastructure is extraordinarily complex. While it is extremely rare that the entire site is offline, it is more common that one feature is temporarily unavailable. The SRE team works tirelessly to ensure that not only is the core of Facebook up and running, but also that you can use all of the features of the site, e.g. photo uploads, chat, and Facebook Connect. Even though we work directly with key partners and developers to make sure their applications are working well, we don’t get any special perks for our farms and mafias. We also work with the Release Engineering team to coordinate scheduled and emergency code updates and understand what is being changed and how it could affect the site.
Engineering: Could it work better? We always have one eye looking towards the future. We regularly hack tools on the fly that help us manage and perform complex maintenance procedures on one of the largest, if not the largest memcached footprints in the world. We develop automated tools to provision new servers, reallocate existing ones, and detect and repair applications or servers that are misbehaving. We are only able to maintain such a high user to server ratio due to a knowledgeable and experienced set of engineers. We also track performance issues and look at long-term trends to correct issues and look for ways to make Facebook run even faster and more efficiently.
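The post does not say how Facebook's detection tooling is built. Purely to illustrate the idea of sweeping a fleet and flagging servers that are slow or unresponsive, here is a minimal Python sketch. Every name in it (`classify`, `sweep`, `needs_attention`, `fake_probe`, the `web00x` hosts) and the 200 ms threshold are assumptions made up for this example, not anything taken from Facebook's systems.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class ProbeResult:
    host: str
    latency_ms: Optional[float]  # None means the probe failed outright
    status: str                  # "ok", "slow", or "down"


def classify(host: str, probe: Callable[[str], float],
             slow_ms: float = 200.0) -> ProbeResult:
    """Run one probe and label the host by its response time."""
    try:
        latency = probe(host)
    except Exception:
        # Any probe failure is treated as the host being down.
        return ProbeResult(host, None, "down")
    status = "ok" if latency <= slow_ms else "slow"
    return ProbeResult(host, latency, status)


def sweep(hosts: Iterable[str], probe: Callable[[str], float],
          slow_ms: float = 200.0) -> List[ProbeResult]:
    """Probe every host in order and collect the results."""
    return [classify(host, probe, slow_ms) for host in hosts]


def needs_attention(results: Iterable[ProbeResult]) -> List[str]:
    """Return the hosts that are slow or down, in sweep order."""
    return [r.host for r in results if r.status != "ok"]


# Hypothetical stand-in for a real network probe, so the sketch runs offline.
_FAKE_LATENCIES = {"web001": 35.0, "web002": 450.0}


def fake_probe(host: str) -> float:
    if host not in _FAKE_LATENCIES:
        raise ConnectionError(f"no response from {host}")
    return _FAKE_LATENCIES[host]


if __name__ == "__main__":
    results = sweep(["web001", "web002", "web003"], fake_probe)
    print(needs_attention(results))
```

In a real tool the probe would be an actual request with a timeout, and the list returned by `needs_attention` would feed whatever repair or reallocation step the team uses. Keeping the probe as a plain function argument makes the classification logic easy to test without touching a network.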
After I attempt to explain what I do, the next question I am usually asked is, “What do you like most about your job?” Aside from the awesome food every day and the amazingly talented people I work with, the thing I like most about being an SRE is that I never know what I am going to encounter when I arrive to work in the morning. One day could involve spending hours troubleshooting a complicated networking issue, and the next could be spent writing a tool to verify that all of our servers are responding efficiently. It brings a smile to my face every time I get a friend request from an old friend I’d previously lost touch with, because I know that my hard work is worth something meaningful to millions and millions of people around the world. Facebook is truly a fast-paced, dynamic environment, yet offers the freedom to operate and do what is necessary to make things better. This is best summed up by example: at the end of my first week as an SRE, I had already investigated and corrected a troublesome issue that had been plaguing the team. It was gratifying to see myself having an impact in such a short span of time.
As Facebook continues to grow, we are always looking to expand our team with talented, motivated people who believe in what we do and who are eager to jump in and help us face our future challenges. If this sounds like you, take a look at our SRE job description; we would love for you to join our team!
Mark Schonbach is balancing traffic between data centers while sitting in traffic on Interstate 280 on the Facebook shuttle.