How much storage would be required to store a human ...

文章推薦指數: 80 %
投票人數:10人

The 2.9 billion base pairs of the haploid human genome correspond to a maximum of about 725 megabytes of data, since every base pair can be ... Home Public Questions Tags Users Collectives ExploreCollectives FindaJob Jobs Companies Teams StackOverflowforTeams –Collaborateandshareknowledgewithaprivategroup. CreateafreeTeam WhatisTeams? Teams CreatefreeTeam CollectivesonStackOverflow Findcentralized,trustedcontentandcollaboratearoundthetechnologiesyouusemost. Learnmore Teams Q&Aforwork Connectandshareknowledgewithinasinglelocationthatisstructuredandeasytosearch. Learnmore Howmuchstoragewouldberequiredtostoreahumangenome? AskQuestion Asked 9years,10monthsago Active 8daysago Viewed 91ktimes 97 22 I'mlookingfortheamountofstorageinbytes(MB,GB,TB,etc.)requiredtostoreasinglehumangenome.IreadafewarticlesonWikipediaaboutDNA,chromosomes,basepairs,genes,andhavesomeroughguess,butbeforedisclosinganythingI'dliketoseehowotherswouldapproachthisissue. AnalternativequestionwouldbehowmanyatomsarethereinhumanDNA,butthatwouldbeofftopicforthissite. Iunderstandthatthiswillbeanapproximation,soI'mlookingfortheminimalvaluethatwouldbeabletostoreDNAofanyhuman. storagebioinformaticsdna-sequencegenetics Share Improvethisquestion Follow editedFeb3'19at15:06 Elijah 1,1561313silverbadges1919bronzebadges askedJan21'12at16:22 MilanBabuškovMilanBabuškov 56.4k4848goldbadges122122silverbadges177177bronzebadges 6 Asforthenumberofatoms,thisdependsonthecomposition.AandTaresmallermoleculesthanGandC.Thestructureofthemoleculeisthebeef,though,notitsatomiccomposition,sothisisn'treallyaveryusefulcalculation.(Forwhatit'sworth,e.g.theAmoleculeakadeoxyadenosineisC10H13N5O3so31atoms.) – tripleee Aug30'15at9:28 Seealsobiostars.org/p/5514 – OndraŽižka Dec2'15at1:59 Exceptforusersslayton,PaulAmstrongandrauchenallotheranswersgivenaredeadwronginitsessenceorfarfromcomplete.Intheanswersuser(failto)mentionedcompressionmethodsorispoorlyexplained.Seemyanswertoclarifythe4timesdownsizingofthegenomeasseeninmanyanswers. – ZF007 Mar1'18at10:43 I'mvotingtoclosethisquestionasoff-topicbecauseitisoff-topichere,shouldbeonbioinformatics.stackexchange.com – Chris_Rands Nov27'19at8:53 938Megabytescompressed.Hereisalinktoarepositorycontainingitinafilecalled:hg38.chromFa.tar.gz – SurpriseDog May27at3:23  |  Show1morecomment 11Answers 11 Active Oldest Votes 76 Ifyoutrustsuchthings,hereiswhatWikipediaclaims(fromhttp://en.wikipedia.org/wiki/Human_genome#Information_content): The2.9billionbasepairsofthehaploidhumangenomecorrespondtoa maximumofabout725megabytesofdata,sinceeverybasepaircanbe codedby2bits.Sinceindividualgenomesvarybylessthan1%from eachother,theycanbelosslesslycompressedtoroughly4megabytes. Share Improvethisanswer Follow answeredJan21'12at16:26 OliverCharlesworthOliverCharlesworth 257k2929goldbadges541541silverbadges659659bronzebadges 9 10 Justtoaddsomebiologicalcommentary,"haploid"heremeansonlyonecopyofeachchromosome.Thehumanreferenceassemblyishaploid(andamosaicofmultiplepeople).Anactualindividualgenomewillbediploid(2copiesofeachchromosome,exceptXandY)butagainonlyvariantbetweenthetwocopiesatasmallsubsetofsites. – AlexStoddard Jan23'12at19:58 16 Thoughtaboutitforaday,andrealizedthis:IfyoustoredsomebasecasehumanDNA,anysubsequenthuman'sDNAwouldonlyneedtobestoredasthediffbetweenitandthebasecase.ForsamesexexamplesDNAis99.9%thesame.Andacrosssexesit'slike98.5%. – CostaMichailidis May22'15at15:14 5 AlsoworthtorememberthatnotallinformationencodedwithinDNAbasepairsthereisalsoepigeneticinformation. – Annarfych Jun19'17at6:45 1 thismakessense.basepairsarebasically4-nary.a4-narynumberis2bits,sodoublethesize.sothat's5.8gigabitsor5.8/8gigabyteswhichis0.725GBor725MB.the'compression'isonlypossiblebecauseyoucanstoreadiffagainstthemappedgenomeinsteadofstoringyourentiregenome. – DaveCousineau Oct2'17at4:49 2 @cowlinatorThesedefinitionsare…bad.“Heritable”inthiscasemeans“heritable”betweendividingmotheranddaughtercells,notheritablebetweenmulti-cellularorganismsandtheiroffspring(thatwouldbetransgenerationalepigeneticinheritance,whichexistsbutisincrediblyrare,andmostclaimedcasesofitarebasedonbadscienceandaregenerallynotacceptedbyexperts).Butthepersonwhowrotethatsentenceisprobablynotentirelyclearonwhattheymean,becausethere’snoexcuseforthesentence’sbadphrasing.Checkoutthethe“talk”pageoftheWikipediaarticle. – KonradRudolph Jun18'20at21:59  |  Show4morecomments 27 YoudonotstorealltheDNAinonestream,rathermostthetimeitisstorebychromosomes. Alargechromosometakeabout300MBandasmalloneabout50MB. Edit: Ithinkthefirstreasonwhyitisnotsavedin2bitsperbasepairisthatitwouldcauseanhurdletoworkwiththedata.Mostofthepeoplewouldnotknowhowtoconvertit.Andevenwhenaprogramforconversionwouldbegiven,alotofpeopleinlargecompaniesorresearchinstitutesarenotallowedto/needtoaskordonotknowhowtoinstallprograms... 1GBstoragecostsnothing,eventhedownloadof3GBtakesonly4minuteswith100Mbitspsandmostcompanieshavefasterspeeds. Anotherpointisthatthedataisn'tassimpleasyougettold. e.g.ThemethodforsequencinginventedbyCraig_Venterwasagreatbreakthroughbuthasitsdownsides.Itcouldnotseparatelongchainsofthesamebasepair,soitisnotalways100%clearifthereare8A'sor9A's.Thingsyouhavetotakecareoflateron... AnotherexampleistheDNAmethylationbecauseyoucan'tstorethisInformationina2-bitrepresentation. Share Improvethisanswer Follow editedFeb3'19at1:08 CommunityBot 111silverbadge answeredJan21'12at16:32 rauschenrauschen 3,76622goldbadges1111silverbadges1313bronzebadges 5 2 +1fromme.However,Ihavenocluewhatdoes"large"or"small"chromosomemean? – MilanBabuškov Jan23'12at9:54 2 Thesenumbersdon'ttallywithwhatWikipediasays(seethetableaten.wikipedia.org/wiki/Human_genome#Information_content);I'mnotsayingyou'rewrong,butcanyouexplainthediscrepancy? – OliverCharlesworth Jan23'12at11:25 ItlookslikeheisquotingMbp(millionofbase-pairs,eachbase-pairbeingasinglepositioninthegenome)ratherthanMBwhichcanassumea2-bitencodingofeachposition – AlexStoddard Jan23'12at20:02 1 Someofagenome'sDNAmethylationchangesoverthelifetimeoftheorganism.IncludingDNAmethylationdataforahumangenomewouldbemorelikeadetailedsnapshotofapersonataparticularmoment,ratherthanagenericdescriptionoftheindividual.Although,theOPdidn'tspecifywhichtheywanted. – cowlinator Jun18'20at19:53 Whywouldyoustorethewholethingforeveryindividual?99%ofDNAisthesamebetweenhumanssoyouwouldonlyhavetostoreeachperson'sdeviationsfromtheaverage. – SurpriseDog May27at3:17 Addacomment  |  15 Basically,eachbasepairtakes2bits(youcanuse00,01,10,11forT,G,C,andA).Sincethereareabout2.9billionbasepairsinthehumangenome,(2*2.9billion)bits~=691megabytes. I'mnoexpert,however,theHumanGenomepageonWikipediastatesthefollowing: RawMB: Male(XY):770MB Female(XX):756MB I'mnotsurewheretheirvariancecomesfrom,butI'msureyoucanfigureitout. Share Improvethisanswer Follow answeredJan21'12at16:33 PaulArmstrongPaulArmstrong 6,45811goldbadge1919silverbadges3434bronzebadges 9 6 Realistically,morethan2bitsarerequired,asthereareotherbasesstoredinsequenceinformation(N,forexample,wheredataisnotmappableandthereforeunknown).TheIUPACnucleotidecodesincludemorethanthestandardfour,andthiscanincreasestorageoverhead.ebi.ac.uk/2can/tutorials/aa.html – AlexReynolds Jan30'12at8:37 @AlexReynoldsbrokenlink:/ – o0'. May1'15at13:13 3 @AlexReynolds@o0'bioinformatics.org/sms2/iupac.htmlisabetterlinkforthoseIUPACcodes.AIUI,aparticulargenome"scan"needsmorethan2bitsduetoimprecision,thusRforeitherAorG,Nforanybase,.foragap,etc.Ifwecouldreadagenomeperfectly,itwouldbejust2bitsperbase. – skierpage Jan12'17at4:37 1 TheXchromosomeissingleforfemales.MaleshaveasextratheYchrom.tobecoded,whichasweallknowdistinctfromXcrhom. – ZF007 Mar1'18at10:06 ItalsodependsonhowyoudefineMegabyte:binary2^20ormetric10^6bytes.Youusebinary,soyournumberislower. – il--ya Jul6'18at19:01  |  Show4morecomments 10 Yes,theminimumRAMneededforwholehumanDNAisabout770MB. However,the2-bitrepresentationisimpractical.Itishardtosearchthroughordosomecomputationsonit.Thereforesomemathematiciansdesignedmoreeffectivewaytostorethosesequenciesofbases...andusetheminsearchingandcomparationalgorithmssuchasforexampleGARLI(www.bio.utexas.edu/faculty/antisense/garli/garli.html). ThisapplicationrunsonmyPCrightnow,soIcansaytoYou...thatitpracticallyhastheDNAstoredinabout:1563MB. Share Improvethisanswer Follow editedMay26'20at14:05 cowlinator 4,73844goldbadges2929silverbadges4545bronzebadges answeredJan25'14at21:20 FilipOvertoneSingerRydloFilipOvertoneSingerRydlo 29044silverbadges1111bronzebadges Addacomment  |  4 justdidittoo.therawsequenceis~700MB.ifoneusesafixedstoragesequenceorafixedsequencestoragealgoritm-andthefactthatthechangesare1%icalcuated~120MBwithaperchromosome-sequenceoffset-statedeltastorage.that'sitforthestorage. Share Improvethisanswer Follow editedMar14'14at14:19 answeredMar14'14at14:03 betheguestbetheguest 4122bronzebadges Addacomment  |  3 Thereare4nucleotidebasesthatmakeupourDNAtheseareA,C,G,TthereforeforeachbaseintheDNAtakesup2bits.Therearearound2.9billionbasessothatsaround700megabytes.Theweirdthingisthatwouldfillanormaldatacd!coincidence?!? Share Improvethisanswer Follow answeredApr24'12at23:38 MatthewMcGuinnessMatthewMcGuinness 4566bronzebadges 0 Addacomment  |  3 Mostanswersexceptusersslayton,rauchen,PaulAmstrongaredeadwrongifitsaboutpurestorageone-on-onewithoutcompressiontechniques. Thehumangenomewith3Gbofnucleotidescorrespondwith3Gbofbytesandnot~750MB.Theconstructed"haploid"genomeaccordingtoNCBIiscurrently3436687kbor3.436687Gbinsize.Checkhereforyourself. Haploid=singlecopyofachromosome. Diploid=twoversionsofhaploid. Humanshave22uniquechromosomesx2=44. Male23rdchromosomeisX,Yandmakes46intotal. Females23rdchrom.isX,Xandthusmakes46intotal. Formalesitwouldbe23+1chromosomeindatastorageonaHDDandforfemales23chromosomes,explainingthelittledifferencesmentionednowandtheninanswers.TheXchrom.frommalesisequaltoXchrom.fromthefemales. Thusloadingthegenome(23+1)intomemoryisdoneinpartsviaBLASTusingconstructeddatabasesfromfasta-files.Regardlessofzippedversionsornotnucleotidesarehardlytobecompressed.Backintheearlydaysoneofthetricksusedwastoreplacetandemrepeats(GACGACGACwithshortercodinge.g."3GAC";9byteto4byte).Thereasonwastosaveharddrivespace(areaofthe500bm-2GBHDDDplatterswith7.200rpmandSCSIconnectors).Forsequencesearchingthiswasalsodonewiththequery. If"codednucleotide"storagewouldbe2-bitperletterthenyougetforabyte: A=00 C=01 G=10 T=11 Onlythiswayyoufullyprofitfrompositions1,2,3,4,5,6,7and8for1byteofcoding.Forexamplethecombination00.01.10.11(asbyte00011011)wouldthencorrespondfor"ACTG"(andshowinatextfileasanunrecognizablecharacter).Thisaloneisresponsibleforafourtimesreductioninfile-sizeasweseeinotheranswers.Thus3.4Gbwillbedownsizedto0.85917175Gb...~860MBincludingathenrequiredconversionprogram(23kb-4mb). But...inbiologyyouwanttobeabletoreadsomethingthuscompressiongzippedismorethanenough.Unzippedyoucanstillreadit.Ifthisbytefillingwasuseditbecomeshardertoreadthedata.That'swhyfasta-filesareplain-textfilesinreality. Share Improvethisanswer Follow editedDec18'19at10:14 answeredMar1'18at10:30 ZF007ZF007 3,38588goldbadges2929silverbadges4242bronzebadges 9 2 Youcanaswellstoreitasapictireoraudiorecording,orevenvideo-anditwilltaketerabatestostore.Butthisisnotrequiredandminimal,asitwasasked. – il--ya Jul6'18at18:51 @il--ya...I'mmissingthepointyoutrytomake...(Iguessyoulikemovingaround250kmofTDKtape..weighing600kgandtakesthreehourstorewind)? – ZF007 Jul9'18at14:11 2 Thepointis,that1outof4basepairsarecodedwith2bitsofinformation.Thisishowmuchdataisrequiredtocodeit-youcannotcodewithless.Butyoumaychoosetocodeitinadifferentway:youmayuseawholebyte,ordrawapicturewhichtakesfewkB,ormakeanaudiorecording.Allthiswouldstillallowtostorerequiredinformation,butthatwouldnotberequiredorminimalcoding.Youarbitrarilyimposedreadabilitycriteria(usingstandardtexteditor),whichisnotwhatwasaskedinoriginalquestion. – il--ya Jul11'18at11:09 Thatisunfortunatelynothowitworksinbiology.Themethodofcommunicationbetweenscientistsiseitherverbally,paperortextfile-formatsthatcaneasilybereadfromascreen.Inthecaseyouhaveonebase-pairs,fillingabytewithzerosoroneswillsuffice.However,thereare4bases(2pairs).Inabyteyouhave4positionsforabasepairand4positionsthatindicatethetypeofbasepair.Data-compressionworksbuthumansneedreadability.AsinglepixelinRGBcode(3valuesandanintensityvalue)uses32byte.Mere8bitsforaletter.ThusnopointtomakeitaMonaLisa,right? – ZF007 Jul19'18at6:40 6 ZF007,youmissedmypointaboutminimality.Thequestionwas:"HowmuchmemorywouldberequiredtostorehumanDNA?"withfurtherdetail"...I'mlookingforminimalvaluethatwouldbeabletostoreDNAofanyhuman."Youaretryingtoansweradifferentquestion,namely"HowmuchmemorywouldittaketostorehumanDNAinareadableformusedbybiologiststocommunicategenomedata?"ifyoucompressthereadabletextdatawithgoodcompressionalgorithm,thatwillbringitssizewellbelow2bitsperbasepair. – il--ya Jul20'18at10:12  |  Show4morecomments 3 Thehumangenomecontainsover3billionbasepairs.Soifyourepresentedeachbasepairastwobitsthenitwouldtakeover6.15×10⁹bitsorapproximately770MB. Share Improvethisanswer Follow editedNov20at22:16 Tikolu 2511silverbadge88bronzebadges answeredJan21'12at16:26 slaytonslayton 19.9k88goldbadges5858silverbadges8787bronzebadges 6 bits~=bytes.2.9billionbitsisaround350MB – SDGuero Apr22'14at23:01 5 @SDGuero,base-pairsarebase4notbase2,soyouneedatleast2bitstorepresentabasepair. – slayton Apr24'14at13:41 BSonthebitlingo...eachnucleotidebaseis1characterandthus1byte,regardlessofcharacterconversiontable(AscII,UTF-8,etc)used;notincluding2byteAsiancoding. – ZF007 Mar1'18at10:10 3 @zf007BasepairsarerepresentedbytheTOKENSofa,c,gandt.Atokenisnotthesameasacharacter.Thereisnoreasonacan'tbeencodedas00,cas01,gas10andtas11 – MatBailie Dec18'19at2:01 2 There'sthediscrepancy;you'reassertingtheneedforahumanreadablefile,whichisnotintheoriginalpost. – MatBailie Dec19'19at11:46  |  Show1morecomment 0 AllanswersareleavingoffthefactthatnuDNAisnottheonlyDNAthatdefinesahumangenome.mtDNAisalsoinheritedanditcontributesanadditional16,500basepairstoahumangenome,bringingitmoreinlinewiththeWikipediaguessof770MBformales,and756MBforfemales. Thisdoesnotmeanthatahumangenomecaneasilybestoredonan4GBUSBstick.Bitsdonotrepresentinformationbythemselves,itisthecombinationofbitsthatrepresentinformation.SointhecaseofnuDNAandmtDNA,thebitsareencoded(nottobeconfusedwithcompressed)torepresentproteinsandenzymesthatinthemselveswouldrequiresmanyMBsofrawdatatorepresent,especiallyintermsoffunctionality. Foodforthought:80%ofthehumangenomeiscalled"non-coding"DNA,sodidyouactuallyreallybelievethattheentirehumanbodyandbraincanberepresentedinamere151to154MBsofrawdata? Share Improvethisanswer Follow answeredFeb17'19at15:00 ar18ar18 33522silverbadges55bronzebadges Addacomment  |  -3 Onebase--T,C,A,G(inthebase-4numbersystem:0,1,2,3)--isencodedastwobits(notone),soonebasepairisencodedbyfourbits. Share Improvethisanswer Follow answeredApr29'18at5:14 HenryKONormanHenryKONorman 1 5 2 Exceptthatbasesinapaircompplementeachother,sodon'taddanyinformation.Sobothbaseandbasepaircanbeencodedwithtwobits. – il--ya Jul6'18at18:43 Ifyouhavean"A"whatdoyoucomplementitwith?"AC""AG""AT"areallvalid.Likewise,ifyouhave"T"the"TG""TC""TA"arevalid,Sowhatdoyoudo? – RogerJohansson Nov1'18at12:40 1 @RogerJohanssonNo,onlythe“AT”basepairisvalidinDNA.Likewisefor“TA”,“CG”and“GC”.Nootherbasepaircombinationexists. – KonradRudolph Feb18'19at9:47 @KonradRudolphthereareatleastninepurines(en.wikipedia.org/wiki/Purine).AllofthemcanbeusedtosubstituteAorG.ThiswouldmakethesolutiontoOP'squestionmorecomplex.IagreetokeepitsimpleandsticktoA,G,TandC. – ZF007 Jul2'19at6:49 1 @ZF007Theyexistbuttheydonotoccurstablyinhumangenomesandarethereforenotrelevantforgenomestorage.Theirbiologicalrelevanceisimportantonlyinthecontextofmutations(andthereonlytransiently)andRNAmodifications.Inparticular(inthecontextofthisanswer),genomicdataisn’tstoredas“basepairs”,it’sstoredasasequenceofsinglebases,andeachpositioncanbeencodedintwobits.Thisisn’ttheoretical,thisishowit’sactuallydone(exceptthat,formostapplications,geneticdataisstoredin(gzipped)ASCII,notbit-compressed). – KonradRudolph Jul2'19at9:49 Addacomment  |  -4 Thereisonly2typesofbasepairs,CytosinecanonlybindtoGuanine,andAdeninecanonlybindtothymine, Soeachbasepaircanbeconsideredasinglebit. ThismeansthatanentirestrandofHumanDNA~3billion"Bits"wouldberightaround~350megabytes. Share Improvethisanswer Follow answeredMay18'17at20:56 TheLinuxFanboyTheLinuxFanboy 1 1 3 Youhave2typesofpairs,andtheycanbeintwodirections-soyouneedtwobitsforeachpair.Thisiswhymostpostsabovewrite~700MB,andnot350MB. – Trondster Oct23'17at7:52 Addacomment  |  Highlyactivequestion.Earn10reputation(notcountingtheassociationbonus)inordertoanswerthisquestion.Thereputationrequirementhelpsprotectthisquestionfromspamandnon-answeractivity. Nottheansweryou'relookingfor?Browseotherquestionstaggedstoragebioinformaticsdna-sequencegeneticsoraskyourownquestion. TheOverflowBlog Thefourengineeringmetricsthatwillstreamlineyoursoftwaredelivery FeaturedonMeta Reducingtheweightofourfooter TwoBornottwoB-Farewell,BoltClockandBhargav! Related 1 GeneticProgramminghelp 1 efficientdiskstorageofdecimalnumbersinC(C89) 0 HowtoreadaDNAsequencefromatextFileandstoreitinanarrayinC? 2 perlregex:howtogetopen-reading-frameswithoutinternalstop-codons? 2 afastwaytogethumangenomesequencebycoordinate 0 Howtousebedtoolscoveragetoassessgenomeassembly? 0 Scalablewebhostingsolution HotNetworkQuestions Whatisthelargestnumberofstabilizersapurestatecanhave? Rsyncfolderswithnamesthatbeginwithasingleand/ordouble-dash Whyaren'tlegamputationsusuallydonerightattheknee-joint?(turkey-legstyle) CustomPythonExpressionfunctiondefinedasmacro WhatsizearethemosaicsofJustinianandTheodora? Whattolookforinafirsttelescopeforachild? HowcanIdetecttheUSB-Cpowerdeliveryvoltage? Dowehaveanycharacterlimitforthemessagewedisplayintoastpopupusingforce:showToast? DidIcheatonanexambyknowingasolutioninadvance? Aligninsideenumerate Hasionpropulsioneverbeenusedinadeepspacetrajectorycorrectionmaneuverproper? MultiplestaccatodotsonminimwithtremolorepeatinLilypond Whatdoestheword"spring"meanin"aspringofactivity"and"aspringofsuffering"? WastheHarcourtCOVID-19isolatepapereverpublished? SimulatingsolarsystemwithNewtonslaw Distributionoftheexponentialofanexponentiallydistributedrandomvariable? Isthereareasonwhygiantmechshaveopticsthesizeofapersoninsteadof'normal'sizedones? DoIneedtoswitchpowerandgroundonUSB3.0? Whatkindofaltimeterareusedinmodernairliners? Wouldauthorsbetoopowerfulforyourtypicalmedievalfantasysetting? IsittruethatRecklessAttackrendersACboostslesseffective? ModularTransformationinTikZ HowtodealwithaPhDsupervisorthatactslikeacompanymanager? Findshortestpathbetweentwoverticesthatusesatmostonenegativeedge morehotquestions Questionfeed SubscribetoRSS Questionfeed TosubscribetothisRSSfeed,copyandpastethisURLintoyourRSSreader. Yourprivacy Byclicking“Acceptallcookies”,youagreeStackExchangecanstorecookiesonyourdeviceanddiscloseinformationinaccordancewithourCookiePolicy. Acceptallcookies Customizesettings  



請為這篇文章評分?