Differential gene expression in disease: a comparison ...

Differential gene expression in disease: a comparison between high-throughput studies and the literature

Research article, Open Access, Published: 11 October 2017
Raul Rodriguez-Esteban (ORCID: orcid.org/0000-0002-9494-9609) 1 & Xiaoyu Jiang 2
BMC Medical Genomics volume 10, Article number: 59 (2017)

Abstract

Background
Differential gene expression is important to understand the biological differences between healthy and diseased states. Two common sources of differential gene expression data are microarray studies and the biomedical literature.

Methods
With the aid of text mining and gene expression analysis we have examined the comparative properties of these two sources of differential gene expression data.

Results
The literature shows a preference for reporting genes associated to higher fold changes in microarray data, rather than genes that are simply significantly differentially expressed. Thus, the resemblance between the literature and microarray data increases when the fold-change threshold for microarray data is increased. Moreover, the literature has a reporting preference for differentially expressed genes that (1) are overexpressed rather than underexpressed; (2) are overexpressed in multiple diseases; and (3) are popular in the biomedical literature at large. Additionally, the degree to which diseases are similar depends on whether microarray data or the literature is used to compare them. Finally, vaguely-qualified reports of differential expression magnitudes in the literature have only small correlation with microarray fold-change data.

Conclusions
Reporting biases of differential gene expression in the literature can be affecting our appreciation of disease biology and of the degree of similarity that actually exists between different diseases.

Background
Investigating the differences between diseased and healthy state helps us understand the pathology of diseases and, eventually, treat them. One particular focus of investigation is differentially-expressed genes (DEGs), which involves the identification of genes that are differentially expressed in disease. In pharmaceutical and clinical research, DEGs can be valuable to pinpoint candidate biomarkers, therapeutic targets and gene signatures for diagnostics. While particular gene expression changes may not always translate into consequential biological activity, such data can nonetheless be pooled with other biological data in a high-throughput fashion to create integrated analyses, such as building the target landscape of a disease [1, 2]. Our goal in this study was to compare two widely used sources of DEG information, namely high-throughput microarray expression studies and the scientific literature. For that purpose, we mined the scientific literature and analyzed microarray datasets on a set of diseases to study the similarities and differences of these two types of data within specific biological contexts.

In the scientific literature, information about DEGs is largely found in unstructured form and scattered across publications. It can appear in the form of gradable statements, which are statements that describe a measurement with respect to a baseline, scale or norm [3]. For example, the sentence “The expression of protectin was found to be decreased in the epithelium of patients with ulcerative colitis.” [4] compares the pathological expression of protectin to an implicit baseline, presumably the expression level of protectin in healthy state. Such a sentence describes a “negative regulation of gene expression” as defined by the Gene Regulation Ontology [5]. DEG information can also be found in non-gradable statements in which a comparison is implicit. For example, in the sentence “Expression of the COX-2 enzyme has been reported in animal models of inflammatory bowel disease (IBD) as well as in patients affected by ulcerative colitis and Crohn's disease.” [6] it is implied that there is lack of expression in wild-type animals and healthy patient tissue.

Statements about DEGs in the literature often lack detail or specificity, which is a challenge for human interpretation and for their automatic extraction by computers. Thus, they can refer indistinctly to protein or RNA [7], and use baselines that are not defined or vague. Vagueness is a general feature of natural language and is a special problem with gradable statements [8]. For example, in the sentence “Involucrin [...] is markedly increased in inflammatory skin diseases such as psoriasis.” [9], the magnitude implied by “markedly” is difficult to evaluate. Furthermore, the baseline of the statement is implicit, although it is probably the expression level in healthy skin tissue. Finally, the source of supporting evidence and experimental details, such as the technique employed, is missing in the sentence and in the article in which the sentence appears. Such a statement shows a low level of presented evidence as defined by [10].

In contrast to DEG information found in scientific text, microarray expression data typically appear in structured form in numerical datasets that cover thousands of genes and can be stored in repositories such as the Gene Expression Omnibus (GEO) [11] and ArrayExpress [12]. Such repositories allow the pooling of multiple datasets to create an aggregate view across different experimental settings [13]. While better organized than the literature, microarray expression datasets present their own challenges. The raw expression data from these datasets require processing and quality assessment [14], and resulting expression values convey a relative rather than an absolute measure. Thus, the analysis of microarray expression is usually restricted to identifying expression values with largest change between samples (e.g., [15]) or that change beyond a certain statistically-significant threshold or a fixed fold-change threshold. An important limitation of microarray expression studies is that they concern only mRNA and not protein, and in particular only whole-cell mRNA [16]. Therefore, they lack the detail and granularity of experimental methods, such as immunohistochemistry, that can describe detailed spatial distributions. Moreover, interpretation of microarray expression results is complicated by the natural variation that exists across biological samples, as well as by differences in technical settings across experiments and laboratories [17]. Finally, microarray expression datasets not stored in standard repositories can be hard to obtain.

The advent of gene expression measurement with RNA sequencing (RNA-seq) technology has affected the number of microarray studies being undertaken. However, in 2016, GEO still released 4945 array expression profiling series (“expression profiling by array”[DataSet Type] AND “gse”[Entry Type]), or about the same quantity released for high-throughput sequencing series (“expression profiling by high throughput sequencing,” n = 4894). Moreover, a large trove of microarray studies has been accumulating in GEO over time, with 49,026 array series available (search performed on 2017-2-14). While RNA-seq is increasingly favored for high-throughput expression analysis, modern microarray and RNA-seq platforms produce expression values that are highly correlated and each possesses its own technical advantages [18, 19].

Methods
For the text mining part of our study, there is no prior work focused specifically on DEGs in disease. The closest work concerns the extraction of population percentages of lymphoma tumors that show expression of a gene in immunohistochemistry experiments [20]. In that work, gene names were tagged using dictionary-matching and a set of rules was devised to identify sentences with potentially relevant information about gene expression. There have also been studies on identifying genes expressed in cell types [21] and anatomical locations [22], or in both [23]. The identification of sentences that describe gene expression, without any other contextual details, has also been addressed as part of more-general event extraction tasks [24, 25].

Our approach was to identify sentences from Medline abstracts that provide information of the type “X is differentially regulated in Y” with respect to healthy controls, with X being a gene and Y a disease. Such information can be mapped to vectors of the type (PMID, X, Y, Δ) where PMID is the corresponding PubMed ID of the abstract and Δ refers to the direction and magnitude of expression change between diseased and healthy states. The values that the vectors (PMID, X, Y, Δ) can take were based, in our case, on the content of the sentences identified, thus coreferences or information from the rest of the document were not considered except in cases of ambiguity in the gene name or anatomical location of the expression. In such cases this information could come from the rest of the abstract if appearing therein. Redundant (PMID, X, Y, Δ) statements were discarded. Typically, qualifier keywords and phrases (such as “overexpressed,” “decreased expression” and “greatly elevated”) helped determine the value of Δ. The set of possible values for Δ was defined to be the following: {high increase, increase, decrease, high decrease}. We tracked qualifiers that indicated a very large change in expression to assign the Δ values high decrease and high increase. For example, expression that was described as “greatly elevated” in disease was mapped to high increase, while “elevated” or “significantly overexpressed” was mapped to increase.

Since DEG information can be conveyed through text in many ways, we devised a generally-inclusive method based on tri-occurrence. We searched first for abstracts mentioning a disease Y and a gene X using disease and human gene annotations from NCBI’s PubTator (download 2016-01-25) [26]. Those abstracts were then split into sentences with the aid of the JULIE Sentence Boundary Detector [27]. For each sentence we detected whether there were mentions of gene X, disease Y (or abbreviation) and trigger word (or substring). The trigger words were selected after [22] to be the following: {express, production, produce, transcription, transcribe}. Finally, the resulting sentences were manually reviewed. Our goal was to produce a sample of sentences that represented an unbiased view of the literature. Undefined expression changes were not considered (e.g. expression described as altered/alteration, aberrant, abnormal, dysregulation, expressed differentially, modulated, discordant). Moreover, names indicating protein complexes or families of proteins or genes that could not be mapped to at most three genes were not considered.

Human microarray expression series relative to each disease were searched in GEO by using the corresponding disease names as keywords. To maintain consistency, all series selected were based on the same platform, Affymetrix Human Genome U133 Plus 2.0 Array (GPL570), and included both diseased tissue samples and normal samples. Unaffected-tissue samples, samples after drug treatment or samples from non-primary disease tissue (such as blood peripheral samples) were not considered. From the series found following these criteria, those with the largest sample size were prioritized. The series selected for in-depth analysis were: GSE36842 for atopic dermatitis, GSE36807 for Crohn’s disease, GSE13355 for psoriasis and GSE38713 for ulcerative colitis. DEGs were identified using the limma Bioconductor R package using Benjamini & Hochberg (false discovery rate) to correct for multiple testing and adjusted p-value < 0.05. The fold change (FC) in expression was used as a variable filter (cutoff) throughout the study. We did not identify any covariates that required batch correction. Boxplots and principal component analysis for each dataset are provided in the supplementary information (see Additional files 1, 2, 3 and 4). For calculating the positive likelihood ratio between microarray data and the literature we took into account only the subset of genes (HUGO gene symbols) shared by both PubTator and GPL570 (n = 17,126).

Results
The focus of our work was on four diseases: Crohn’s disease (CD), ulcerative colitis (UC), psoriasis (PS) and atopic dermatitis (AD). Their choice stemmed partially from their specificity to particular tissues: psoriasis and atopic dermatitis to the skin, Crohn’s disease and ulcerative colitis to the gastrointestinal tract. Another reason for their selection was our interest in exploring similar diseases that are often compared to each other, in our case the pairs PS-AD and UC-CD. We collected DEG statements from the literature and microarray datasets concerning these four diseases (see Methods), focusing only on the main affected tissues (e.g., we discarded serum measurements). We then compared the data reported in the literature with the information contained in microarray datasets.

Through our text mining approach, we created a sample of DEG statements coming from 200 Medline abstracts for AD, 308 for CD, 429 for PS and 273 for UC. These statements concerned 173 unique genes for AD, 240 for CD, 327 for PS and 285 for UC. (The text mining results are available as supplementary information, see Additional file 5.) The microarray datasets presented different quantities of DEGs depending on fold change (FC) filtering. For example, for |FC| >
2, 110 unique genes were differentially expressed in AD, 92 in CD, 998 in PS and 2339 in UC.

Overexpression is more reported than underexpression

As can be seen in Fig. 1 and Table 1, DEG reports favor overexpressed genes 3-4 times more than underexpressed genes. Intriguingly, the magnitude of this bias does not differ much between diseases. Microarray expression data shows no such systematic imbalance.

Fig. 1 Ratio of overexpressed vs. underexpressed unique DEGs in microarray datasets vs. the literature. |FC| > n indicates microarray DEGs with absolute fold change above n

Table 1 Percentage of overexpressed vs. underexpressed unique DEGs in microarray data and the literature. |FC| > n indicates microarray DEGs with absolute fold change above n

To simplify the discussion, the focus in the next sections is on overexpressed genes, for which there exist more data in the literature.

The reporting of high overexpression correlates with the reporting of overexpression and only weakly with microarray fold change

The more a gene is mentioned as overexpressed in a disease the more likely it will be mentioned as highly overexpressed (highly increased) in the same disease (Fig.
2). One potential explanation for this is that highly overexpressed genes are the focus of more scrutiny due to their presumed heightened biological relevance. There are examples of this phenomenon that can be observed in the literature. Such is the case of gene S100A7 in psoriasis, which first raised interest as a highly expressed gene in psoriatic skin [28]. On the other hand, it is also possible that overexpressed genes that are often studied end up being considered highly overexpressed as the result of sheer multiple testing. Using the data available in our study we used a simple linear model to disentangle this question:

$$\mathit{high\ increase\ mentions} \sim f\left(FC;\ \mathit{increase\ mentions}\right) = \alpha \cdot FC + \beta \cdot \mathit{increase\ mentions} + \gamma.$$ (1)

Fig. 2 Relation between overexpression mentions in the literature and the subset of those which are high increase. The figure shows the relation between gene overexpression mentions and mean number of high increase mentions for genes with up to nine overexpression mentions. Slope of the zero-y-intercept trendline is 0.21 and its associated r² is 0.89

Through this model we saw that, once we account for the fact that a gene has been mentioned as overexpressed in the literature, the microarray FC value still influences whether it will be described as highly increased or not. The α coefficient for the linear model varies from smallest for UC to largest for PS and is in all cases smaller than 0.01. Thus, vague high increase statements are only to a small degree linked to the FC value from microarray data.

To be reported as overexpressed a gene’s popularity is more important than its fold change

One way to see the relation between microarray FC and the literature is by looking at the probability that a gene will be reported as overexpressed for FC values above a certain threshold. As can be seen in Fig.
3 for the case of AD, the cumulative probability increases with FC, which means that genes associated to higher FCs are more likely to be reported as overexpressed in the literature.

Fig. 3 Cumulative probability of a gene being reported as overexpressed in AD given its microarray FC. The abscissa corresponds to microarray FC and the ordinate to the cumulative probability of a gene being reported as overexpressed when its associated microarray FC is above a certain value, p(overexpression in AD | FC in AD > x)

However, there is also a correlation between the frequency with which a gene is mentioned as overexpressed and its popularity in the overall biomedical literature, as can be seen in Table 2. Thus, genes that are reported as overexpressed in a disease tend to be popular in the biomedical literature at large. Microarray data FC, on the other hand, exhibit lower correlation with overexpression reporting or none. Only the PS and UC microarray datasets showed statistically significant correlation and of smaller magnitude than that associated to popularity.

Table 2 Pearson correlation coefficient (r) between a gene’s popularity (total number of mentions in the biomedical literature) and its overexpression in a disease according to the literature (0 = not mentioned, 1 = mentioned) or to microarray data (0 = not overexpressed, 1 = overexpressed)

To further test the relation between overexpression reports, microarray FC and popularity, we created a linear model in which overexpression mentions were a function of the variables log2 FC and popularity:

$$\mathit{increase\ mentions} \sim f\left(\log_2 FC;\ \mathit{popularity}\right) = \alpha \cdot \log_2 FC + \beta \cdot \mathit{popularity} + \gamma.$$ (2)

Both of these variables turned out to be significant for each disease, except in the case of PS, for which log2 FC was not significant. Thus, a gene’s chances to be mentioned as overexpressed can increase both with its microarray FC value and with its popularity in the general literature, but popularity has greater influence.

In terms of overexpression, the literature shows diseases to be more similar than microarrays do

As can be seen in Fig.
4, from the point of view of gene overexpression, similarities between any pair of diseases are generally higher in the literature than in microarrays. This can be quantified using the positive likelihood ratio (LR+) following the equation:

$$LR+\left(Y_1/Y_2\right) = \frac{p\left(\mathit{overexpression\ in}\ Y_1 \mid \mathit{overexpression\ in}\ Y_2\right)}{p\left(\mathit{overexpression\ in}\ Y_1 \mid \mathit{no\ overexpression\ in}\ Y_2\right)}$$ (3)

Fig. 4 Number of overexpressed genes for each disease. Number of overexpressed genes for each disease (a) as reported in the literature and (b and c) as appearing in microarray datasets (FC > 0 and FC > 2, respectively). The tables show the LR+ for genes overexpressed in one disease (table headers) that are overexpressed in another disease (row names) based on (d) the literature or (e) microarray data with FC > 0 or (f) FC > 2

For example, based on microarray data with FC > 0 cutoff, the LR+ for AD based on CD (AD/CD) is 1.6, which means that a gene is 1.6 times more likely to be overexpressed in AD when that gene is overexpressed in CD than when it is not. Meanwhile, in the literature, the value of LR+ (AD/CD) is 48.4, which is much higher. Thus, the literature is enriched for genes that are overexpressed in more than one disease, as can be seen in Fig. 4. Overall, for FC > 0, microarray datasets show LR+ values between 1 and 5 while the literature yields LR+ values between 32 and 110. For FC > 2, on the other hand, microarray data shows higher LR+ values, although still lower than those for the literature.

The differences in LR+ between the literature and microarray data are larger when it comes to genes reported to be overexpressed in three out of our four diseases. The LR+ for microarray data with FC > 0 ranges between 1 and 5 (mean ~ 2.6) while it ranges between 40 and 91 (mean ~ 64) for the literature. Finally, for genes overexpressed in all four diseases the LR+ for microarray data with FC > 0 ranges between 1 and 7 (mean ~ 3.4) while for the literature it ranges between 75 and 122 (mean ~
110). Naturally, certain disease pairs will share more overexpressed genes due to biological similarities. However, we found that the level of similarity between diseases differs depending on whether microarray data or the literature was considered. For example, taking microarray data with FC > 0 cutoff as a “true” baseline, the literature would be overstating the similarity of PS and AD the most, while the similarities between PS and CD would be the least emphasized. Thus, it is possible that the similarities between PS and CD have received insufficient attention (see for example [29]) in comparison to the similarities between PS and AD, if microarray data is to be used as guidance.

As the microarray fold-change cutoff increases, microarray data and the literature increase in resemblance

The LR+ can also help us determine further the relationship between overexpression in the literature and in microarrays. We can compute the LR+ of a gene being overexpressed in the literature when it is overexpressed in microarray data and vice versa. Our interest is in knowing whether the odds of a gene being overexpressed in one of the sources change when it is known to be overexpressed in the other source. Our finding was that the LR+ depends on the FC cutoff chosen. For example, the LR+ of microarray overexpression for FC > 0 given the literature (and vice versa) is not significant for AD and CD. For PS and UC the LR+ is significant and ranges between 1.5 and 4 (see Fig.
5). Thus, the information conveyed by these two sources can be quite distinct when choosing a FC > 0 cutoff. In the case showing highest LR+ (UC), the probability of a gene being overexpressed in the microarray dataset goes up from 21 to 50% when the literature states that it is overexpressed. The probability of a gene being overexpressed in the UC literature goes up from 0.09 to 0.34% when it is overexpressed in the microarray dataset.

Fig. 5 Positive likelihood ratio given microarray data and the literature. Positive likelihood ratio (LR+) of (a) microarray data given the literature and (b) the literature given microarray data for different values of log2 FC threshold and for each disease: AD (diamonds), CD (squares), PS (triangles), UC (crosses). The higher the LR+ the more likely one data source can predict another one

A different picture arises with increased FC thresholds, as can be seen in Fig. 5. The LR+ then increases substantially, which means that the literature becomes more related to microarray data as the FC threshold increases. This is probably due to the fact that, as has been already stated, the probability that a gene is mentioned as overexpressed in the literature increases with higher microarray FC.

Differences between microarray data and the literature translate into alternative views of the underlying disease biology

As could be expected, the differences that have been described between microarray and literature data translate into different representations of the pathological processes that characterize each disease. To measure this quantitatively, we looked at the level of enrichment of Gene Ontology (GO) functional classes associated to the genes overexpressed in microarray data and in the literature. Figure 6 shows the top 20 statistically overrepresented GO functional classes in microarray and literature data for UC based on the PANTHER statistical overrepresentation test with Bonferroni correction [30]. For UC and FC > 0, 16 functional classes were shared between the 38 overrepresented in the literature and the 36 overrepresented in the microarray dataset. For PS and FC >
0, on the other hand, only the “unclassified” functional class was shared between the 17 overrepresented in the microarray dataset and the 15 overrepresented in the literature.

Fig. 6 Statistically overrepresented Gene Ontology functional classes. Top-20 statistically overrepresented Gene Ontology functional classes based on overexpressed genes in the UC literature (left) and in the UC microarray dataset (right)

For FC > 2, the similarities between microarray data and the literature were greater. For UC there were 28 shared functional classes between the 47 overrepresented in the microarray dataset and the 38 overrepresented in the literature. For PS, there were 11 shared functional classes between the 28 overrepresented in the literature and the 18 overrepresented in the microarray dataset.

Discussion
Our goal was to explore the relationship between microarray expression data and the expression data reported in the literature because in our daily work both of these data sources are used as complementary sources of information. From the therapeutic point of view, for example, every DEG in disease is a potential point of intervention or target. Thus, the sole use of microarray data or of the literature could lead to missing out on potential targets that appear in one source and not the other. For instance, EGFR does not appear upregulated in the PS microarray dataset, while it is one of the most frequently mentioned upregulated genes in the PS literature dataset. On the other hand, defensin beta 4B (DEFB4B) does not appear in the PS literature dataset despite showing the second-highest level of overexpression in the PS microarray dataset.

Our strategy for gathering microarray data was to select one dataset for each disease of interest, each dataset created with the same platform to avoid variability across manufacturers. For literature data, our approach was to gather a representative sample of the literature, rather than to create an exhaustive representation. We, moreover, focused on abstracts, rather than on full text articles, due to limited full text availability. Thus, the true number of statements regarding differential expression in the literature is larger than what is reported here.

The fact that more literature results were oriented towards overexpression than underexpression, unlike in microarray data, indicates a scientific bias towards reporting overexpression. This bias could be related to the fact that most drugs are inhibitors and therefore an overexpressed gene is more likely to represent a potential target. Since, in principle, downregulation may have as much functional importance in disease as upregulation, this bias could be distorting our understanding of diseases.

We also noted that popular genes tend to be more often described in the literature as overexpressed in disease, an effect that is much milder or non-existent for overexpressed genes from microarray data. This could explain partially why differential expression similarities between diseases are higher within the literature in comparison to microarray data. The quest for higher research impact could be one of the drivers for the additional attention paid to popular genes [31, 32, 33], leading to further amplification of their presumed biological importance beyond actual biological evidence.

Our analysis also hints that our perception of the level of similarity between certain diseases could be biased by general properties of the diseases that are not reflected in the expression data. Thus, PS and AD, which share anatomical location, appear more similar in the literature than UC and AD, contrary to what is reflected in microarray data.

We also found that microarray data and the literature can produce divergent views of the pathological mechanisms driving diseases depending on the fold-change cutoff. For FC >
0, the functional classes associated to overexpressed genes in the literature can be very different from those associated to microarray data. As the threshold for FC increases, the similarity between the literature and microarray data increases, which is then reflected in higher LR+ values and overlapping functional classes.

One explanation for the divergences between microarray data and the literature comes obviously from the differences in experimental settings. Expression data from the literature stem from a variety of sources involving methods such as immunohistochemistry, flow cytometry, in situ hybridization, RT-PCR, next-generation sequencing, and also microarrays. Each of these sources differs in level of granularity and molecule measured (e.g. mRNA vs. protein). On the other hand, even though all microarray data in our study came from the same platform from the same manufacturer, and each dataset was created within a single research study, microarray data variability has been shown to be a challenge for reproducibility [34, 35, 36, 37]. Moreover, because experiments in the literature can be more fine-grained than microarray studies, it is possible that a gene might be found to be upregulated in some parts of a diseased tissue and downregulated in others, confounding the simplified representation used here and hampering comparisons with microarray data.

One additional aspect not considered in this study was the historical dimension. High-throughput techniques have been gaining in popularity only recently; therefore older publications would have been less affected by findings coming from high-throughput studies.

Conclusion
At the start of this study we had the expectation that there would be certain biases in the literature in comparison to microarray data. The literature evidently has a focus that is, at the very least, biased by past research history, which does not affect microarray data. Our goal was to quantify this bias, using microarray data as the unbiased “ground truth.” However, we did not expect that the relationship between microarray data and the literature could be dependent on FC cutoff (which in retrospect appears to be naïve), and therefore that we should not necessarily consider microarray data a ground truth that the literature only partially represents. The use of an FC threshold does not in principle have a fixed biological meaning and its link to biological activity can change from gene to gene. Moreover, different FC thresholds yield different outcomes from an expression study [38]. Based on our work, the literature has a closer connection with microarray expression data filtered with higher FC thresholds, which means that it may not track biological phenomena appropriately when the FC thresholds do not actually separate meaningful and non-meaningful expression changes.

Abbreviations
AD: Atopic dermatitis
DEFB4B: Defensin beta 4B
DEG: Differentially-expressed gene
FC: Fold change
GEO: Gene Expression Omnibus
GO: Gene Ontology
IBD: Inflammatory bowel disease
LR: Likelihood ratio
NCBI: National Center for Biotechnology Information
PCA: Principal component analysis
PMID: PubMed ID
PS: Psoriasis
RA: Rheumatoid arthritis
RNA-seq: RNA sequencing
UC: Ulcerative colitis

References
1. Loging W, Harland L, Williams-Jones B. High-throughput electronic biology: mining information for drug discovery. Nat Rev Drug Discov. 2007;6(3):220–30.
2. Campbell SJ, Gaulton A, Marshall J, Bichko D, Martin S, Brouwer C, Harland L. Visualizing the drug target landscape. Drug Discov Today. 2010;15(1-2):3–15.
3. Kennedy C. Comparatives, Semantics of. In: Brown K, editor. Encyclopedia of Languages and Linguistics. 2nd ed. Oxford: Elsevier; 2006.
4. Scheinin T, Böhling T, Halme L, Kontiainen S, Bjørge L, Meri S. Decreased expression of protectin (CD59) in gut epithelium in ulcerative colitis and Crohn's disease. Hum Pathol. 1999;30(12):1427–30.
5. Beisswanger E, Lee V, Kim JJ, Rebholz-Schuhmann D, Splendiani A, Dameron O, Schulz S, Hahn U. Gene Regulation Ontology (GRO): Design principles and use cases. Stud Health Technol Inform. 2008;136:9–14.
6. Lesch CA, Kraus ER, Sanchez B, Gilbertsen R, Guglietta A. Lack of beneficial effect of COX-2 inhibitors in an experimental model of colitis. Methods Find Exp Clin Pharmacol. 1999;21(2):99–104.
7. Rodriguez-Esteban R, Roberts PM, Crawford ME. Identifying and classifying biomedical perturbations in text. Nucleic Acids Res. 2009;37(3):771–7.
8. Barker C. Vagueness. In: Brown K, editor. Encyclopedia of Languages and Linguistics. 2nd ed. Oxford: Elsevier; 2006.
9. Takahashi H, Hashimoto Y, Ishida-Yamamoto A, Iizuka H. Roxithromycin suppresses involucrin expression by modulation of activator protein-1 and nuclear factor-kappaB activities of keratinocytes. J Dermatol Sci. 2005;39(3):175–82.
10. Wilbur WJ, Rzhetsky A, Shatkay H. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics. 2006;7:356.
11. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002;30(1):207–10.
12. Parkinson H, Kapushesky M, Kolesnikov N, Rustici G, Shojatalab M, Abeygunawardena N, Berube H, Dylag M, Emam I, Farne A, Holloway E, Lukk M, Malone J, Mani R, Pilicheva E, Rayner TF, Rezwan F, Sharma A, Williams E, Bradley XZ, Adamusiak T, Brandizi M, Burdett T, Coulson R, Krestyaninova M, Kurnosov P, Maguire E, Neogi SG, Rocca-Serra P, Sansone SA, Sklyar N, Zhao M, Sarkans U, Brazma A. ArrayExpress update--from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res. 2009;37(Database issue):D868–72.
13. Kodama K, Horikoshi M, Toda K, Yamada S, Hara K, Irie J, Sirota M, Morgan AA, Chen R, Ohtsu H, Maeda S, Kadowaki T, Butte AJ. Expression-based genome-wide association study links the receptor CD44 in adipose tissue with type 2 diabetes. Proc Natl Acad Sci USA. 2012;109(18):7049–54.
14. Gusnanto A, Calza S, Pawitan Y. Identification of differentially expressed genes and false discovery rate in microarray studies. Curr Opin Lipidol. 2007;18(2):187–93.
15. Rivas MV, Jarvis ED, Morisaki S, Carbonaro H, Gottlieb AB, Krueger JG. Identification of aberrantly regulated genes in diseased skin using the cDNA differential display technique. J Invest Dermatol. 1997;108(2):188–94.
16. Trask HW, Cowper-Sal-lari R, Sartor MA, Gui J, Heath CV, Renuka J, Higgins AJ, Andrews P, Korc M, Moore JH, Tomlinson CR. Microarray analysis of cytoplasmic versus whole cell RNA reveals a considerable number of missed and false positive mRNAs. RNA. 2009;15(10):1917–28.
17. Bammler T, Beyer RP, Bhattacharya S, Boorman GA, Boyles A, et al. Standardizing global gene expression analysis between laboratories and across platforms. Nat Methods. 2005;2(5):351–6.
18. Fu X, Fu N, Guo S, Yan Z, Xu Y, Hu H, Menzel C, Chen W, Li Y, Zeng R, Khaitovich P. Estimating accuracy of RNA-Seq and microarrays with proteomics. BMC Genomics. 2009;10:161.
19. Nazarov PV, Muller A, Kaoma T, Nicot N, Maximo C, Birembaut P, Tran NL, Dittmar G, Vallar L. RNA sequencing and transcriptome arrays analyses show opposing results for alternative splicing in patient derived samples. BMC Genomics. 2017;18(1):443.
20. Chang JF, Popescu M, Arthur GL. Automated extraction of precise protein expression patterns in lymphoma by text mining abstracts of immunohistochemical studies. J Pathol Inform. 2013;4:20.
21. Hunter L, Lu Z, Firby J, Baumgartner WA Jr, Johnson HL, Ogren PV, Cohen KB. OpenDMAP: an open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics. 2008;9:78.
22. Gerner M, Nenadic G, Bergman CM. An exploration of mining gene expression mentions and their anatomical locations from biomedical text. Proceedings of the 2010 Workshop on Biomedical Natural Language Processing. 2010.
23. Neves M, Damaschun A, Mah N, Lekschas F, Seltmann S, Stachelscheid H, Fontaine JF, Kurtz A, Leser U. Preliminary evaluation of the CellFinder literature curation pipeline for gene expression in kidney cells and anatomical parts. Database (Oxford). 2013;2013(0):bat020.
24. Kim JD, Ohta T, Pyysalo S, Kano Y, Tsujii J. Overview of BioNLP’09 Shared Task on Event Extraction. Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing: Shared Task. 2009.
25. Kim J, Pyysalo S, Ohta T, Bossy R, Nguyen N, Tsujii J. Overview of BioNLP Shared Task 2011. Proceedings of the BioNLP Shared Task 2011 Workshop; 2011. p. 1–6.
26. Wei CH, Harris BR, Li D, Berardini TZ, Huala E, Kao HY, Lu Z. Accelerating literature curation with text-mining tools: a case study of using PubTator to curate genes in PubMed abstracts. Database (Oxford). 2012;2012:bas041.
27. Tomanek K, Wermter J, Hahn U. Sentence and token splitting based on conditional random fields. Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics; 2007. p. 49–57.
GoogleScholar  CelisJE,CrügerD,KiilJ,LauridsenJB,RatzG,BasseB,CelisA.Identificationofagroupofproteinsthatarestronglyup-regulatedintotalepidermalkeratinocytesfrompsoriaticskin.FEBSLett.1990;262(2):159–64.CAS  Article  PubMed  GoogleScholar  NajarianDJ,GottliebAB.ConnectionsbetweenpsoriasisandCrohn'sdisease.JAmAcadDermatol.2003;48(6):805–21.Article  PubMed  GoogleScholar  MiH,MuruganujanA,ThomasPD.PANTHERin2013:modelingtheevolutionofgenefunction,andothergeneattributes,inthecontextofphylogenetictrees.NucleicAcidsRes.2013;41(Databaseissue):D377–86.CAS  Article  PubMed  GoogleScholar  CokolM,Rodriguez-EstebanR,RzhetskyA.Arecipeforhighimpact.GenomeBiol.2007;8(5):406.Article  PubMed  PubMedCentral  GoogleScholar  CokolM,Rodriguez-EstebanR.Visualizingevolutionandimpactofbiomedicalfields.JBiomedInform.2008;41(6):1050–2.Article  PubMed  PubMedCentral  GoogleScholar  Rodriguez-EstebanR,LogingWT.Quantifyingthecomplexityofmedicalresearch.Bioinformatics.2013;29(22):2918–24.CAS  Article  PubMed  GoogleScholar  FrantzS.Anarrayofproblems.NatRevDrugDiscov.2005;4(5):362–3.CAS  Article  PubMed  GoogleScholar  MichielsS,KoscielnyS,HillC.Predictionofcanceroutcomewithmicroarrays:amultiplerandomvalidationstrategy.Lancet.2005;365(9458):488–92.CAS  Article  PubMed  GoogleScholar  CouzinJ.Genomics.Microarraydatareproduced,butsomeconcernsremain.Science.2006;313(5793):1559.CAS  Article  PubMed  GoogleScholar  TanPK,DowneyTJ,SpitznagelELJr,XuP,FuD,DimitrovDS,LempickiRA,RaakaBM,CamMC.Evaluationofgeneexpressionmeasurementsfromcommercialmicroarrayplatforms.NucleicAcidsRes2003;31(19):5676-5684.DalmanMR,DeeterA,NimishakaviG,DuanZH.Foldchangeandp-valuecutoffssignificantlyaltermicroarrayinterpretations.BMCBioinformatics.2012;13(Suppl2):S11.Article  PubMed  PubMedCentral  GoogleScholar  DownloadreferencesAcknowledgementsNotapplicable. 
Availability of data and material

The datasets supporting the conclusions of this article are available in the Gene Expression Omnibus repository, https://www.ncbi.nlm.nih.gov/geo/, or included within the article (and its additional files).

Funding

The authors received no specific funding for this work.

Author information

Authors and Affiliations

Roche Pharmaceutical Research and Early Development, Roche Innovation Center Basel, Grenzacherstrasse 124, 4070, Basel, Switzerland: Raul Rodriguez-Esteban

Biogen, Cambridge, MA, USA: Xiaoyu Jiang

Contributions

Conceived and designed the analysis: RR and XJ. Gathered the data: RR and XJ. Analyzed the data: RR. Wrote the paper: RR. All authors reviewed drafts of the manuscript and read and approved the final manuscript.

Corresponding author

Correspondence to Raul Rodriguez-Esteban.

Ethics declarations

Ethics approval and consent to participate: Not applicable.

Consent for publication: Not applicable.

Competing interests: The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files

Additional file 1: Boxplot and PCA for AD. Boxplot and principal component analysis for the GSE36842 study. (TIFF 189 kb)

Additional file 2: Boxplot and PCA for CD. Boxplot and principal component analysis for the GSE36807 study. (TIFF 120 kb)

Additional file 3: Boxplot and PCA for PS. Boxplot and principal component analysis for the GSE13355 study. (TIFF 100 kb)

Additional file 4: Boxplot and PCA for UC. Boxplot and principal component analysis for the GSE38713 study. (TIFF 226 kb)

Additional file 5: Text mining results. Curated results produced by the text mining algorithm. (XLSX 162 kb)

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article: Rodriguez-Esteban, R., Jiang, X. Differential gene expression in disease: a comparison between high-throughput studies and the literature. BMC Med Genomics 10, 59 (2017). https://doi.org/10.1186/s12920-017-0293-y

Received: 31 March 2017
Accepted: 02 October 2017
Published: 11 October 2017
DOI: https://doi.org/10.1186/s12920-017-0293-y

Keywords

Microarray FC; Differential Gene Expression Data; Differentially Expressed Genes (DEG); Microarray Datasets; Gene Regulation Ontology

BMC Medical Genomics, ISSN: 1755-8794


