CUDA FAQ | NVIDIA Developer


General Questions

Q: What is CUDA?
CUDA® is a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU). Since its introduction in 2006, CUDA has been widely deployed through thousands of applications and published research papers, and is supported by an installed base of hundreds of millions of CUDA-enabled GPUs in notebooks, workstations, compute clusters and supercomputers. Applications in astronomy, biology, chemistry, physics, data mining, manufacturing, finance, and other computationally intense fields are increasingly using CUDA to deliver the benefits of GPU acceleration.

Q: What is NVIDIA Tesla™?
With the world's first teraflop many-core processor, NVIDIA® Tesla™ computing solutions enable the necessary transition to energy-efficient parallel computing power. With thousands of CUDA cores per processor, Tesla scales to solve the world's most important computing challenges quickly and accurately.

Q: What is OpenACC?
OpenACC is an open industry standard for compiler directives, or hints, which can be inserted in code written in C or Fortran, enabling the compiler to generate code that runs in parallel on multi-CPU and GPU-accelerated systems. OpenACC directives are an easy and powerful way to leverage the power of GPU computing while keeping your code compatible with non-accelerated, CPU-only systems. Learn more at /openacc.

Q: What kind of performance increase can I expect using GPU Computing over CPU-only code?
This depends on how well the problem maps onto the architecture. For data-parallel applications, accelerations of more than two orders of magnitude have been seen. You can browse research, developer, applications and partners on our CUDA In Action page.

Q: What operating systems does CUDA support?
CUDA supports Windows, Linux and Mac OS. For the full list, see the latest CUDA Toolkit Release Notes. The latest version is available at http://docs.nvidia.com.

Q: Which GPUs support running CUDA-accelerated applications?
CUDA is a standard feature in all NVIDIA GeForce, Quadro, and Tesla GPUs, as well as NVIDIA GRID solutions. A full list can be found on the CUDA GPUs page.

Q: What is the "compute capability"?
The compute capability of a GPU determines its general specifications and available features. For details, see the Compute Capabilities section in the CUDA C Programming Guide.

Q: Where can I find a good introduction to parallel programming?
There are several university courses online, technical webinars, article series and also several excellent books on parallel computing. These can be found on our CUDA Education page.

Hardware and Architecture

Q: Will I have to re-write my CUDA kernels when the next new GPU architecture is released?
No. CUDA C/C++ provides an abstraction; it is a means for you to express how you want your program to execute. The compiler generates PTX code, which is also not hardware specific. At run time the PTX is compiled for the specific target GPU; this is the responsibility of the driver, which is updated every time a new GPU is released. It is possible that changes in the number of registers or the size of shared memory may open up opportunities for further optimization, but that is optional. So write your code now, and enjoy it running on future GPUs.

Q: Does CUDA support multiple graphics cards in one system?
Yes. Applications can distribute work across multiple GPUs. This is not done automatically, however, so the application has complete control. See the "multiGPU" example in the GPU Computing SDK for an example of programming multiple GPUs.

Q: Where can I find more information on NVIDIA GPU architecture?
Two good places to start are the Kepler Architecture Whitepaper and the Fermi Architecture Whitepaper.

Programming Questions

Q: I think I've found a bug in CUDA, how do I report it?
Sign up as a CUDA registered developer; once your application has been approved you can file bugs, which will be reviewed by NVIDIA engineering. Your bug report should include a simple, self-contained piece of code that demonstrates the bug, along with a description of the bug and the expected behavior. Please include the following information with your bug report:
- Machine configuration (CPU, motherboard, memory, etc.)
- Operating system
- CUDA Toolkit version
- Display driver version
For Linux users, please attach an nvidia-bug-report.log, which is generated by running "nvidia-bug-report.sh".

Q: How does CUDA structure computation?
CUDA broadly follows the data-parallel model of computation. Typically each thread executes the same operation on different elements of the data in parallel. The data is split up into a 1D, 2D or 3D grid of blocks. Each block can be 1D, 2D or 3D in shape, and can consist of up to 1024 threads on current hardware (512 on older, compute capability 1.x devices). Threads within a thread block can cooperate via shared memory. Thread blocks are executed as smaller groups of threads known as "warps". A minimal kernel launch illustrating this structure is sketched below.
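For illustration only (this sketch is not part of the original FAQ), here is a minimal kernel in which each thread processes one array element, launched over a 1D grid of 1D blocks. The kernel name, array size and block size are placeholders.

    #include <cuda_runtime.h>

    // Each thread scales one element of the array (illustrative example).
    __global__ void scale(float *data, float factor, int n)
    {
        // Global index built from block index, block size and thread index.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        // 1D grid of 1D blocks: 256 threads per block, enough blocks to cover n.
        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);
        scale<<<grid, block>>>(d_data, 2.0f, n);

        cudaDeviceSynchronize();   // wait for the kernel to finish
        cudaFree(d_data);
        return 0;
    }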
Q: Can the CPU and GPU run in parallel?
Kernel invocation in CUDA is asynchronous, so the driver will return control to the application as soon as it has launched the kernel. The "cudaThreadSynchronize()" API call (or the newer "cudaDeviceSynchronize()") should be used when measuring performance, to ensure that all device operations have completed before stopping the timer. CUDA functions that perform memory copies and that control graphics interoperability are synchronous, and implicitly wait for all kernels to complete.

Q: Can I transfer data and run a kernel in parallel (for streaming applications)?
Yes, CUDA supports overlapping GPU computation and data transfers using CUDA streams. See the Asynchronous Concurrent Execution section of the CUDA C Programming Guide for more details, and the sketch below.
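As an illustrative sketch (not from the original FAQ), the following shows the pattern described above: two streams each copy half of the data asynchronously and launch a kernel on their chunk, so copies and computation can overlap on hardware that supports it. Asynchronous copies require page-locked host memory, allocated here with cudaMallocHost. The kernel, sizes and stream count are placeholders.

    #include <cuda_runtime.h>

    __global__ void process(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;   // placeholder work
    }

    int main()
    {
        const int n = 1 << 20, half = n / 2;
        float *h, *d;
        cudaMallocHost(&h, n * sizeof(float));   // page-locked host memory
        cudaMalloc(&d, n * sizeof(float));

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        // Each stream copies its half in, processes it, and copies it back.
        for (int k = 0; k < 2; ++k) {
            float *hp = h + k * half, *dp = d + k * half;
            cudaMemcpyAsync(dp, hp, half * sizeof(float), cudaMemcpyHostToDevice, s[k]);
            process<<<(half + 255) / 256, 256, 0, s[k]>>>(dp, half);
            cudaMemcpyAsync(hp, dp, half * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
        }
        cudaDeviceSynchronize();   // wait for both streams to finish

        cudaStreamDestroy(s[0]);
        cudaStreamDestroy(s[1]);
        cudaFreeHost(h);
        cudaFree(d);
        return 0;
    }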
Q: Is it possible to DMA directly into GPU memory from another PCI-E device?
GPUDirect allows you to DMA directly to GPU host memory. See the GPUDirect technology page for details.

Q: What are the peak transfer rates between the CPU and GPU?
The performance of memory transfers depends on many factors, including the size of the transfer and the type of system motherboard used. On PCI-Express 2.0 systems we have measured up to 6.0 GB/sec transfer rates. You can measure the bandwidth on your system using the bandwidthTest sample from the SDK. Transfers from page-locked memory are faster because the GPU can DMA directly from this memory. However, allocating too much page-locked memory can significantly affect the overall performance of the system, so allocate it with care.

Q: What is the precision of mathematical operations in CUDA?
All GPUs in the current NVIDIA range, and all GPUs since GT200, support double precision floating point; see the programming guide for more details. All compute-capable NVIDIA GPUs support 32-bit integer and single precision floating point arithmetic. They follow the IEEE 754 standard for single-precision binary floating-point arithmetic, with some minor differences.

Q: Why are the results of my GPU computation slightly different from the CPU results?
There are many possible reasons. Floating point computations are not guaranteed to give identical results across any set of processor architectures. The order of operations will often be different when implementing algorithms in a data-parallel way on the GPU. This is a very good reference on floating point arithmetic: Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs.

Q: Does CUDA support double precision arithmetic?
Yes. GPUs with compute capability 1.3 and higher support double precision floating point in hardware.

Q: How do I get double precision floating point to work in my kernel?
You need to add the switch "-arch sm_13" (or a higher compute capability) to your nvcc command line, otherwise doubles will be silently demoted to floats. See the "Mandelbrot" sample included in the CUDA Installer for an example of how to switch between different kernels based on the compute capability of the GPU.

Q: Can I read double precision floats from texture?
The hardware doesn't support double precision float as a texture format, but it is possible to use int2 and cast it to double as long as you don't need interpolation:

    texture<int2> my_texture;

    static __inline__ __device__ double fetch_double(texture<int2> t, int i)
    {
        int2 v = tex1Dfetch(t, i);
        return __hiloint2double(v.y, v.x);
    }

Q: Does CUDA support long integers?
Yes, CUDA supports 64-bit integers (long long). Operations on these types compile to multiple instruction sequences on some GPUs, depending on compute capability.

Q: Where can I find documentation on the PTX assembly language?
This is included in the CUDA Toolkit documentation.

Q: How can I see the PTX code generated by my program?
Add "-keep" to the nvcc command line (or to the custom build setup in Visual Studio) to keep the intermediate compilation files, then look at the ".ptx" file.

Q: How can I find out how many registers, and how much shared and constant memory, my kernel is using?
Add the option "--ptxas-options=-v" to the nvcc command line. When compiling, this information will be output to the console.

Q: Is it possible to execute multiple kernels at the same time?
Yes. GPUs of compute capability 2.x or higher support concurrent kernel execution and launches.

Q: What is the maximum length of a CUDA kernel?
Since this can depend on the compute capability of your GPU, the definitive answer can be found in the Features & Technical Specifications section of the CUDA C Programming Guide.

Q: How can I debug my CUDA code?
There are several powerful debugging tools which allow the creation of breakpoints and traces. Tools exist for all the major operating systems, multi-GPU solutions and clusters. Please visit the CUDA Tools and Ecosystem page for the latest debugging tools.

Q: How can I optimize my CUDA code?
There are now extensive guides and examples on how to optimize your CUDA code. Find some useful links below:
- CUDA C Programming Guide
- CUDA Education pages
- Performance analysis tools
- Optimized libraries

Q: How do I choose the optimal number of threads per block?
For maximum utilization of the GPU you should carefully balance the number of threads per thread block, the amount of shared memory per block, and the number of registers used by the kernel. You can use the CUDA Occupancy Calculator tool to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. This is included as part of the latest CUDA Toolkit.

Q: What is the maximum kernel execution time?
On Windows, individual GPU program launches have a maximum run time of around 5 seconds. Exceeding this time limit usually causes a launch failure reported through the CUDA driver or the CUDA runtime, but in some cases it can hang the entire machine, requiring a hard reset. This is caused by the Windows "watchdog" timer, which causes programs using the primary graphics adapter to time out if they run longer than the maximum allowed time. For this reason it is recommended that CUDA be run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.

Q: How do I compute the sum of an array of numbers on the GPU?
This is known as a parallel reduction operation. See the "reduction" sample for more details, and the sketch below.
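As an illustrative sketch (not taken from the SDK's "reduction" sample, which shows several progressively faster variants), here is a basic shared-memory reduction: each block sums its chunk of the input into one partial sum, and the partial sums are then added on the host. Names and sizes are placeholders; the block size is assumed to be a power of two.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Each block reduces blockDim.x elements into one partial sum.
    __global__ void block_sum(const float *in, float *partial, int n)
    {
        extern __shared__ float s[];            // shared memory sized at launch
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        s[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction in shared memory (blockDim.x must be a power of two).
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                s[tid] += s[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            partial[blockIdx.x] = s[0];
    }

    int main()
    {
        const int n = 1 << 20, threads = 256;
        const int blocks = (n + threads - 1) / threads;

        float *h_in = new float[n];
        for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

        float *d_in, *d_partial;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_partial, blocks * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);

        // Finish the reduction on the host by summing the per-block results.
        float *h_partial = new float[blocks];
        float total = 0.0f;
        cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
        for (int b = 0; b < blocks; ++b) total += h_partial[b];
        printf("sum = %f\n", total);

        cudaFree(d_in); cudaFree(d_partial);
        delete[] h_in; delete[] h_partial;
        return 0;
    }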
Q: How do I output a variable amount of data from each thread?
This can be achieved using a parallel prefix sum (also known as "scan") operation. The CUDA Data Parallel Primitives library (CUDPP) includes highly optimized scan functions: http://www.gpgpu.org/developer/cudpp/. The "marchingCubes" sample demonstrates the use of scan for variable output per thread.

Q: How do I sort an array on the GPU?
The provided "particles" sample includes a fast parallel radix sort. To sort an array of values within a block, you can use a parallel bitonic sort; see also the "bitonic" sample. The Thrust library also includes sort functions. See more sample information in our online sample documentation.

Q: What do I need to distribute my CUDA application?
Applications that use the driver API only need the CUDA driver library ("nvcuda.dll" under Windows), which is included as part of the standard NVIDIA driver install. Applications that use the runtime API also require the runtime library ("cudart.dll" under Windows), which is included in the CUDA Toolkit. It is permissible to distribute this library with your application under the terms of the End User License Agreement included with the CUDA Toolkit.

Q: How can I get information on GPU temperature from my application?
On Microsoft Windows platforms, NVIDIA's NVAPI gives access to GPU temperature and many other low-level GPU functions. Under Linux, the "nvidia-smi" utility, which is included with the standard driver install, also displays GPU temperature for all installed devices.

Tools, Libraries and Solutions

Q: What is CUFFT?
CUFFT is a Fast Fourier Transform (FFT) library for CUDA. See the CUFFT documentation for more information.

Q: What types of transforms does CUFFT support?
The current release supports complex-to-complex (C2C), real-to-complex (R2C) and complex-to-real (C2R) transforms.

Q: What is the maximum transform size?
For 1D transforms, the maximum transform size is 16M elements in the 1.0 release.

Q: What is CUBLAS?
CUBLAS is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the CUDA driver. It allows access to the computational resources of NVIDIA GPUs. The library is self-contained at the API level; that is, no direct interaction with the CUDA driver is necessary. A minimal usage sketch follows.
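As an illustrative sketch (not part of the original FAQ), here is how a single-precision matrix multiply might be issued through the cublas_v2 API. Matrix sizes and data are placeholders, and note that CUBLAS expects column-major storage. Compile with nvcc and link against the library (-lcublas).

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>

    int main()
    {
        const int n = 512;   // square matrices, placeholder size
        std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

        float *dA, *dB, *dC;
        cudaMalloc(&dA, n * n * sizeof(float));
        cudaMalloc(&dB, n * n * sizeof(float));
        cudaMalloc(&dC, n * n * sizeof(float));
        cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // C = alpha * A * B + beta * C, all matrices in column-major layout.
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

        cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        return 0;
    }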
Q: Does NVIDIA have a CUDA debugger on Linux and Mac?
Yes. CUDA-GDB is the CUDA debugger for Linux distros and Mac OS X platforms.

Q: Does CUDA-GDB support any UIs?
CUDA-GDB is a command-line debugger but can be used with GUI front ends like DDD (Data Display Debugger), Emacs and XEmacs. There are also third-party solutions; see the list of options on our Tools & Ecosystem page.

Q: What are the main differences between Parallel Nsight and CUDA-GDB?
Both share the same features except for the following: Parallel Nsight runs on Windows and can debug both graphics and CUDA code on the GPU (no CPU code debugging). CUDA-GDB runs on Linux and Mac OS and can debug both CPU code and CUDA code on the GPU (no graphics debugging on the GPU).

Q: How does one debug an OGL+CUDA application with an interactive desktop?
You can use ssh, nxclient or vnc to remotely debug an OGL+CUDA application. This requires users to disable the interactive session in the X server config file. For details, refer to the CUDA-GDB user guide.

Q: Which debugger do I use for cluster debugging?
NVIDIA works with its partners to provide cluster debuggers. There are two cluster debuggers that support CUDA: DDT from Allinea and the TotalView debugger from Rogue Wave Software.

Q: What impact does the -G flag have on code optimizations?
The -G flag turns off most of the compiler optimizations on the CUDA code. Some optimizations cannot be turned off because they are required for the application to keep running properly. For instance, local variables will not be spilled to local memory and are instead preserved in registers, for which the debugger tracks live ranges. This is required to ensure that an application will not run out of memory when compiled in debug mode if it can be launched without incident without the debug flag.

Q: Is there a way to reach the debugger team for additional questions or issues?
Anyone interested can email cuda-debugger-bugs@nvidia.com.

Engaging with NVIDIA

Q: How can I send suggestions for improvements to the CUDA Toolkit?
Become a registered developer; then you can directly use our bug reporting system to make suggestions and requests, in addition to reporting bugs, etc.

Q: I would like to ask the CUDA team some questions directly.
You can get direct face-to-face time with our team at GTC, which we hold every year; find out when the next one is at www.gputechconf.com. Also attend one of our live Q&A webinars, where you can ask questions directly to some of our leading CUDA engineers. To attend, become a registered developer.


