CUDA - Wikipedia

2024-11-26

From Wikipedia, the free encyclopedia

Parallel computing platform and programming model

Developer(s): Nvidia
Initial release: June 23, 2007
Stable release: 11.6.1 / February 22, 2022
Operating system: Windows, Linux
Platform: Supported GPUs
Type: GPGPU
License: Proprietary
Website: developer.nvidia.com/cuda-zone

CUDA (or Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing unit (GPU) for general purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements, for the execution of compute kernels.[1]

CUDA is designed to work with programming languages such as C, C++, and Fortran. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced skills in graphics programming.[2] CUDA-powered GPUs also support programming frameworks such as OpenMP, OpenACC and OpenCL;[3][1] and HIP by compiling such code to CUDA.

CUDA was created by Nvidia.[4] When it was first introduced, the name was an acronym for Compute Unified Device Architecture,[5] but Nvidia later dropped the common use of the acronym.
Contents
1 Background
2 Programming abilities
3 Advantages
4 Limitations
5 GPUs supported
6 Version features and specifications
7 Example
8 Current and future usages of CUDA architecture
9 See also
10 References
11 External links

Background

Further information: Graphics processing unit

The graphics processing unit (GPU), as a specialized computer processor, addresses the demands of real-time, high-resolution 3D graphics and other compute-intensive tasks. By 2012, GPUs had evolved into highly parallel multi-core systems allowing very efficient manipulation of large blocks of data. This design is more effective than general-purpose central processing units (CPUs) for algorithms in situations where processing large blocks of data is done in parallel, such as:

- push–relabel maximum flow algorithm
- fast sort algorithms of large lists
- two-dimensional fast wavelet transform
- molecular dynamics simulations
- machine learning

Programming abilities

Example of CUDA processing flow:
1. Copy data from main memory to GPU memory
2. CPU initiates the GPU compute kernel
3. GPU's CUDA cores execute the kernel in parallel
4. Copy the resulting data from GPU memory to main memory

The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives such as OpenACC, and extensions to industry-standard programming languages including C, C++ and Fortran. C/C++ programmers can use 'CUDA C/C++', compiled to PTX with nvcc, Nvidia's LLVM-based C/C++ compiler.[6] Fortran programmers can use 'CUDA Fortran', compiled with the PGI CUDA Fortran compiler from The Portland Group.

In addition to libraries, compiler directives, CUDA C/C++ and CUDA Fortran, the CUDA platform supports other computational interfaces, including the Khronos Group's OpenCL,[7] Microsoft's DirectCompute, OpenGL Compute Shader and C++ AMP.[8] Third-party wrappers are also available for Python, Perl, Fortran, Java, Ruby, Lua, Common Lisp, Haskell, R, MATLAB, IDL and Julia, and there is native support in Mathematica.
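The four-step processing flow listed in this section (copy in, launch, execute in parallel, copy out) can be sketched with the CUDA Runtime API as follows. This is a minimal illustration rather than code from the article: the kernel `vecAdd`, the host helper `addOnGpu`, the error-checking helper `check` and the block size of 256 are all hypothetical choices, and the sketch assumes two non-empty input vectors of equal length and a machine with a CUDA-capable GPU.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard against threads past the end
}

// Abort with a readable message if a CUDA runtime call fails.
static void check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical host helper; assumes a and b are non-empty and equally long.
std::vector<float> addOnGpu(const std::vector<float>& a, const std::vector<float>& b)
{
    const int n = static_cast<int>(a.size());
    const size_t bytes = n * sizeof(float);
    std::vector<float> c(n);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    check(cudaMalloc(&dA, bytes), "cudaMalloc dA");
    check(cudaMalloc(&dB, bytes), "cudaMalloc dB");
    check(cudaMalloc(&dC, bytes), "cudaMalloc dC");

    // Step 1: copy data from main memory to GPU memory.
    check(cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice), "copy a");
    check(cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice), "copy b");

    // Steps 2 and 3: the CPU launches the kernel; the GPU runs it in parallel.
    const int threads = 256;                          // assumed block size
    const int blocks = (n + threads - 1) / threads;   // round up to cover n
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
    check(cudaGetLastError(), "kernel launch");

    // Step 4: copy the result back; this cudaMemcpy waits for the kernel.
    check(cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost), "copy c");

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}

int main()
{
    std::vector<float> a(1000, 1.0f), b(1000, 2.0f);
    std::vector<float> c = addOnGpu(a, b);
    for (float v : c) {
        if (v != 3.0f) { std::puts("mismatch"); return 1; }
    }
    std::puts("all elements summed on the GPU");
    return 0;
}
```

Assuming the CUDA toolkit is installed, a file like this would typically be built with `nvcc`, for example `nvcc vec_add.cu -o vec_add` (the file name is arbitrary).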
In the computer game industry, GPUs are used for graphics rendering and for game physics calculations (physical effects such as debris, smoke, fire and fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate non-graphical applications in computational biology, cryptography and other fields by an order of magnitude or more.[9][10][11][12][13]

CUDA provides both a low-level API (CUDA Driver API, non single-source) and a higher-level API (CUDA Runtime API, single-source). The initial CUDA SDK was made public on 15 February 2007, for Microsoft Windows and Linux. Mac OS X support was later added in version 2.0,[14] which supersedes the beta released February 14, 2008.[15] CUDA works with all Nvidia GPUs from the G8x series onwards, including GeForce, Quadro and the Tesla line. CUDA is compatible with most standard operating systems.

CUDA 8.0 comes with the following libraries (for compilation and runtime, in alphabetical order):

- cuBLAS – CUDA Basic Linear Algebra Subroutines library
- CUDART – CUDA Runtime library
- cuFFT – CUDA Fast Fourier Transform library
- cuRAND – CUDA Random Number Generation library
- cuSOLVER – CUDA-based collection of dense and sparse direct solvers
- cuSPARSE – CUDA Sparse Matrix library
- NPP – NVIDIA Performance Primitives library
- nvGRAPH – NVIDIA Graph Analytics library
- NVML – NVIDIA Management Library
- NVRTC – NVIDIA Runtime Compilation library for CUDA C++

CUDA 8.0 comes with these other software components:

- nView – NVIDIA nView Desktop Management Software
- NVWMI – NVIDIA Enterprise Management Toolkit
- GameWorks PhysX – a multi-platform game physics engine

CUDA 9.0–9.2 comes with these other components:

- CUTLASS 1.0 – custom linear algebra algorithms
- NVCUVID – NVIDIA Video Decoder; deprecated in CUDA 9.2 and now available in the NVIDIA Video Codec SDK

CUDA 10 comes with these other components:

- nvJPEG – hybrid (CPU and GPU) JPEG processing

CUDA 11.0–11.6 comes with these other components:[16][17][18][19]

- CUB – newly included as one of the supported C++ libraries
- MIG – multi-instance GPU support
- nvJPEG2000 – JPEG 2000 encoder and decoder

Advantages

CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs:

- Scattered reads – code can read from arbitrary addresses in memory.
- Unified virtual memory (CUDA 4.0 and above)
- Unified memory (CUDA 6.0 and above)
- Shared memory – CUDA exposes a fast shared memory region that can be shared among threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.[20]
- Faster downloads and readbacks to and from the GPU
- Full support for integer and bitwise operations, including integer texture lookups
- On RTX 20 and 30 series cards, the CUDA cores are used for a feature called "RTX IO", in which the CUDA cores dramatically decrease game-loading times.

Limitations

- Whether for the host computer or the GPU device, all CUDA source code is now processed according to C++ syntax rules.[21] This was not always the case. Earlier versions of CUDA were based on C syntax rules.[22] As with the more general case of compiling C code with a C++ compiler, it is therefore possible that old C-style CUDA source code will either fail to compile or will not behave as originally intended.
- Interoperability with rendering languages such as OpenGL is one-way, with OpenGL having access to registered CUDA memory but CUDA not having access to OpenGL memory.
- Copying between host and device memory may incur a performance hit due to system bus bandwidth and latency (this can be partly alleviated with asynchronous memory transfers, handled by the GPU's DMA engine).
- Threads should be running in groups of at least 32 for best performance, with the total number of threads numbering in the thousands. Branches in the program code do not affect performance significantly, provided that each of the 32 threads takes the same execution path; the SIMD execution model becomes a significant limitation for any inherently divergent task (e.g. traversing a space partitioning data structure during ray tracing).
- No emulator or fallback functionality is available for modern revisions.
- Valid C++ may sometimes be flagged and prevent compilation due to the way the compiler approaches optimization for target GPU device limitations.[citation needed]
- C++ run-time type information (RTTI) and C++-style exception handling are only supported in host code, not in device code.
- In single precision on first-generation CUDA compute capability 1.x devices, denormal numbers are unsupported and are instead flushed to zero, and the precision of both the division and square root operations is slightly lower than IEEE 754-compliant single-precision math. Devices that support compute capability 2.0 and above support denormal numbers, and the division and square root operations are IEEE 754 compliant by default. However, users can obtain the prior faster gaming-grade math of compute capability 1.x devices if desired by setting compiler flags to disable accurate divisions and accurate square roots, and enable flushing denormal numbers to zero.[23]
- Unlike OpenCL, CUDA-enabled GPUs are only available from Nvidia.[24] Attempts to implement CUDA on other GPUs include:
  - Project Coriander: converts CUDA C++11 source to OpenCL 1.2 C. A fork of CUDA-on-CL intended to run TensorFlow.[25][26][27]
  - CU2CL: converts CUDA 3.2 C++ to OpenCL C.[28]
  - GPUOpen HIP: a thin abstraction layer on top of CUDA and ROCm intended for AMD and Nvidia GPUs. Has a conversion tool for importing CUDA C++ source. Supports CUDA 4.0 plus C++11 and float16.

GPUs supported

Supported CUDA level of GPU and card. See also at Nvidia:

- CUDA SDK 1.0 support for compute capability 1.0 – 1.1 (Tesla)[29]
- CUDA SDK 1.1 support for compute capability 1.0 – 1.1+x (Tesla)
- CUDA SDK 2.0 support for compute capability 1.0 – 1.1+x (Tesla)
- CUDA SDK 2.1 – 2.3.1 support for compute capability 1.0 – 1.3 (Tesla)[30][31][32][33]
- CUDA SDK 3.0 – 3.1 support for compute capability 1.0 – 2.0 (Tesla, Fermi)[34][35]
- CUDA SDK 3.2 support for compute capability 1.0 – 2.1 (Tesla, Fermi)[36]
- CUDA SDK 4.0 – 4.2 support for compute capability 1.0 – 2.1+x (Tesla, Fermi, more?).
- CUDA SDK 5.0 – 5.5 support for compute capability 1.0 – 3.5 (Tesla, Fermi, Kepler).
- CUDA SDK 6.0 support for compute capability 1.0 – 3.5 (Tesla, Fermi, Kepler).
- CUDA SDK 6.5 support for compute capability 1.1 – 5.x (Tesla, Fermi, Kepler, Maxwell). Last version with support for compute capability 1.x (Tesla).
- CUDA SDK 7.0 – 7.5 support for compute capability 2.0 – 5.x (Fermi, Kepler, Maxwell).
- CUDA SDK 8.0 support for compute capability 2.0 – 6.x (Fermi, Kepler, Maxwell, Pascal). Last version with support for compute capability 2.x (Fermi) (Pascal GTX 1070 Ti not supported).
- CUDA SDK 9.0 – 9.2 support for compute capability 3.0 – 7.2 (Kepler, Maxwell, Pascal, Volta) (Pascal GTX 1070 Ti is not supported in CUDA SDK 9.0 but is supported in CUDA SDK 9.2).
- CUDA SDK 10.0 – 10.2 support for compute capability 3.0 – 7.5 (Kepler, Maxwell, Pascal, Volta, Turing). Last version with support for compute capability 3.x (Kepler). 10.2 is the last official release for macOS, as support will not be available for macOS in newer releases.
- CUDA SDK 11.0 support for compute capability 3.5 – 8.0 (Kepler (in part), Maxwell, Pascal, Volta, Turing, Ampere (in part)).[37] New data types: Bfloat16 and TF32 on third-generation Tensor Cores.[38]
- CUDA SDK 11.1 – 11.6 support for compute capability 3.5 – 8.6 (Kepler (in part), Maxwell, Pascal, Volta, Turing, Ampere).[39]

Compute capability (version), micro-architecture, GPUs, and products by family (GeForce; Quadro, NVS; Tesla/Datacenter; Tegra, Jetson, DRIVE):

Compute capability 1.0 (Tesla)
  GPUs: G80
  GeForce: GeForce 8800 Ultra, GeForce 8800 GTX, GeForce 8800 GTS (G80)
  Quadro, NVS: Quadro FX 5600, Quadro FX 4600, Quadro Plex 2100 S4
  Tesla/Datacenter: Tesla C870, Tesla D870, Tesla S870

Compute capability 1.1 (Tesla)
  GPUs: G92, G94, G96, G98, G84, G86
  GeForce: GeForce GTS 250, GeForce 9800 GX2, GeForce 9800 GTX, GeForce 9800 GT, GeForce 8800 GTS (G92), GeForce 8800 GT, GeForce 9600 GT, GeForce 9500 GT, GeForce 9400 GT, GeForce 8600 GTS, GeForce 8600 GT, GeForce 8500 GT, GeForce G110M, GeForce 9300M GS, GeForce 9200M GS, GeForce 9100M G, GeForce 8400M GT, GeForce G105M
  Quadro, NVS: Quadro FX 4700 X2, Quadro FX 3700, Quadro FX 1800, Quadro FX 1700, Quadro FX 580, Quadro FX 570, Quadro FX 470, Quadro FX 380, Quadro FX 370, Quadro FX 370 Low Profile, Quadro NVS 450, Quadro NVS 420,[40] Quadro NVS 290, Quadro NVS 295, Quadro Plex 2100 D4, Quadro FX 3800M, Quadro FX 3700M, Quadro FX 3600M, Quadro FX 2800M, Quadro FX 2700M, Quadro FX 1700M, Quadro FX 1600M, Quadro FX 770M, Quadro FX 570M, Quadro FX 370M, Quadro FX 360M, Quadro NVS 320M, Quadro NVS 160M, Quadro NVS 150M, Quadro NVS 140M, Quadro NVS 135M, Quadro NVS 130M

Compute capability 1.2 (Tesla)
  GPUs: GT218, GT216, GT215
  GeForce: GeForce GT 340*, GeForce GT 330*, GeForce GT 320*, GeForce 315*, GeForce 310*, GeForce GT 240, GeForce GT 220, GeForce 210, GeForce GTS 360M, GeForce GTS 350M, GeForce GT 335M, GeForce GT 330M, GeForce GT 325M, GeForce GT 240M, GeForce G210M, GeForce 310M, GeForce 305M
  Quadro, NVS: Quadro FX 380 Low Profile, Quadro FX 1800M, Quadro FX 880M, Quadro FX 380M, Nvidia NVS 300, NVS 5100M, NVS 3100M, NVS 2100M, ION

Compute capability 1.3 (Tesla)
  GPUs: GT200, GT200b
  GeForce: GeForce GTX 295, GTX 285, GTX 280, GeForce GTX 275, GeForce GTX 260
  Quadro, NVS: Quadro FX 5800, Quadro FX 4800, Quadro FX 4800 for Mac, Quadro FX 3800, Quadro CX, Quadro Plex 2200 D2
  Tesla/Datacenter: Tesla C1060, Tesla S1070, Tesla M1060

Compute capability 2.0 (Fermi)
  GPUs: GF100, GF110
  GeForce: GeForce GTX 590, GeForce GTX 580, GeForce GTX 570, GeForce GTX 480, GeForce GTX 470, GeForce GTX 465, GeForce GTX 480M
  Quadro, NVS: Quadro 6000, Quadro 5000, Quadro 4000, Quadro 4000 for Mac, Quadro Plex 7000, Quadro 5010M, Quadro 5000M
  Tesla/Datacenter: Tesla C2075, Tesla C2050/C2070, Tesla M2050/M2070/M2075/M2090

Compute capability 2.1 (Fermi)
  GPUs: GF104, GF106, GF108, GF114, GF116, GF117, GF119
  GeForce: GeForce GTX 560 Ti, GeForce GTX 550 Ti, GeForce GTX 460, GeForce GTS 450, GeForce GTS 450*, GeForce GT 640 (GDDR3), GeForce GT 630, GeForce GT 620, GeForce GT 610, GeForce GT 520, GeForce GT 440, GeForce GT 440*, GeForce GT 430, GeForce GT 430*, GeForce GT 420*, GeForce GTX 675M, GeForce GTX 670M, GeForce GT 635M, GeForce GT 630M, GeForce GT 625M, GeForce GT 720M, GeForce GT 620M, GeForce 710M, GeForce 610M, GeForce 820M, GeForce GTX 580M, GeForce GTX 570M, GeForce GTX 560M, GeForce GT 555M, GeForce GT 550M, GeForce GT 540M, GeForce GT 525M, GeForce GT 520MX, GeForce GT 520M, GeForce GTX 485M, GeForce GTX 470M, GeForce GTX 460M, GeForce GT 445M, GeForce GT 435M, GeForce GT 420M, GeForce GT 415M, GeForce 410M
  Quadro, NVS: Quadro 2000, Quadro 2000D, Quadro 600, Quadro 4000M, Quadro 3000M, Quadro 2000M, Quadro 1000M, NVS 310, NVS 315, NVS 5400M, NVS 5200M, NVS 4200M

Compute capability 3.0 (Kepler)
  GPUs: GK104, GK106, GK107
  GeForce: GeForce GTX 770, GeForce GTX 760, GeForce GT 740, GeForce GTX 690, GeForce GTX 680, GeForce GTX 670, GeForce GTX 660 Ti, GeForce GTX 660, GeForce GTX 650 Ti BOOST, GeForce GTX 650 Ti, GeForce GTX 650, GeForce GTX 880M, GeForce GTX 870M, GeForce GTX 780M, GeForce GTX 770M, GeForce GTX 765M, GeForce GTX 760M, GeForce GTX 680MX, GeForce GTX 680M, GeForce GTX 675MX, GeForce GTX 670MX, GeForce GTX 660M, GeForce GT 750M, GeForce GT 650M, GeForce GT 745M, GeForce GT 645M, GeForce GT 740M, GeForce GT 730M, GeForce GT 640M, GeForce GT 640M LE, GeForce GT 735M
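  Quadro, NVS: Quadro K5000, Quadro K4200, Quadro K4000, Quadro K2000, Quadro K2000D, Quadro K600, Quadro K420, Quadro K500M, Quadro K510M, Quadro K610M, Quadro K1000M, Quadro K2000M, Quadro K1100M, Quadro K2100M, Quadro K3000M, Quadro K3100M, Quadro K4000M, Quadro K5000M, Quadro K4100M, Quadro K5100M, NVS 510, Quadro 410
  Tesla/Datacenter: Tesla K10, GRID K340, GRID K520, GRID K2

Compute capability 3.2 (Kepler)
  GPUs: GK20A
  Tegra, Jetson, DRIVE: Tegra K1, Jetson TK1

Compute capability 3.5 (Kepler)
  GPUs: GK110, GK208
  GeForce: GeForce GTX Titan Z, GeForce GTX Titan Black, GeForce GTX Titan, GeForce GTX 780 Ti, GeForce GTX 780, GeForce GT 640 (GDDR5), GeForce GT 630 v2, GeForce GT 730, GeForce GT 720, GeForce GT 710, GeForce GT 740M (64-bit, DDR3), GeForce GT 920M
  Quadro, NVS: Quadro K6000, Quadro K5200
  Tesla/Datacenter: Tesla K40, Tesla K20x, Tesla K20

Compute capability 3.7 (Kepler)
  GPUs: GK210
  Tesla/Datacenter: Tesla K80

Compute capability 5.0 (Maxwell)
  GPUs: GM107, GM108
  GeForce: GeForce GTX 750 Ti, GeForce GTX 750, GeForce GTX 960M, GeForce GTX 950M, GeForce 940M, GeForce 930M, GeForce GTX 860M, GeForce GTX 850M, GeForce 845M, GeForce 840M, GeForce 830M
  Quadro, NVS: Quadro K1200, Quadro K2200, Quadro K620, Quadro M2000M, Quadro M1000M, Quadro M600M, Quadro K620M, NVS 810
  Tesla/Datacenter: Tesla M10

Compute capability 5.2 (Maxwell)
  GPUs: GM200, GM204, GM206
  GeForce: GeForce GTX Titan X, GeForce GTX 980 Ti, GeForce GTX 980, GeForce GTX 970, GeForce GTX 960, GeForce GTX 950, GeForce GTX 750 SE, GeForce GTX 980M, GeForce GTX 970M, GeForce GTX 965M
  Quadro, NVS: Quadro M6000 24GB, Quadro M6000, Quadro M5000, Quadro M4000, Quadro M2000, Quadro M5500, Quadro M5000M, Quadro M4000M, Quadro M3000M
  Tesla/Datacenter: Tesla M4, Tesla M40, Tesla M6, Tesla M60

Compute capability 5.3 (Maxwell)
  GPUs: GM20B
  Tegra, Jetson, DRIVE: Tegra X1, Jetson TX1, Jetson Nano, DRIVE CX, DRIVE PX

Compute capability 6.0 (Pascal)
  GPUs: GP100
  Quadro, NVS: Quadro GP100
  Tesla/Datacenter: Tesla P100

Compute capability 6.1 (Pascal)
  GPUs: GP102, GP104, GP106, GP107, GP108
  GeForce: Nvidia TITAN Xp, Titan X, GeForce GTX 1080 Ti, GTX 1080, GTX 1070 Ti, GTX 1070, GTX 1060, GTX 1050 Ti, GTX 1050, GT 1030, GT 1010, MX350, MX330, MX250, MX230, MX150, MX130, MX110
  Quadro, NVS: Quadro P6000, Quadro P5000, Quadro P4000, Quadro P2200, Quadro P2000, Quadro P1000, Quadro P400, Quadro P500, Quadro P520, Quadro P600, Quadro P5000 (Mobile), Quadro P4000 (Mobile), Quadro P3000 (Mobile)
  Tesla/Datacenter: Tesla P40, Tesla P6, Tesla P4

Compute capability 6.2 (Pascal)
  GPUs: GP10B[41]
  Tegra, Jetson, DRIVE: Tegra X2, Jetson TX2, DRIVE PX 2

Compute capability 7.0 (Volta)
  GPUs: GV100
  GeForce: NVIDIA TITAN V
  Quadro, NVS: Quadro GV100
  Tesla/Datacenter: Tesla V100, Tesla V100S

Compute capability 7.2 (Volta)
  GPUs: GV10B,[42] GV11B[43][44]
  Tegra, Jetson, DRIVE: Tegra Xavier, Jetson Xavier NX, Jetson AGX Xavier, DRIVE AGX Xavier, DRIVE AGX Pegasus, Clara AGX

Compute capability 7.5 (Turing)
  GPUs: TU102, TU104, TU106, TU116, TU117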
  GeForce: NVIDIA TITAN RTX, GeForce RTX 2080 Ti, RTX 2080 Super, RTX 2080, RTX 2070 Super, RTX 2070, RTX 2060 Super, RTX 2060 12GB, RTX 2060, GeForce GTX 1660 Ti, GTX 1660 Super, GTX 1660, GTX 1650 Super, GTX 1650, MX550, MX450
  Quadro, NVS: Quadro RTX 8000, Quadro RTX 6000, Quadro RTX 5000, Quadro RTX 4000, T1000, T600, T400, T1200 (mobile), T600 (mobile), T500 (mobile), Quadro T2000 (mobile), Quadro T1000 (mobile)
  Tesla/Datacenter: Tesla T4

Compute capability 8.0 (Ampere)
  GPUs: GA100
  Tesla/Datacenter: A100 80GB, A100 40GB, A30

Compute capability 8.6 (Ampere)
  GPUs: GA102, GA104, GA106, GA107
  GeForce: GeForce RTX 3090 Ti, RTX 3090, RTX 3080 Ti, RTX 3080 12GB, RTX 3080, RTX 3070 Ti, RTX 3070, RTX 3060 Ti, RTX 3060, RTX 3050, RTX 3050 Ti (mobile), RTX 3050 (mobile), RTX 2050 (mobile), MX570
  Quadro, NVS: RTX A6000, RTX A5000, RTX A4500, RTX A4000, RTX A2000, RTX A5000 (mobile), RTX A4000 (mobile), RTX A3000 (mobile), RTX A2000 (mobile)
  Tesla/Datacenter: A40, A16, A10, A2

Compute capability 8.7 (Ampere)
  GPUs: GA10B
  Tegra, Jetson, DRIVE: Jetson Orin NX, Jetson AGX Orin, DRIVE AGX Orin, Clara Holoscan

Compute capability 9.0 (Hopper)
  GPUs: GH100, GH202

Compute capability 9.x (Lovelace)
  GPUs: AD102, AD104, AD106, AD107

Compute capability 9.x (Lovelace)
  GPUs: AD10B

'*' – OEM-only products

Version features and specifications

Feature support (unlisted features are supported for all compute capabilities). The table columns are compute capabilities 1.0, 1.1, 1.2, 1.3, 2.x, 3.0, 3.2, 3.5/3.7/5.0/5.2, 5.3, 6.x, 7.x, 8.0, 8.6, 9.0 and 9.x. Each feature is marked "No" below the version given and "Yes" from that version onward.

Supported from compute capability 1.1:
- Integer atomic functions operating on 32-bit words in global memory
- atomicExch() operating on 32-bit floating point values in global memory

Supported from compute capability 1.2:
- Integer atomic functions operating on 32-bit words in shared memory
- atomicExch() operating on 32-bit floating point values in shared memory
- Integer atomic functions operating on 64-bit words in global memory
- Warp vote functions

Supported from compute capability 1.3:
- Double-precision floating-point operations

Supported from compute capability 2.x:
- Atomic functions operating on 64-bit integer values in shared memory
- Floating-point atomic addition operating on 32-bit words in global and shared memory
- __ballot()
- __threadfence_system()
- __syncthreads_count(), __syncthreads_and(), __syncthreads_or()
- Surface functions
- 3D grid of thread block

Supported from compute capability 3.0:
- Warp shuffle functions, unified memory

Supported from compute capability 3.2:
- Funnel shift

Supported from compute capability 3.5:
- Dynamic parallelism
Supported from compute capability 5.3:
- Half-precision floating-point operations: addition, subtraction, multiplication, comparison, warp shuffle functions, conversion

Supported from compute capability 6.x:
- Atomic addition operating on 64-bit floating point values in global memory and shared memory

Supported from compute capability 7.x:
- Tensor core
- Mixed precision warp-matrix functions

Supported from compute capability 8.0:
- Hardware-accelerated async-copy
- Hardware-accelerated split arrive/wait barrier
- L2 cache residency management

[45]

Data type | Operation | Supported since | Supported since for global memory | Supported since for shared memory
16-bit integer | general operations | | |
32-bit integer | atomic functions | | 1.1 | 1.2
64-bit integer | atomic functions | | 1.2 | 2.0
16-bit floating point | addition, subtraction, multiplication, comparison, warp shuffle functions, conversion | 5.3 | |
32-bit floating point | atomicExch() | | 1.1 | 1.2
32-bit floating point | atomic addition | | 2.0 | 2.0
64-bit floating point | general operations | 1.3 | |
64-bit floating point | atomic addition | | 6.0 | 6.0
tensor core | | 7.0 | |

Note: Any missing lines or empty entries do reflect some lack of information on that exact item.
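Because the features above are gated on compute capability, a program can query the capability of each installed GPU at run time before choosing a code path. The sketch below is an illustration, not part of the article: the helper `atLeast` is hypothetical, and the thresholds it checks (1.3 for double precision, 5.3 for half precision, 6.0 for 64-bit floating-point atomic addition, 7.0 for tensor cores) are simply the "supported since" values listed in the table above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if capability major.minor >= reqMajor.reqMinor.
static bool atLeast(int major, int minor, int reqMajor, int reqMinor)
{
    return major > reqMajor || (major == reqMajor && minor >= reqMinor);
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::puts("No CUDA-capable device found");
        return 1;
    }

    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;

        // prop.major and prop.minor hold the compute capability, e.g. 8 and 6.
        std::printf("Device %d: %s, compute capability %d.%d\n",
                    d, prop.name, prop.major, prop.minor);

        // Thresholds taken from the "supported since" table.
        std::printf("  double precision:       %s\n",
                    atLeast(prop.major, prop.minor, 1, 3) ? "yes" : "no");
        std::printf("  half precision:         %s\n",
                    atLeast(prop.major, prop.minor, 5, 3) ? "yes" : "no");
        std::printf("  64-bit float atomicAdd: %s\n",
                    atLeast(prop.major, prop.minor, 6, 0) ? "yes" : "no");
        std::printf("  tensor cores:           %s\n",
                    atLeast(prop.major, prop.minor, 7, 0) ? "yes" : "no");
    }
    return 0;
}
```

Device code can make the same distinction at compile time through the `__CUDA_ARCH__` macro, which nvcc defines as 100 × major + 10 × minor for the architecture being compiled.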
[46]

Data type | Supported since for dense matrices | Supported since for sparse matrices | Volta / Professional Turing FMA per cycle per Tensor Core | Consumer Turing FMA per cycle per Tensor Core | Professional Ampere / Orin FMA per cycle per Tensor Core | Consumer Ampere FMA per cycle per Tensor Core
1-bit values | 7.5 | - | 1024 | 1024 | 4096 | 2048
4-bit integers | 7.5 | 8.0 | 256 | 256 | 1024 | 512
8-bit integers | 7.2 | 8.0 | 128 | 128 | 512 | 256
16-bit floating point FP16 with FP16 accumulate | 7.0 | 8.0 | 64 | 64 | 256 | 128
16-bit floating point FP16 with FP32 accumulate | 7.0 | 8.0 | 64 | 32 | 256 | 64
16-bit floating point BF16 with FP32 accumulate | 8.0 | 8.0 | - | - | 256 | 64
32-bit (19 bits used) floating point TF32 | 8.0 | 8.0 | - | - | 128 | 32
64-bit floating point | 8.0 | 8.0 | - | - | 16 |

[47] [48] [49] [50]

Technical specifications

Compute capability (version): 1.0, 1.1, 1.2, 1.3, 2.x, 3.0, 3.2, 3.5, 3.7, 5.0, 5.2, 5.3, 6.0, 6.1, 6.2, 7.0, 7.2, 7.5, 8.0, 8.6, 9.0, 9.x

Each row lists its values in order of increasing compute capability; a value applies to one or more consecutive versions.

- Maximum number of resident grids per device (concurrent kernel execution): t.b.d., 16, 4, 32, 16, 128, 32, 16, 128, 16, 128
- Maximum dimensionality of grid of thread blocks: 2, 3
- Maximum x-dimension of a grid of thread blocks: 65535, 2^31 − 1
- Maximum y- or z-dimension of a grid of thread blocks: 65535
- Maximum dimensionality of thread block: 3
- Maximum x- or y-dimension of a block: 512, 1024
- Maximum z-dimension of a block: 64
- Maximum number of threads per block: 512, 1024
- Warp size: 32
- Maximum number of resident blocks per multiprocessor: 8, 16, 32, 16, 32, 16
- Maximum number of resident warps per multiprocessor: 24, 32, 48, 64, 32, 64, 48
- Maximum number of resident threads per multiprocessor: 768, 1024, 1536, 2048, 1024, 2048, 1536
- Number of 32-bit registers per multiprocessor: 8 K, 16 K, 32 K, 64 K, 128 K, 64 K
- Maximum number of 32-bit registers per thread block: N/A, 32 K, 64 K, 32 K, 64 K, 32 K, 64 K, 32 K, 64 K
- Maximum number of 32-bit registers per thread: 124, 63, 255
- Maximum amount of shared memory per multiprocessor: 16 KB, 48 KB, 112 KB, 64 KB, 96 KB, 64 KB, 96 KB, 64 KB, 96 KB (of 128), 64 KB (of 96), 164 KB (of 192), 100 KB (of 128)
- Maximum amount of shared memory per thread block: 48 KB, 96 KB, 48 KB, 64 KB, 163 KB, 99 KB
- Number of shared memory banks: 16, 32
- Amount of local memory per thread: 16 KB, 512 KB
- Constant memory size: 64 KB
- Cache working set per multiprocessor for constant memory: 8 KB, 4 KB, 8 KB
- Cache working set per multiprocessor for texture memory: 6–8 KB, 12 KB, 12–48 KB, 24 KB, 48 KB, N/A, 24 KB, 48 KB, 24 KB, 32–128 KB, 32–64 KB, 28–192 KB, 28–128 KB
- Maximum width for 1D texture reference bound to a CUDA array: 8192, 65536, 131072
- Maximum width for 1D texture reference bound to linear memory: 2^27, 2^28, 2^27, 2^28, 2^27, 2^28
- Maximum width and number of layers for a 1D layered texture reference: 8192 × 512, 16384 × 2048, 32768 × 2048
- Maximum width and height for 2D texture reference bound to a CUDA array: 65536 × 32768, 65536 × 65535, 131072 × 65536
- Maximum width and height for 2D texture reference bound to linear memory: 65000 × 65000, 65536 × 65536, 131072 × 65000
- Maximum width and height for 2D texture reference bound to a CUDA array supporting texture gather: N/A, 16384 × 16384, 32768 × 32768
- Maximum width, height, and number of layers for a 2D layered texture reference: 8192 × 8192 × 512, 16384 × 16384 × 2048, 32768 × 32768 × 2048
- Maximum width, height and depth for a 3D texture reference bound to linear memory or a CUDA array: 2048^3, 4096^3, 16384^3
- Maximum width (and height) for a cubemap texture reference: N/A, 16384, 32768
- Maximum width (and height) and number of layers for a cubemap layered texture reference: N/A, 16384 × 2046, 32768 × 2046
- Maximum number of textures that can be bound to a kernel: 128, 256
- Maximum width for a 1D surface reference bound to a CUDA array: Not supported, 65536, 16384, 32768
- Maximum width and number of layers for a 1D layered surface reference: 65536 × 2048, 16384 × 2048, 32768 × 2048
- Maximum width and height for a 2D surface reference bound to a CUDA array: 65536 × 32768, 16384 × 65536, 131072 × 65536
- Maximum width, height, and number of layers for a 2D layered surface reference: 65536 × 32768 × 2048, 16384 × 16384 × 2048, 32768 × 32768 × 2048
- Maximum width, height, and depth for a 3D surface reference bound to a CUDA array: 65536 × 32768 × 2048, 4096 × 4096 × 4096, 16384 × 16384 × 16384
- Maximum width (and height) for a cubemap surface reference bound to a CUDA array: 32768, 16384, 32768
- Maximum width and number of layers for a cubemap layered surface reference: 32768 × 2046, 16384 × 2046, 32768 × 2046
- Maximum number of surfaces that can be bound to a kernel: 8, 16, 32
- Maximum number of instructions per kernel: 2 million, 512 million

[51]

Architecture specifications

Compute capability (version): 1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 3.0, 3.5, 3.7, 5.0, 5.2, 6.0, 6.1/6.2, 7.0/7.2, 7.5, 8.0, 8.6, 9.0, 9.x

Each row lists its values in order of increasing compute capability; a value applies to one or more consecutive versions.

- Number of ALU lanes for integer and single-precision floating-point arithmetic operations: 8,[52] 32, 48, 192, 128, 64, 128, 64
- Number of special function units for single-precision floating-point transcendental functions: 2, 4, 8, 32, 16, 32, 16
- Number of texture filtering units for every texture address unit or render output unit (ROP): 2, 4, 8, 16, 8[53]
- Number of warp schedulers: 1, 2, 4, 2, 4
- Max number of instructions issued at once by a single scheduler: 1, 2,[54] 1
- Number of tensor cores: N/A, 8,[53] 4
- Size in KB of unified memory for data cache and shared memory per multiprocessor: t.b.d., 128, 96,[55] 192, 128

[56]

For more information read the Nvidia CUDA programming guide.[57]

Example

This example code in C++ loads a texture from an image into an array on the GPU:

    texture<float, 2, cudaReadModeElementType> tex;

    void foo()
    {
        cudaArray* cu_array;

        // Allocate array
        cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
        cudaMallocArray(&cu_array, &description, width, height);

        // Copy image data to array
        cudaMemcpyToArray(cu_array, image, width * height * sizeof(float), cudaMemcpyHostToDevice);

        // Set texture parameters (default)
        tex.addressMode[0] = cudaAddressModeClamp;
        tex.addressMode[1] = cudaAddressModeClamp;
        tex.filterMode = cudaFilterModePoint;
        tex.normalized = false; // do not normalize coordinates

        // Bind the array to the texture
        cudaBindTextureToArray(tex, cu_array);

        // Run kernel
        dim3 blockDim(16, 16, 1);
        dim3 gridDim((width + blockDim.x - 1) / blockDim.x, (height + blockDim.y - 1) / blockDim.y, 1);
        kernel<<<gridDim, blockDim>>>(d_data, height, width);

        // Unbind the array from the texture
        cudaUnbindTexture(tex);
    } // end foo()

    __global__ void kernel(float* odata, int height, int width)
    {
        unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x