CUDA
From Wikipedia, the free encyclopedia
Parallel computing platform and programming model
Developer(s): Nvidia
Initial release: June 23, 2007
Stable release: 11.6.1 / February 22, 2022
Operating system: Windows, Linux
Platform: Supported GPUs
Type: GPGPU
License: Proprietary
Website: developer.nvidia.com/cuda-zone
CUDA (or Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for general purpose processing, an approach called general-purpose computing on GPUs (GPGPU). CUDA is a software layer that gives direct access to the GPU's virtual instruction set and parallel computational elements, for the execution of compute kernels.[1]
CUDA is designed to work with programming languages such as C, C++, and Fortran. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced skills in graphics programming.[2] CUDA-powered GPUs also support programming frameworks such as OpenMP, OpenACC and OpenCL,[3][1] and HIP by compiling such code to CUDA.
CUDA was created by Nvidia.[4] When it was first introduced, the name was an acronym for Compute Unified Device Architecture,[5] but Nvidia later dropped the common use of the acronym.
Contents
1 Background
2 Programming abilities
3 Advantages
4 Limitations
5 GPUs supported
6 Version features and specifications
7 Example
8 Current and future usages of CUDA architecture
9 See also
10 References
11 External links
Background
Further information: Graphics processing unit
The graphics processing unit (GPU), as a specialized computer processor, addresses the demands of real-time high-resolution 3D graphics compute-intensive tasks. By 2012, GPUs had evolved into highly parallel multi-core systems allowing very efficient manipulation of large blocks of data. This design is more effective than general-purpose central processing units (CPUs) for algorithms in situations where processing large blocks of data is done in parallel, such as:
push–relabel maximum flow algorithm
fast sort algorithms of large lists
two-dimensional fast wavelet transform
molecular dynamics simulations
machine learning
Programming abilities
Example of CUDA processing flow: (1) copy data from main memory to GPU memory; (2) CPU initiates the GPU compute kernel; (3) the GPU's CUDA cores execute the kernel in parallel; (4) copy the resulting data from GPU memory to main memory.
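A minimal CUDA C++ sketch of this four-step flow (the kernel name and problem size are illustrative, not from the original article):

#include <cuda_runtime.h>
#include <cstdio>

// Kernel executed in parallel by the GPU's CUDA cores (step 3).
__global__ void add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // Step 1: copy data from main memory to GPU memory.
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Step 2: CPU initiates the GPU compute kernel.
    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // Step 4: copy the resulting data from GPU memory back to main memory.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    delete[] h_a; delete[] h_b; delete[] h_c;
    return 0;
}

Such a file would be compiled with nvcc, e.g. nvcc example.cu -o example.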
The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives such as OpenACC, and extensions to industry-standard programming languages including C, C++ and Fortran. C/C++ programmers can use 'CUDA C/C++', compiled to PTX with nvcc, Nvidia's LLVM-based C/C++ compiler.[6] Fortran programmers can use 'CUDA Fortran', compiled with the PGI CUDA Fortran compiler from The Portland Group.
In addition to libraries, compiler directives, CUDA C/C++ and CUDA Fortran, the CUDA platform supports other computational interfaces, including the Khronos Group's OpenCL,[7] Microsoft's DirectCompute, OpenGL Compute Shader and C++ AMP.[8] Third party wrappers are also available for Python, Perl, Fortran, Java, Ruby, Lua, Common Lisp, Haskell, R, MATLAB, IDL, Julia, and native support in Mathematica.
In the computer game industry, GPUs are used for graphics rendering, and for game physics calculations (physical effects such as debris, smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate non-graphical applications in computational biology, cryptography and other fields by an order of magnitude or more.[9][10][11][12][13]
CUDA provides both a low level API (CUDA Driver API, non single-source) and a higher level API (CUDA Runtime API, single-source). The initial CUDA SDK was made public on 15 February 2007, for Microsoft Windows and Linux. Mac OS X support was later added in version 2.0,[14] which supersedes the beta released February 14, 2008.[15] CUDA works with all Nvidia GPUs from the G8x series onwards, including GeForce, Quadro and the Tesla line. CUDA is compatible with most standard operating systems.
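For contrast with the single-source Runtime API shown above, a minimal hedged sketch of the lower-level Driver API; the PTX file name and kernel name are hypothetical, and the program links against the driver library (-lcuda):

#include <cuda.h>   // CUDA Driver API

int main() {
    cuInit(0);                          // initialize the driver
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

    // Load PTX previously produced by nvcc (hypothetical file/kernel names).
    CUmodule mod;   cuModuleLoad(&mod, "add.ptx");
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "add");

    int n = 1024;
    CUdeviceptr d_buf;
    cuMemAlloc(&d_buf, n * sizeof(float));

    // Kernel arguments are passed as an array of pointers to the values.
    void* args[] = { &d_buf, &n };
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1,   // grid dimensions
                   256, 1, 1,                   // block dimensions
                   0, nullptr, args, nullptr);  // shared mem, stream, args
    cuCtxSynchronize();

    cuMemFree(d_buf);
    cuCtxDestroy(ctx);
    return 0;
}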
CUDA 8.0 comes with the following libraries (for compilation & runtime, in alphabetical order); a brief cuBLAS usage sketch follows the list:
cuBLAS – CUDA Basic Linear Algebra Subroutines library
CUDART – CUDA Runtime library
cuFFT – CUDA Fast Fourier Transform library
cuRAND – CUDA Random Number Generation library
cuSOLVER – CUDA based collection of dense and sparse direct solvers
cuSPARSE – CUDA Sparse Matrix library
NPP – NVIDIA Performance Primitives library
nvGRAPH – NVIDIA Graph Analytics library
NVML – NVIDIA Management Library
NVRTC – NVIDIA Runtime Compilation library for CUDA C++
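As a quick illustration of one of these libraries, a hedged cuBLAS sketch computing y = alpha*x + y (SAXPY); the array contents are illustrative:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 4;
    float h_x[n] = {1, 2, 3, 4};
    float h_y[n] = {10, 20, 30, 40};

    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 2.0f;
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = alpha*x + y

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%f\n", h_y[i]);

    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_y);
    return 0;
}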
CUDA 8.0 comes with these other software components:
nView – NVIDIA nView Desktop Management Software
NVWMI – NVIDIA Enterprise Management Toolkit
GameWorks PhysX – a multi-platform game physics engine
CUDA 9.0–9.2 comes with these other components:
CUTLASS 1.0 – custom linear algebra algorithms
NVCUVID – NVIDIA Video Decoder, deprecated in CUDA 9.2; it is now available in the NVIDIA Video Codec SDK
CUDA 10 comes with these other components:
nvJPEG – hybrid (CPU and GPU) JPEG processing
CUDA 11.0–11.6 comes with these other components:[16][17][18][19]
CUB – one of the newly supported C++ libraries
MIG – multi-instance GPU support
nvJPEG2000 – JPEG 2000 encoder and decoder
Advantages
CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs:
Scattered reads – code can read from arbitrary addresses in memory.
Unified virtual memory (CUDA 4.0 and above)
Unified memory (CUDA 6.0 and above)
Shared memory – CUDA exposes a fast shared memory region that can be shared among threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups (see the sketch after this list).[20]
Faster downloads and readbacks to and from the GPU
Full support for integer and bitwise operations, including integer texture lookups
On RTX 20 and 30 series cards, the CUDA cores are used for a feature called "RTX IO", which uses them to dramatically decrease game-loading times.
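A minimal sketch of shared memory used as a user-managed cache: a 1D stencil where each block stages its tile (plus halo cells) in fast on-chip shared memory before computing. Tile size and kernel name are illustrative, and the launch is assumed to use blocks of BLOCK threads:

#define RADIUS 3
#define BLOCK  256

// Each block loads BLOCK + 2*RADIUS input elements into shared memory
// once; every thread then reads its neighborhood from the on-chip cache
// instead of issuing 2*RADIUS+1 separate global-memory loads.
__global__ void stencil1d(const float* in, float* out, int n) {
    __shared__ float tile[BLOCK + 2 * RADIUS];

    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // local index in tile

    tile[l] = (g < n) ? in[g] : 0.0f;
    if (threadIdx.x < RADIUS) {                     // load halo cells
        tile[l - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        tile[l + BLOCK]  = (g + BLOCK < n) ? in[g + BLOCK] : 0.0f;
    }
    __syncthreads();                                // tile fully populated

    if (g < n) {
        float sum = 0.0f;
        for (int o = -RADIUS; o <= RADIUS; ++o) sum += tile[l + o];
        out[g] = sum;
    }
}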
Limitations
Whether for the host computer or the GPU device, all CUDA source code is now processed according to C++ syntax rules.[21] This was not always the case. Earlier versions of CUDA were based on C syntax rules.[22] As with the more general case of compiling C code with a C++ compiler, it is therefore possible that old C-style CUDA source code will either fail to compile or will not behave as originally intended.
Interoperability with rendering languages such as OpenGL is one-way, with OpenGL having access to registered CUDA memory but CUDA not having access to OpenGL memory.
Copying between host and device memory may incur a performance hit due to system bus bandwidth and latency (this can be partly alleviated with asynchronous memory transfers, handled by the GPU's DMA engine).
Threads should be running in groups of at least 32 for best performance, with the total number of threads numbering in the thousands. Branches in the program code do not affect performance significantly, provided that each of 32 threads takes the same execution path; the SIMD execution model becomes a significant limitation for any inherently divergent task (e.g. traversing a space partitioning data structure during ray tracing).
No emulator or fallback functionality is available for modern revisions.
Valid C++ may sometimes be flagged and prevent compilation due to the way the compiler approaches optimization for target GPU device limitations.[citation needed]
C++ run-time type information (RTTI) and C++-style exception handling are only supported in host code, not in device code.
In single-precision on first generation CUDA compute capability 1.x devices, denormal numbers are unsupported and are instead flushed to zero, and the precision of both the division and square root operations is slightly lower than IEEE 754-compliant single precision math. Devices that support compute capability 2.0 and above support denormal numbers, and the division and square root operations are IEEE 754 compliant by default. However, users can obtain the prior faster gaming-grade math of compute capability 1.x devices if desired by setting compiler flags to disable accurate divisions and accurate square roots, and enable flushing denormal numbers to zero (see the compiler-flag sketch after this list).[23]
Unlike OpenCL, CUDA-enabled GPUs are only available from Nvidia.[24] Attempts to implement CUDA on other GPUs include:
Project Coriander: converts CUDA C++11 source to OpenCL 1.2 C. A fork of CUDA-on-CL intended to run TensorFlow.[25][26][27]
CU2CL: converts CUDA 3.2 C++ to OpenCL C.[28]
GPUOpen HIP: a thin abstraction layer on top of CUDA and ROCm intended for AMD and Nvidia GPUs. Has a conversion tool for importing CUDA C++ source. Supports CUDA 4.0 plus C++11 and float16.
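As referenced in the denormal-flushing item above, the relevant nvcc flags are sketched here (the source file name is illustrative):

nvcc -ftz=true -prec-div=false -prec-sqrt=false example.cu -o example

or, enabling all fast-math shortcuts at once:

nvcc --use_fast_math example.cu -o example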
GPUs supported
Supported CUDA compute capability by GPU and card; see also the list at Nvidia:
CUDA SDK 1.0 support for compute capability 1.0–1.1 (Tesla)[29]
CUDA SDK 1.1 support for compute capability 1.0–1.1+x (Tesla)
CUDA SDK 2.0 support for compute capability 1.0–1.1+x (Tesla)
CUDA SDK 2.1–2.3.1 support for compute capability 1.0–1.3 (Tesla)[30][31][32][33]
CUDA SDK 3.0–3.1 support for compute capability 1.0–2.0 (Tesla, Fermi)[34][35]
CUDA SDK 3.2 support for compute capability 1.0–2.1 (Tesla, Fermi)[36]
CUDA SDK 4.0–4.2 support for compute capability 1.0–2.1+x (Tesla, Fermi, more?).
CUDA SDK 5.0–5.5 support for compute capability 1.0–3.5 (Tesla, Fermi, Kepler).
CUDA SDK 6.0 support for compute capability 1.0–3.5 (Tesla, Fermi, Kepler).
CUDA SDK 6.5 support for compute capability 1.1–5.x (Tesla, Fermi, Kepler, Maxwell). Last version with support for compute capability 1.x (Tesla).
CUDA SDK 7.0–7.5 support for compute capability 2.0–5.x (Fermi, Kepler, Maxwell).
CUDA SDK 8.0 support for compute capability 2.0–6.x (Fermi, Kepler, Maxwell, Pascal). Last version with support for compute capability 2.x (Fermi) (Pascal GTX 1070 Ti not supported).
CUDA SDK 9.0–9.2 support for compute capability 3.0–7.2 (Kepler, Maxwell, Pascal, Volta) (Pascal GTX 1070 Ti not supported by CUDA SDK 9.0; supported by CUDA SDK 9.2).
CUDA SDK 10.0–10.2 support for compute capability 3.0–7.5 (Kepler, Maxwell, Pascal, Volta, Turing). Last version with support for compute capability 3.x (Kepler). 10.2 is the last official release for macOS, as support will not be available for macOS in newer releases.
CUDA SDK 11.0 support for compute capability 3.5–8.0 (Kepler (in part), Maxwell, Pascal, Volta, Turing, Ampere (in part)).[37] New data types: Bfloat16 and TF32 on third-generation Tensor Cores.[38]
CUDA SDK 11.1–11.6 support for compute capability 3.5–8.6 (Kepler (in part), Maxwell, Pascal, Volta, Turing, Ampere).[39]
Compute capability (version), micro-architecture, GPU chips, and products (GeForce; Quadro/NVS; Tesla/Datacenter; Tegra/Jetson/DRIVE):

1.0 – Tesla micro-architecture – G80
GeForce: GeForce 8800 Ultra, GeForce 8800 GTX, GeForce 8800 GTS (G80)
Quadro/NVS: Quadro FX 5600, Quadro FX 4600, Quadro Plex 2100 S4
Tesla/Datacenter: Tesla C870, Tesla D870, Tesla S870

1.1 – Tesla – G92, G94, G96, G98, G84, G86
GeForce: GeForce GTS 250, GeForce 9800 GX2, GeForce 9800 GTX, GeForce 9800 GT, GeForce 8800 GTS (G92), GeForce 8800 GT, GeForce 9600 GT, GeForce 9500 GT, GeForce 9400 GT, GeForce 8600 GTS, GeForce 8600 GT, GeForce 8500 GT, GeForce G110M, GeForce 9300M GS, GeForce 9200M GS, GeForce 9100M G, GeForce 8400M GT, GeForce G105M
Quadro/NVS: Quadro FX 4700 X2, Quadro FX 3700, Quadro FX 1800, Quadro FX 1700, Quadro FX 580, Quadro FX 570, Quadro FX 470, Quadro FX 380, Quadro FX 370, Quadro FX 370 Low Profile, Quadro NVS 450, Quadro NVS 420, Quadro NVS 290, Quadro NVS 295, Quadro Plex 2100 D4, Quadro FX 3800M, Quadro FX 3700M, Quadro FX 3600M, Quadro FX 2800M, Quadro FX 2700M, Quadro FX 1700M, Quadro FX 1600M, Quadro FX 770M, Quadro FX 570M, Quadro FX 370M, Quadro FX 360M, Quadro NVS 320M, Quadro NVS 160M, Quadro NVS 150M, Quadro NVS 140M, Quadro NVS 135M, Quadro NVS 130M[40]

1.2 – Tesla – GT218, GT216, GT215
GeForce: GeForce GT 340*, GeForce GT 330*, GeForce GT 320*, GeForce 315*, GeForce 310*, GeForce GT 240, GeForce GT 220, GeForce 210, GeForce GTS 360M, GeForce GTS 350M, GeForce GT 335M, GeForce GT 330M, GeForce GT 325M, GeForce GT 240M, GeForce G210M, GeForce 310M, GeForce 305M
Quadro/NVS: Quadro FX 380 Low Profile, Quadro FX 1800M, Quadro FX 880M, Quadro FX 380M, Nvidia NVS 300, NVS 5100M, NVS 3100M, NVS 2100M, ION

1.3 – Tesla – GT200, GT200b
GeForce: GeForce GTX 295, GTX 285, GTX 280, GeForce GTX 275, GeForce GTX 260
Quadro/NVS: Quadro FX 5800, Quadro FX 4800, Quadro FX 4800 for Mac, Quadro FX 3800, Quadro CX, Quadro Plex 2200 D2
Tesla/Datacenter: Tesla C1060, Tesla S1070, Tesla M1060

2.0 – Fermi – GF100, GF110
GeForce: GeForce GTX 590, GeForce GTX 580, GeForce GTX 570, GeForce GTX 480, GeForce GTX 470, GeForce GTX 465, GeForce GTX 480M
Quadro/NVS: Quadro 6000, Quadro 5000, Quadro 4000, Quadro 4000 for Mac, Quadro Plex 7000, Quadro 5010M, Quadro 5000M
Tesla/Datacenter: Tesla C2075, Tesla C2050/C2070, Tesla M2050/M2070/M2075/M2090

2.1 – Fermi – GF104, GF106, GF108, GF114, GF116, GF117, GF119
GeForce: GeForce GTX 560 Ti, GeForce GTX 550 Ti, GeForce GTX 460, GeForce GTS 450, GeForce GTS 450*, GeForce GT 640 (GDDR3), GeForce GT 630, GeForce GT 620, GeForce GT 610, GeForce GT 520, GeForce GT 440, GeForce GT 440*, GeForce GT 430, GeForce GT 430*, GeForce GT 420*, GeForce GTX 675M, GeForce GTX 670M, GeForce GT 635M, GeForce GT 630M, GeForce GT 625M, GeForce GT 720M, GeForce GT 620M, GeForce 710M, GeForce 610M, GeForce 820M, GeForce GTX 580M, GeForce GTX 570M, GeForce GTX 560M, GeForce GT 555M, GeForce GT 550M, GeForce GT 540M, GeForce GT 525M, GeForce GT 520MX, GeForce GT 520M, GeForce GTX 485M, GeForce GTX 470M, GeForce GTX 460M, GeForce GT 445M, GeForce GT 435M, GeForce GT 420M, GeForce GT 415M, GeForce 410M
Quadro/NVS: Quadro 2000, Quadro 2000D, Quadro 600, Quadro 4000M, Quadro 3000M, Quadro 2000M, Quadro 1000M, NVS 310, NVS 315, NVS 5400M, NVS 5200M, NVS 4200M

3.0 – Kepler – GK104, GK106, GK107
GeForce: GeForce GTX 770, GeForce GTX 760, GeForce GT 740, GeForce GTX 690, GeForce GTX 680, GeForce GTX 670, GeForce GTX 660 Ti, GeForce GTX 660, GeForce GTX 650 Ti BOOST, GeForce GTX 650 Ti, GeForce GTX 650, GeForce GTX 880M, GeForce GTX 870M, GeForce GTX 780M, GeForce GTX 770M, GeForce GTX 765M, GeForce GTX 760M, GeForce GTX 680MX, GeForce GTX 680M, GeForce GTX 675MX, GeForce GTX 670MX, GeForce GTX 660M, GeForce GT 750M, GeForce GT 650M, GeForce GT 745M, GeForce GT 645M, GeForce GT 740M, GeForce GT 730M, GeForce GT 640M, GeForce GT 640M LE, GeForce GT 735M
Quadro/NVS: Quadro K5000, Quadro K4200, Quadro K4000, Quadro K2000, Quadro K2000D, Quadro K600, Quadro K420, Quadro K500M, Quadro K510M, Quadro K610M, Quadro K1000M, Quadro K2000M, Quadro K1100M, Quadro K2100M, Quadro K3000M, Quadro K3100M, Quadro K4000M, Quadro K5000M, Quadro K4100M, Quadro K5100M, NVS 510, Quadro 410
Tesla/Datacenter: Tesla K10, GRID K340, GRID K520, GRID K2

3.2 – Kepler – GK20A
Tegra/Jetson/DRIVE: Tegra K1, Jetson TK1

3.5 – Kepler – GK110, GK208
GeForce: GeForce GTX Titan Z, GeForce GTX Titan Black, GeForce GTX Titan, GeForce GTX 780 Ti, GeForce GTX 780, GeForce GT 640 (GDDR5), GeForce GT 630 v2, GeForce GT 730, GeForce GT 720, GeForce GT 710, GeForce GT 740M (64-bit, DDR3), GeForce GT 920M
Quadro/NVS: Quadro K6000, Quadro K5200
Tesla/Datacenter: Tesla K40, Tesla K20x, Tesla K20

3.7 – Kepler – GK210
Tesla/Datacenter: Tesla K80

5.0 – Maxwell – GM107, GM108
GeForce: GeForce GTX 750 Ti, GeForce GTX 750, GeForce GTX 960M, GeForce GTX 950M, GeForce 940M, GeForce 930M, GeForce GTX 860M, GeForce GTX 850M, GeForce 845M, GeForce 840M, GeForce 830M
Quadro/NVS: Quadro K1200, Quadro K2200, Quadro K620, Quadro M2000M, Quadro M1000M, Quadro M600M, Quadro K620M, NVS 810
Tesla/Datacenter: Tesla M10

5.2 – Maxwell – GM200, GM204, GM206
GeForce: GeForce GTX Titan X, GeForce GTX 980 Ti, GeForce GTX 980, GeForce GTX 970, GeForce GTX 960, GeForce GTX 950, GeForce GTX 750 SE, GeForce GTX 980M, GeForce GTX 970M, GeForce GTX 965M
Quadro/NVS: Quadro M6000 24GB, Quadro M6000, Quadro M5000, Quadro M4000, Quadro M2000, Quadro M5500, Quadro M5000M, Quadro M4000M, Quadro M3000M
Tesla/Datacenter: Tesla M4, Tesla M40, Tesla M6, Tesla M60

5.3 – Maxwell – GM20B
Tegra/Jetson/DRIVE: Tegra X1, Jetson TX1, Jetson Nano, DRIVE CX, DRIVE PX

6.0 – Pascal – GP100
Quadro/NVS: Quadro GP100
Tesla/Datacenter: Tesla P100

6.1 – Pascal – GP102, GP104, GP106, GP107, GP108
GeForce: Nvidia TITAN Xp, Titan X, GeForce GTX 1080 Ti, GTX 1080, GTX 1070 Ti, GTX 1070, GTX 1060, GTX 1050 Ti, GTX 1050, GT 1030, GT 1010, MX350, MX330, MX250, MX230, MX150, MX130, MX110
Quadro/NVS: Quadro P6000, Quadro P5000, Quadro P4000, Quadro P2200, Quadro P2000, Quadro P1000, Quadro P400, Quadro P500, Quadro P520, Quadro P600, Quadro P5000 (Mobile), Quadro P4000 (Mobile), Quadro P3000 (Mobile)
Tesla/Datacenter: Tesla P40, Tesla P6, Tesla P4

6.2 – Pascal – GP10B[41]
Tegra/Jetson/DRIVE: Tegra X2, Jetson TX2, DRIVE PX 2

7.0 – Volta – GV100
GeForce: NVIDIA TITAN V
Quadro/NVS: Quadro GV100
Tesla/Datacenter: Tesla V100, Tesla V100S

7.2 – Volta – GV10B,[42] GV11B[43][44]
Tegra/Jetson/DRIVE: Tegra Xavier, Jetson Xavier NX, Jetson AGX Xavier, DRIVE AGX Xavier, DRIVE AGX Pegasus, Clara AGX

7.5 – Turing – TU102, TU104, TU106, TU116, TU117
GeForce: NVIDIA TITAN RTX, GeForce RTX 2080 Ti, RTX 2080 Super, RTX 2080, RTX 2070 Super, RTX 2070, RTX 2060 Super, RTX 2060 12GB, RTX 2060, GeForce GTX 1660 Ti, GTX 1660 Super, GTX 1660, GTX 1650 Super, GTX 1650, MX550, MX450
Quadro/NVS: Quadro RTX 8000, Quadro RTX 6000, Quadro RTX 5000, Quadro RTX 4000, T1000, T600, T400, T1200 (mobile), T600 (mobile), T500 (mobile), Quadro T2000 (mobile), Quadro T1000 (mobile)
Tesla/Datacenter: Tesla T4

8.0 – Ampere – GA100
Tesla/Datacenter: A100 80GB, A100 40GB, A30

8.6 – Ampere – GA102, GA104, GA106, GA107
GeForce: GeForce RTX 3090 Ti, RTX 3090, RTX 3080 Ti, RTX 3080 12GB, RTX 3080, RTX 3070 Ti, RTX 3070, RTX 3060 Ti, RTX 3060, RTX 3050, RTX 3050 Ti (mobile), RTX 3050 (mobile), RTX 2050 (mobile), MX570
Quadro/NVS: RTX A6000, RTX A5000, RTX A4500, RTX A4000, RTX A2000, RTX A5000 (mobile), RTX A4000 (mobile), RTX A3000 (mobile), RTX A2000 (mobile)
Tesla/Datacenter: A40, A16, A10, A2

8.7 – Ampere – GA10B
Tegra/Jetson/DRIVE: Jetson Orin NX, Jetson AGX Orin, DRIVE AGX Orin, Clara Holoscan

9.0 – Hopper – GH100, GH202

9.x – Lovelace – AD102, AD104, AD106, AD107

9.x – Lovelace – AD10B

'*' – OEM-only products
Version features and specifications
Feature support (unlisted features are supported for all compute capabilities), listed with the first compute capability that supports the feature:
Integer atomic functions operating on 32-bit words in global memory, and atomicExch() operating on 32-bit floating point values in global memory – since 1.1
Integer atomic functions operating on 32-bit words in shared memory, atomicExch() operating on 32-bit floating point values in shared memory, integer atomic functions operating on 64-bit words in global memory, and warp vote functions – since 1.2
Double-precision floating-point operations – since 1.3
Atomic functions operating on 64-bit integer values in shared memory, floating-point atomic addition operating on 32-bit words in global and shared memory, _ballot(), _threadfence_system(), _syncthreads_count(), _syncthreads_and(), _syncthreads_or(), surface functions, and 3D grid of thread blocks – since 2.x
Warp shuffle functions and unified memory – since 3.0
Funnel shift – since 3.2
Dynamic parallelism – since 3.5
Half-precision floating-point operations (addition, subtraction, multiplication, comparison, warp shuffle functions, conversion) – since 5.3
Atomic addition operating on 64-bit floating point values in global memory and shared memory – since 6.x
Tensor cores and mixed precision warp-matrix functions – since 7.x
Hardware-accelerated async-copy, hardware-accelerated split arrive/wait barrier, and L2 cache residency management – since 8.0
[45]
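As an illustration of the warp shuffle functions available from compute capability 3.0, a minimal hedged sketch of a warp-level sum (using the _sync variants introduced in CUDA 9; the kernel framing is illustrative):

// Sums one value per lane across a 32-thread warp using warp shuffles,
// with no shared memory: each step halves the number of live values.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 holds the warp-wide sum
}

__global__ void sumKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    v = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0)   // one atomicAdd per warp
        atomicAdd(out, v);
}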
Data type, operation, and the compute capability supported since (for global memory and for shared memory, where they differ):
16-bit integer – general operations
32-bit integer – atomic functions – since 1.1 (global memory), 1.2 (shared memory)
64-bit integer – atomic functions – since 1.2 (global memory), 2.0 (shared memory)
16-bit floating point – addition, subtraction, multiplication, comparison, warp shuffle functions, conversion – since 5.3
32-bit floating point – atomicExch() – since 1.1 (global memory), 1.2 (shared memory)
32-bit floating point – atomic addition – since 2.0 (global and shared memory)
64-bit floating point – general operations – since 1.3
64-bit floating point – atomic addition – since 6.0 (global and shared memory)
tensor core – since 7.0
Note: any missing lines or empty entries reflect a lack of information on that exact item.
[46]
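A hedged sketch of the atomic operations above: a 32-bit integer atomicAdd (global memory atomics from compute capability 1.1) and a double-precision atomicAdd (compute capability 6.0 and later). The histogram framing is illustrative:

// Builds a 256-bin histogram: many threads may hit the same bin at once,
// so each increment must be an atomic read-modify-write.
__global__ void histogram(const unsigned char* data, int n, unsigned int* bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // 32-bit integer atomic
}

// Accumulates a double-precision sum; hardware double atomicAdd
// requires compute capability 6.0+ (compile with e.g. -arch=sm_60).
__global__ void sumDouble(const double* data, int n, double* total) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(total, data[i]);       // 64-bit floating point atomic
}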
Tensor core throughput by data type; each entry lists the compute capability supported since for dense and for sparse matrices, then FMA per cycle per tensor core for Volta and professional Turing / consumer Turing / professional Ampere and Orin / consumer Ampere ('–' means not supported):
1-bit values (dense since 7.5; no sparse support): 1024 / 1024 / 4096 / 2048
4-bit integers (dense since 7.5; sparse since 8.0): 256 / 256 / 1024 / 512
8-bit integers (dense since 7.2; sparse since 8.0): 128 / 128 / 512 / 256
16-bit floating point FP16 with FP16 accumulate (dense since 7.0; sparse since 8.0): 64 / 64 / 256 / 128
16-bit floating point FP16 with FP32 accumulate (dense since 7.0; sparse since 8.0): 64 / 32 / 256 / 64
16-bit floating point BF16 with FP32 accumulate (dense and sparse since 8.0): – / – / 256 / 64
32-bit (19 bits used) floating point TF32 (dense and sparse since 8.0): – / – / 128 / 32
64-bit floating point (dense and sparse since 8.0): – / – / 16 / (not listed)
[47][48][49][50]
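The tensor cores above are exposed in CUDA C++ through the warp-matrix (WMMA) API; a minimal hedged sketch of a single 16×16×16 FP16 multiply with FP32 accumulate (compute capability 7.0 and later, compiled with e.g. -arch=sm_70; the kernel name and packed 16×16 layout are assumptions):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A*B + C for a single 16x16x16 tile on the tensor
// cores; 'a' and 'b' are FP16, accumulation is FP32. Leading dimensions
// assume tightly packed 16x16 matrices.
__global__ void wmmaTile(const half* a, const half* b, float* d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);            // C = 0
    wmma::load_matrix_sync(aFrag, a, 16);        // load A tile
    wmma::load_matrix_sync(bFrag, b, 16);        // load B tile
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // tensor core FMA
    wmma::store_matrix_sync(d, cFrag, 16, wmma::mem_row_major);
}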
Technical specifications per compute capability, from 1.0 through 9.x (a single value applies across all compute capabilities):
Maximum number of resident grids per device (concurrent kernel execution): t.b.d. (1.x), 16 (2.x–3.0), 4 (3.2), 32 (3.5–5.2), 16 (5.3), 128 (6.0), 32 (6.1), 16 (6.2), 128 (7.0), 16 (7.2), 128 (7.5 and later)
Maximum dimensionality of grid of thread blocks: 2 (1.x), 3 (2.x and later)
Maximum x-dimension of a grid of thread blocks: 65535 (1.x–2.x), 2^31 − 1 (3.0 and later)
Maximum y- or z-dimension of a grid of thread blocks: 65535
Maximum dimensionality of thread block: 3
Maximum x- or y-dimension of a block: 512 (1.x), 1024 (2.x and later)
Maximum z-dimension of a block: 64
Maximum number of threads per block: 512 (1.x), 1024 (2.x and later)
Warp size: 32
Maximum number of resident blocks per multiprocessor: 8 (1.x–2.x), 16 (3.x), 32 (5.0–7.2), 16 (7.5), 32 (8.0), 16 (8.6 and later)
Maximum number of resident warps per multiprocessor: 24 (1.0–1.1), 32 (1.2–1.3), 48 (2.x), 64 (3.0–7.2), 32 (7.5), 64 (8.0), 48 (8.6 and later)
Maximum number of resident threads per multiprocessor: 768 (1.0–1.1), 1024 (1.2–1.3), 1536 (2.x), 2048 (3.0–7.2), 1024 (7.5), 2048 (8.0), 1536 (8.6 and later)
Number of 32-bit registers per multiprocessor: 8 K (1.0–1.1), 16 K (1.2–1.3), 32 K (2.x), 64 K (3.0–3.5), 128 K (3.7), 64 K (5.0 and later)
Maximum number of 32-bit registers per thread block: N/A (1.x), 32 K (2.x), 64 K (3.0), 32 K (3.2), 64 K (3.5–5.2), 32 K (5.3), 64 K (6.0–6.1), 32 K (6.2), 64 K (7.0 and later)
Maximum number of 32-bit registers per thread: 124 (1.x), 63 (2.x–3.0), 255 (3.2 and later)
Maximum amount of shared memory per multiprocessor: 16 KB (1.x), 48 KB (2.x–3.5), 112 KB (3.7), 64 KB (5.0), 96 KB (5.2), 64 KB (5.3–6.0), 96 KB (6.1), 64 KB (6.2), 96 KB of 128 (7.0–7.2), 64 KB of 96 (7.5), 164 KB of 192 (8.0), 100 KB of 128 (8.6)
Maximum amount of shared memory per thread block: 48 KB (1.x–6.2), 96 KB (7.0), 48 KB (7.2), 64 KB (7.5), 163 KB (8.0), 99 KB (8.6)
Number of shared memory banks: 16 (1.x), 32 (2.x and later)
Amount of local memory per thread: 16 KB (1.x), 512 KB (2.x and later)
Constant memory size: 64 KB
Cache working set per multiprocessor for constant memory: 8 KB (1.x–3.x), 4 KB (5.x), 8 KB (6.0 and later)
Cache working set per multiprocessor for texture memory: from 6–8 KB (1.x) up to 28–192 KB (8.0) and 28–128 KB (8.6); intermediate capabilities range over 12 KB, 12–48 KB, 24 KB, 48 KB, N/A, 24 KB, 48 KB, 24 KB, 32–128 KB and 32–64 KB
Maximum width for 1D texture reference bound to a CUDA array: 8192 (1.x), 65536 (2.x–5.x), 131072 (6.0 and later)
Maximum width for 1D texture reference bound to linear memory: 2^27 or 2^28, depending on compute capability (2^27 on the earliest devices, 2^28 on the newest)
Maximum width and number of layers for a 1D layered texture reference: 8192 × 512 (1.x), 16384 × 2048 (2.x–5.x), 32768 × 2048 (6.0 and later)
Maximum width and height for 2D texture reference bound to a CUDA array: 65536 × 32768 (1.x–2.x), 65536 × 65535 (3.x–5.x), 131072 × 65536 (6.0 and later)
Maximum width and height for 2D texture reference bound to linear memory: 65000 × 65000 (1.x–2.x), 65536 × 65536 (3.x–5.x), 131072 × 65000 (6.0 and later)
Maximum width and height for 2D texture reference bound to a CUDA array supporting texture gather: N/A (1.x), 16384 × 16384 (2.x–5.x), 32768 × 32768 (6.0 and later)
Maximum width, height, and number of layers for a 2D layered texture reference: 8192 × 8192 × 512 (1.x), 16384 × 16384 × 2048 (2.x–5.x), 32768 × 32768 × 2048 (6.0 and later)
Maximum width, height and depth for a 3D texture reference bound to linear memory or a CUDA array: 2048^3 (1.x), 4096^3 (2.x–5.x), 16384^3 (6.0 and later)
Maximum width (and height) for a cubemap texture reference: N/A (1.x), 16384 (2.x–5.x), 32768 (6.0 and later)
Maximum width (and height) and number of layers for a cubemap layered texture reference: N/A (1.x), 16384 × 2046 (2.x–5.x), 32768 × 2046 (6.0 and later)
Maximum number of textures that can be bound to a kernel: 128 (1.x), 256 (2.x and later)
Maximum width for a 1D surface reference bound to a CUDA array: not supported (1.x), 65536 (2.x), 16384 (3.x–5.x), 32768 (6.0 and later)
Maximum width and number of layers for a 1D layered surface reference: 65536 × 2048 (2.x), 16384 × 2048 (3.x–5.x), 32768 × 2048 (6.0 and later)
Maximum width and height for a 2D surface reference bound to a CUDA array: 65536 × 32768 (2.x), 16384 × 65536 (3.x–5.x), 131072 × 65536 (6.0 and later)
Maximum width, height, and number of layers for a 2D layered surface reference: 65536 × 32768 × 2048 (2.x), 16384 × 16384 × 2048 (3.x–5.x), 32768 × 32768 × 2048 (6.0 and later)
Maximum width, height, and depth for a 3D surface reference bound to a CUDA array: 65536 × 32768 × 2048 (2.x), 4096 × 4096 × 4096 (3.x–5.x), 16384 × 16384 × 16384 (6.0 and later)
Maximum width (and height) for a cubemap surface reference bound to a CUDA array: 32768 (2.x), 16384 (3.x–5.x), 32768 (6.0 and later)
Maximum width and number of layers for a cubemap layered surface reference: 32768 × 2046 (2.x), 16384 × 2046 (3.x–5.x), 32768 × 2046 (6.0 and later)
Maximum number of surfaces that can be bound to a kernel: 8 (2.x), 16 (3.x–5.x), 32 (6.0 and later)
Maximum number of instructions per kernel: 2 million (1.x), 512 million (2.x and later)
[51]
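Many of these limits can be queried at run time; a minimal sketch using the Runtime API's cudaGetDeviceProperties:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Device:                   %s\n", prop.name);
    printf("Compute capability:       %d.%d\n", prop.major, prop.minor);
    printf("Warp size:                %d\n", prop.warpSize);
    printf("Max threads per block:    %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("32-bit registers / block: %d\n", prop.regsPerBlock);
    printf("Max grid size:            %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}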
Architecture specifications per compute capability:
Number of ALU lanes for integer and single-precision floating-point arithmetic operations: 8 (1.x),[52] 32 (2.0), 48 (2.1), 192 (3.x), 128 (5.x), 64 (6.0), 128 (6.1–6.2), 64 (7.0 and later)
Number of special function units for single-precision floating-point transcendental functions: 2 (1.x), 4 (2.0), 8 (2.1), 32 (3.x–5.x), 16 (6.0), 32 (6.1–6.2), 16 (7.0 and later)
Number of texture filtering units for every texture address unit or render output unit (ROP): 2, 4, 8, 16, then 8,[53] from oldest to newest compute capability
Number of warp schedulers: 1 (1.x), 2 (2.x), 4 (3.x–5.x), 2 (6.0), 4 (6.1 and later)
Max number of instructions issued at once by a single scheduler: 1 (up to 2.0), 2 (2.1–6.x),[54] 1 (7.0 and later)
Number of tensor cores: N/A (before 7.0), 8 (7.0–7.5),[53] 4 (8.0 and later)
Size in KB of unified memory for data cache and shared memory per multiprocessor: t.b.d. (before 7.0), 128 (7.0–7.2), 96 (7.5),[55] 192 (8.0), 128 (8.6)[56]
For more information read the Nvidia CUDA programming guide.[57]
Example
This example code in C++ loads a texture from an image into an array on the GPU. What follows is a sketch of the classic version of this example, using the legacy texture-reference API (deprecated in CUDA 11, removed in CUDA 12); the image, d_data, width, and height names are assumed inputs:
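// Sketch using the legacy texture-reference API (removed in CUDA 12).
// 'image' (host pixels), 'd_data' (device output), 'width' and 'height'
// are assumed to be provided by the caller.
texture<float, 2, cudaReadModeElementType> tex;

__global__ void kernel(float* odata, int height, int width);

void foo(const float* image, float* d_data, int width, int height)
{
    cudaArray* cu_array;

    // Allocate array
    cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
    cudaMallocArray(&cu_array, &description, width, height);

    // Copy image data to array
    cudaMemcpyToArray(cu_array, 0, 0, image,
                      width * height * sizeof(float),
                      cudaMemcpyHostToDevice);

    // Set texture parameters (default)
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;
    tex.normalized = false;  // do not normalize coordinates

    // Bind the array to the texture
    cudaBindTextureToArray(tex, cu_array);

    // Run kernel
    dim3 blockDim(16, 16, 1);
    dim3 gridDim((width + blockDim.x - 1) / blockDim.x,
                 (height + blockDim.y - 1) / blockDim.y, 1);
    kernel<<<gridDim, blockDim, 0>>>(d_data, height, width);

    // Unbind the array from the texture
    cudaUnbindTexture(tex);
}

__global__ void kernel(float* odata, int height, int width)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float c = tex2D(tex, x, y);  // read through the texture cache
        odata[y * width + x] = c;
    }
}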