1. CUDA 11.6 Release Notes
1.1. CUDA Toolkit Major Component Versions
1.2. General CUDA
1.3. CUDA Compilers
1.4. CUDA Developer Tools
1.5. Resolved Issues
1.5.1. CUDA Compilers
1.6. Deprecated Features
1.7. Known Issues
1.7.1. General CUDA
1.7.2. CUDA Compiler
2. CUDA Libraries
2.1. cuBLAS Library
2.1.1. cuBLAS: Release 11.6
2.1.2. cuBLAS: Release 11.4 Update 3
2.1.3. cuBLAS: Release 11.4 Update 2
2.1.4. cuBLAS: Release 11.4
2.1.5. cuBLAS: Release 11.3 Update 1
2.1.6. cuBLAS: Release 11.3
2.1.7. cuBLAS: Release 11.2
2.1.8. cuBLAS: Release 11.1 Update 1
2.1.9. cuBLAS: Release 11.1
2.1.10. cuBLAS: Release 11.0 Update 1
2.1.11. cuBLAS: Release 11.0
2.1.12. cuBLAS: Release 11.0 RC
2.2. cuFFT Library
2.2.1. cuFFT: Release 11.5
2.2.2. cuFFT: Release 11.4 Update 2
2.2.3. cuFFT: Release 11.4 Update 1
2.2.4. cuFFT: Release 11.4
2.2.5. cuFFT: Release 11.3
2.2.6. cuFFT: Release 11.2 Update 2
2.2.7. cuFFT: Release 11.2 Update 1
2.2.8. cuFFT: Release 11.2
2.2.9. cuFFT: Release 11.1
2.2.10. cuFFT: Release 11.0 RC
2.3. cuRAND Library
2.3.1. cuRAND: Release 11.5 Update 1
2.3.2. cuRAND: Release 11.3
2.3.3. cuRAND: Release 11.0 Update 1
2.3.4. cuRAND: Release 11.0
2.3.5. cuRAND: Release 11.0 RC
2.4. cuSOLVER Library
2.4.1. cuSOLVER: Release 11.4
2.4.2. cuSOLVER: Release 11.3
2.4.3. cuSOLVER: Release 11.2 Update 2
2.4.4. cuSOLVER: Release 11.2
2.4.5. cuSOLVER: Release 11.1 Update 1
2.4.6. cuSOLVER: Release 11.1
2.4.7. cuSOLVER: Release 11.0
2.5. cuSPARSE Library
2.5.1. cuSPARSE: Release 11.6 Update 1
2.5.2. cuSPARSE: Release 11.6
2.5.3. cuSPARSE: Release 11.5 Update 1
2.5.4. cuSPARSE: Release 11.4 Update 1
2.5.5. cuSPARSE: Release 11.4
2.5.6. cuSPARSE: Release 11.3 Update 1
2.5.7. cuSPARSE: Release 11.3
2.5.8. cuSPARSE: Release 11.2 Update 2
2.5.9. cuSPARSE: Release 11.2 Update 1
2.5.10. cuSPARSE: Release 11.2
2.5.11. cuSPARSE: Release 11.1 Update 1
2.5.12. cuSPARSE: Release 11.0
2.5.13. cuSPARSE: Release 11.0 RC
2.6. Math Library
2.6.1. CUDA Math: Release 11.6
2.6.2. CUDA Math: Release 11.5
2.6.3. CUDA Math: Release 11.4
2.6.4. CUDA Math: Release 11.3
2.6.5. CUDA Math: Release 11.1
2.6.6. CUDA Math: Release 11.0 Update 1
2.6.7. CUDA Math: Release 11.0 RC
2.7. NVIDIA Performance Primitives (NPP)
2.7.1. NPP: Release 11.5
2.7.2. NPP: Release 11.4
2.7.3. NPP: Release 11.3
2.7.4. NPP: Release 11.2 Update 2
2.7.5. NPP: Release 11.2 Update 1
2.7.6. NPP: Release 11.0
2.7.7. NPP: Release 11.0 RC
2.8. nvJPEG Library
2.8.1. nvJPEG: Release 11.5 Update 1
2.8.2. nvJPEG: Release 11.4
2.8.3. nvJPEG: Release 11.2 Update 1
2.8.4. nvJPEG: Release 11.1 Update 1
2.8.5. nvJPEG: Release 11.0 Update 1
2.8.6. nvJPEG: Release 11.0
2.8.7. nvJPEG: Release 11.0 RC
Last updated February 22, 2022
NVIDIA CUDA Toolkit Release Notes
The Release Notes for the CUDA Toolkit.
1. CUDA 11.6 Release Notes
The release notes for the NVIDIA® CUDA® Toolkit can be found online at http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html.
Note: The release notes have been reorganized into two major sections: the general CUDA release notes, and the CUDA libraries release notes, including historical information for 11.x releases.
1.1. CUDA Toolkit Major Component Versions
CUDA Components
Starting with CUDA 11, the various components in the toolkit are versioned independently. For CUDA 11.6, the table below indicates the versions:
Table 1. CUDA 11.6 Update 1 Component Versions

Component Name | Version Information | Supported Architectures
CUDA C++ Core Compute Libraries | 11.6.55 | x86_64, POWER, Arm64
CUDA Runtime (cudart) | 11.6.55 | x86_64, POWER, Arm64
cuobjdump | 11.6.112 | x86_64, POWER, Arm64
CUPTI | 11.6.112 | x86_64, POWER, Arm64
CUDA cuxxfilt (demangler) | 11.6.112 | x86_64, POWER, Arm64
CUDA Demo Suite | 11.6.55 | x86_64
CUDA GDB | 11.6.112 | x86_64, POWER, Arm64
CUDA Memcheck | 11.6.112 | x86_64, POWER
CUDA Nsight | 11.6.112 | x86_64, POWER
CUDA NVCC | 11.6.112 | x86_64, POWER, Arm64
CUDA nvdisasm | 11.6.104 | x86_64, POWER, Arm64
CUDA NVML Headers | 11.6.55 | x86_64, POWER, Arm64
CUDA nvprof | 11.6.112 | x86_64, POWER, Arm64
CUDA nvprune | 11.6.112 | x86_64, POWER, Arm64
CUDA NVRTC | 11.6.112 | x86_64, POWER, Arm64
CUDA NVTX | 11.6.112 | x86_64, POWER, Arm64
CUDA NVVP | 11.6.112 | x86_64, POWER
CUDA Samples | 11.6.101 | x86_64, POWER, Arm64
CUDA Compute Sanitizer API | 11.6.112 | x86_64, POWER, Arm64
CUDA cuBLAS | 11.8.1.74 | x86_64, POWER, Arm64
CUDA cuFFT | 10.7.1.112 | x86_64, POWER, Arm64
CUDA cuFile | 1.2.1.4 | x86_64
CUDA cuRAND | 10.2.9.55 | x86_64, POWER, Arm64
CUDA cuSOLVER | 11.3.3.112 | x86_64, POWER, Arm64
CUDA cuSPARSE | 11.7.2.112 | x86_64, POWER, Arm64
CUDA NPP | 11.6.2.112 | x86_64, POWER, Arm64
CUDA nvJPEG | 11.6.1.112 | x86_64, POWER, Arm64
Nsight Compute | 2022.1.1.2 | x86_64, POWER, Arm64 (CLI only)
NVTX | 1.21018621 | x86_64, POWER, Arm64
Nsight Systems | 2021.5.2.53 | x86_64, POWER, Arm64 (CLI only)
Nsight Visual Studio Edition (VSE) | 2022.1.1.22006 | x86_64 (Windows)
nvidia_fs¹ | 2.11.0 | x86_64
Visual Studio Integration | 11.6.112 | x86_64 (Windows)
NVIDIA Linux Driver | 510.47.03 | x86_64, POWER, Arm64
NVIDIA Windows Driver | 511.65 | x86_64 (Windows)
CUDA Driver
Running a CUDA application requires a system with at least one CUDA-capable GPU and a driver that is compatible with the CUDA Toolkit. See Table 3. For more information on various GPU products that are CUDA-capable, visit https://developer.nvidia.com/cuda-gpus.
Each release of the CUDA Toolkit requires a minimum version of the CUDA driver. The CUDA driver is backward compatible, meaning that applications compiled against a particular version of CUDA will continue to work on subsequent (later) driver releases.
More information on compatibility can be found at https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#cuda-compatibility-and-upgrades.
Note: Starting with CUDA 11.0, the toolkit components are individually versioned, and the toolkit itself is versioned as shown in the table below.
The minimum required driver version for CUDA minor version compatibility is shown below. CUDA minor version compatibility is described in detail in https://docs.nvidia.com/deploy/cuda-compatibility/index.html.
Table 2. CUDA Toolkit and Minimum Required Driver Version for CUDA Minor Version Compatibility

CUDA Toolkit | Linux x86_64 Minimum Required Driver Version* | Windows x86_64 Minimum Required Driver Version*
CUDA 11.6.x | >=450.80.02 | >=452.39
CUDA 11.5.x | >=450.80.02 | >=452.39
CUDA 11.4.x | >=450.80.02 | >=452.39
CUDA 11.3.x | >=450.80.02 | >=452.39
CUDA 11.2.x | >=450.80.02 | >=452.39
CUDA 11.1 (11.1.0) | >=450.80.02 | >=452.39
CUDA 11.0 (11.0.3) | >=450.36.06** | >=451.22**
* Using a Minimum Required Version that is different from the Toolkit Driver Version could be allowed in compatibility mode -- please read the CUDA Compatibility Guide for details.
** CUDA 11.0 was released with an earlier driver version, but by upgrading to Tesla Recommended Drivers 450.80.02 (Linux) / 452.39 (Windows), minor version compatibility is possible across the CUDA 11.x family of toolkits.
The version of the development NVIDIA GPU Driver packaged in each CUDA Toolkit release is shown below.
Table 3. CUDA Toolkit and Corresponding Driver Versions

CUDA Toolkit | Linux x86_64 Toolkit Driver Version | Windows x86_64 Toolkit Driver Version
CUDA 11.6 Update 1 | >=510.47.03 | >=511.65
CUDA 11.6 GA | >=510.39.01 | >=511.23
CUDA 11.5 Update 2 | >=495.29.05 | >=496.13
CUDA 11.5 Update 1 | >=495.29.05 | >=496.13
CUDA 11.5 GA | >=495.29.05 | >=496.04
CUDA 11.4 Update 4 | >=470.82.01 | >=472.50
CUDA 11.4 Update 3 | >=470.82.01 | >=472.50
CUDA 11.4 Update 2 | >=470.57.02 | >=471.41
CUDA 11.4 Update 1 | >=470.57.02 | >=471.41
CUDA 11.4.0 GA | >=470.42.01 | >=471.11
CUDA 11.3.1 Update 1 | >=465.19.01 | >=465.89
CUDA 11.3.0 GA | >=465.19.01 | >=465.89
CUDA 11.2.2 Update 2 | >=460.32.03 | >=461.33
CUDA 11.2.1 Update 1 | >=460.32.03 | >=461.09
CUDA 11.2.0 GA | >=460.27.03 | >=460.82
CUDA 11.1.1 Update 1 | >=455.32 | >=456.81
CUDA 11.1 GA | >=455.23 | >=456.38
CUDA 11.0.3 Update 1 | >=450.51.06 | >=451.82
CUDA 11.0.2 GA | >=450.51.05 | >=451.48
CUDA 11.0.1 RC | >=450.36.06 | >=451.22
CUDA 10.2.89 | >=440.33 | >=441.22
CUDA 10.1 (10.1.105 general release, and updates) | >=418.39 | >=418.96
CUDA 10.0.130 | >=410.48 | >=411.31
CUDA 9.2 (9.2.148 Update 1) | >=396.37 | >=398.26
CUDA 9.2 (9.2.88) | >=396.26 | >=397.44
CUDA 9.1 (9.1.85) | >=390.46 | >=391.29
CUDA 9.0 (9.0.76) | >=384.81 | >=385.54
CUDA 8.0 (8.0.61 GA2) | >=375.26 | >=376.51
CUDA 8.0 (8.0.44) | >=367.48 | >=369.30
CUDA 7.5 (7.5.16) | >=352.31 | >=353.66
CUDA 7.0 (7.0.28) | >=346.46 | >=347.62
For convenience, the NVIDIA driver is installed as part of the CUDA Toolkit installation. Note that this driver is for development purposes and is not recommended for use in production with Tesla GPUs.
For running CUDA applications in production with Tesla GPUs, it is recommended to download the latest driver for Tesla GPUs from the NVIDIA driver downloads site at https://www.nvidia.com/drivers.
During the installation of the CUDA Toolkit, the installation of the NVIDIA driver may be skipped on Windows (when using the interactive or silent installation) or on Linux (by using meta packages).
For more information on customizing the install process on Windows, see https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#install-cuda-software. For meta packages on Linux, see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-metas.
1.2. General CUDA
11.6
Added a new API, cudaGraphNodeSetEnabled(), to allow disabling nodes in an instantiated graph. Support is limited to kernel nodes in this release. A corresponding API, cudaGraphNodeGetEnabled(), allows querying the enabled state of a node.
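A minimal sketch of the pair of APIs (error checking omitted; the kernel, launch configuration, and variable names are illustrative): a one-kernel graph is captured and instantiated, then the node is disabled so later launches skip it without re-instantiating the graph.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void work(int *x) { if (threadIdx.x == 0) atomicAdd(x, 1); }

int main() {
    int *x;
    cudaMallocManaged(&x, sizeof(int));
    *x = 0;
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Capture a one-kernel graph and instantiate it.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    work<<<1, 32, 0, s>>>(x);
    cudaStreamEndCapture(s, &graph);
    cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);

    // Disable the kernel node in the instantiated graph; it becomes a no-op.
    cudaGraphNode_t node;
    size_t n = 1;
    cudaGraphGetNodes(graph, &node, &n);
    cudaGraphNodeSetEnabled(exec, node, 0);

    unsigned int enabled;
    cudaGraphNodeGetEnabled(exec, node, &enabled);   // enabled == 0

    cudaGraphLaunch(exec, s);                        // the kernel is skipped
    cudaStreamSynchronize(s);
    printf("enabled=%u x=%d\n", enabled, *x);        // prints enabled=0 x=0
    return 0;
}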
Full release of the 128-bit integer (__int128) data type, including compiler and developer tools support. The host-side compiler must support the __int128 type to use this feature.
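A brief device-code sketch, assuming a host compiler with __int128 support such as gcc or clang on x86_64 (the kernel and input values are illustrative):

#include <cuda_runtime.h>
#include <cstdio>

// Compute the upper 64 bits of a full 128-bit product in device code.
__global__ void wide_mul(long long a, long long b, long long *hi) {
    __int128 p = (__int128)a * b;   // full 128-bit product
    *hi = (long long)(p >> 64);     // upper 64 bits
}

int main() {
    long long *hi;
    cudaMallocManaged(&hi, sizeof(long long));
    wide_mul<<<1, 1>>>(0x123456789abcdefLL, 0x0fedcba987654321LL, hi);
    cudaDeviceSynchronize();
    printf("high 64 bits: 0x%llx\n", (unsigned long long)*hi);
    return 0;
}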
The cooperative groups namespace is updated with new functions to improve consistency in naming, function scope, and unit dimension/size:

Implicit Group/Member | Threads | Blocks
thread_block:: | dim_threads, num_threads, thread_rank, thread_index | (not needed)
grid_group:: | num_threads, thread_rank | num_blocks, block_rank, block_index
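A minimal sketch using the renamed thread_block accessors (the kernel and launch configuration are illustrative):

#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

__global__ void who_am_i() {
    cg::thread_block blk = cg::this_thread_block();
    if (blk.thread_rank() == 0) {
        // num_threads() and dim_threads() replace the older size()/group_dim().
        printf("block of %u threads, dim (%u,%u,%u)\n",
               blk.num_threads(),
               blk.dim_threads().x, blk.dim_threads().y, blk.dim_threads().z);
    }
}

int main() {
    who_am_i<<<2, 64>>>();
    cudaDeviceSynchronize();
    return 0;
}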
Added the ability to disable NULL kernel graph node launches.
Added new NVML public APIs for querying functionality under Wayland.
Added L2 cache control descriptors for atomics.
Large CPU page support for UVM managed memory.
1.3. CUDA Compilers
11.6
VS 2022 Support: CUDA 11.6 officially supports the latest VS 2022 as host compiler. A separate Nsight Visual Studio installer, 2022.1.1, must be downloaded. A future CUDA release will have the Nsight Visual Studio installer with VS 2022 support integrated into it.
New instructions in public PTX: New instructions for bit mask creation (BMSK) and sign extension (SZEXT) are added to the public PTX ISA. You can find documentation for these instructions in the PTX ISA guide: BMSK and SZEXT.
Unused Kernel Optimization: In CUDA 11.5, unused kernel pruning was introduced, with the potential benefits of reducing binary size and improving performance through more efficient optimizations. This was an opt-in feature, but in 11.6 the feature is enabled by default. As mentioned in the CUDA 11.5 blog post, there is an opt-out flag that can be used in case it becomes necessary for debug purposes or for other special situations:
$ nvcc -rdc=true user.cu testlib.a -o user -Xnvlink -ignore-host-info
In addition to the -arch=all and -arch=all-major options added in CUDA 11.5, NVCC introduced -arch=native in CUDA 11.5 update 1. The -arch=native option is a convenient way for users to let NVCC determine the right target architecture to compile the CUDA device code for, based on the GPU installed on the system. This can be particularly helpful for testing when applications are run on the same system they are compiled on.
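For example (file names are illustrative):
$ nvcc -arch=native app.cu -o app      # target the GPU(s) present in this system
$ nvcc -arch=all app.cu -o app         # embed code for all supported architectures
$ nvcc -arch=all-major app.cu -o app   # embed code for all major architectures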
Generate PTX from nvlink: Using the following command line, the device linker, nvlink, will produce PTX as an output in addition to CUBIN:
nvcc -dlto -dlink -ptx
Device linking by nvlink is the final stage in the CUDA compilation process. Applications that have multiple source translation units have to be compiled in separate compilation mode. LTO (introduced in CUDA 11.4) allowed nvlink to perform optimizations at device link time instead of at compile time, so that separately compiled applications with several translation units can be optimized to the same level as whole-program compilations with a single translation unit. However, without the option to output PTX, applications that cared about forward compatibility of device code could not benefit from Link Time Optimization, or had to constrain the device code to a single source file.
With the option for nvlink that performs LTO to generate the output in PTX, customer applications that require forward compatibility across GPU architectures can span multiple files and can also take advantage of Link Time Optimization.
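A sketch of the workflow, assuming two translation units a.cu and b.cu (file names are illustrative):
$ nvcc -dc -dlto a.cu b.cu          # separate compilation with LTO intermediates
$ nvcc -dlto -dlink -ptx a.o b.o    # device link performing LTO, emitting PTX in addition to CUBIN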
Bullseye support: NVCC-compiled source code now works with the code coverage tool Bullseye. The code coverage is only for the CPU (host) functions; code coverage for device functions is not supported through Bullseye.
INT128 developer tools support: In 11.5, CUDA C++ support for 128-bit integers was added. In this release, developer tools support the data type as well. With the latest version of libcu++, the int128 data type is supported by math functions.
1.4. CUDA Developer Tools
For changes to nvprof and Visual Profiler, see the changelog.
For new features, improvements, and bug fixes in CUPTI, see the changelog.
For new features, improvements, and bug fixes in Nsight Compute, see the changelog.
1.5. Resolved Issues
1.5.1. CUDA Compilers
11.5 Update 1
When using the --fmad=false compiler option, even explicitly requested fused multiply-add instructions were decomposed into separate multiply and add, leading to loss of the algorithmic semantics intended by the programmer. One of the consequences was that CUDA Math APIs could not be trusted to deliver correct results; worst-case errors became unbounded. This issue was introduced in 11.5 and is now resolved.
Fixed a compiler optimization bug that could move memory access instructions across memory barriers, which could lead to incorrect runtime results with certain synchronization dependencies.
An issue in the PTX optimizer sometimes produced incorrect results. This issue is resolved.
11.5
Linking with cubins larger than 2 GB is now supported.
Certain C++17 features that were backported to C++14 in MSVC are now supported.
An issue with the use of a lambda function when an object is passed by value is resolved. https://github.com/Ahdhn/nvcc_bug_maybe
1.6. Deprecated Features
The following features are deprecated in the current release of the CUDA software. The features still work in the current release, but their documentation may have been removed, and they will become officially unsupported in a future release. We recommend that developers employ alternative solutions to these features in their software.
General CUDA
The cudaDeviceSynchronize() function used for on-device fork/join parallelism is deprecated in preparation for a replacement programming model with higher performance. These functions continue to work in this release, but the tools will emit a warning about the upcoming change.
CentOS Linux 8 reached End-of-Life on December 31, 2021, and support for this OS is now deprecated in the CUDA Toolkit. CentOS Linux 8 support will be completely removed in a future release.
1.7. Known Issues
1.7.1. General CUDA
Intermittent crashes were seen when CUDA binaries were running on a system with a GLIBC version older than 2.17-106.el7_2.1. This is due to a known bug in older versions of GLIBC (bug reference: https://bugzilla.redhat.com/show_bug.cgi?id=1293976) and has been fixed in later versions (>= glibc-2.17-107.el7).
1.7.2. CUDA Compiler
11.6 Update 1
Clang 13/PowerPC is not yet supported.
NVCC doesn't support the OpenMP 5.0 "#pragma begin/end declare variant ..." construct; any host compiler that supports OpenMP 5.0, such as clang 13, will not be supported with the -fopenmp option.
2. CUDA Libraries
This section covers CUDA Libraries release notes for 11.x releases.
The CUDA Math Libraries toolchain uses C++11 features, and a C++11-compatible standard library (libstdc++ >= 20150422) is required on the host.
CUDA Math libraries are no longer shipped for SM30 and SM32.
Support for the following compute capabilities is deprecated for all libraries:
sm_35 (Kepler)
sm_37 (Kepler)
sm_50 (Maxwell)
2.1. cuBLAS Library
2.1.1. cuBLAS: Release 11.6
New Features
New epilogue options have been added to support fusion in DL training: CUBLASLT_EPILOGUE_{DRELU,DGELU}, which are similar to CUBLASLT_EPILOGUE_{DRELU,DGELU}_BGRAD but don't compute the bias gradient.
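A minimal sketch of selecting the new epilogue on a matmul descriptor (error handling omitted; reluMask and maskLd are hypothetical names for the auxiliary ReLU bitmask produced by the forward pass and its leading dimension):

#include <cublasLt.h>

void set_drelu_epilogue(cublasLtMatmulDesc_t desc, void *reluMask, int64_t maskLd) {
    // Request the backward-ReLU epilogue without bias gradient computation.
    cublasLtEpilogue_t epi = CUBLASLT_EPILOGUE_DRELU;
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                   &epi, sizeof(epi));
    // DRELU consumes the auxiliary output saved during the forward pass.
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER,
                                   &reluMask, sizeof(reluMask));
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD,
                                   &maskLd, sizeof(maskLd));
    // ... then call cublasLtMatmul() with this descriptor as usual ...
}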
Resolved Issues
Some syrk-related functions (cublas{D,Z}syrk, cublas{D,Z}syr2k, cublas{D,Z}syrkx) could fail for matrices whose size is greater than 2^31.
2.1.2. cuBLAS: Release 11.4 Update 3
Resolved Issues
Some cuBLAS and cuBLASLt functions sometimes returned CUBLAS_STATUS_EXECUTION_FAILED if the dynamic library was loaded and unloaded several times during the application lifetime within the same CUDA context. This issue has been resolved.
2.1.3. cuBLAS: Release 11.4 Update 2
New Features
Vector (and batched) alpha support for per-row scaling in TN int32 math Matmul with int8 output. See CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST and CUBLASLT_MATMUL_DESC_ALPHA_VECTOR_BATCH_STRIDE.
New epilogue options have been added to support fusion in DL training: CUBLASLT_EPILOGUE_BGRADA and CUBLASLT_EPILOGUE_BGRADB, which compute bias gradients based on matrices A and B, respectively.
New auxiliary functions cublasGetStatusName() and cublasGetStatusString() have been added to cuBLAS; they return the string representation and the description of the cuBLAS status (cublasStatus_t), respectively. Similarly, cublasLtGetStatusName() and cublasLtGetStatusString() have been added to cuBLASLt.
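A minimal sketch of the new status helpers:

#include <cublas_v2.h>
#include <stdio.h>

/* Turn a cuBLAS status code into readable diagnostics. */
void report(cublasStatus_t st) {
    if (st != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "cuBLAS error %s: %s\n",
                cublasGetStatusName(st), cublasGetStatusString(st));
}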
Known Issues
cublasGemmBatchedEx() and cublas<t>gemmBatched() check the alignment of the arrays of input/output pointers as if they were pointers to the actual matrices. These checks are irrelevant and will be disabled in future releases. This mostly affects half-precision input GEMMs, which might require 16-byte alignment, while an array of pointers can only be aligned to an 8-byte boundary.
Resolved Issues
cublasLtMatrixTransform can now operate on matrices with dimensions greater than 65535.
Fixed out-of-bounds access in GEMM and Matmul functions when split-K or a non-default epilogue is used and the leading dimension of the output matrix exceeds the int32_t limit.
NVBLAS now uses lazy loading of the CPU BLAS library on Linux to avoid issues caused by preloading libnvblas.so in complex applications that use fork and similar APIs.
Resolved a symbol name conflict when using the cuBLASLt static library with static TensorRT or cuDNN libraries.
2.1.4. cuBLAS: Release 11.4
Resolved Issues
Some gemv cases were producing incorrect results if the matrix dimension (n or m) was large, for example 2^20.
2.1.5. cuBLAS: Release 11.3 Update 1
New Features
Some new kernels have been added for improved performance, but they have the limitation that only host pointers are supported for scalars (for example, the alpha and beta parameters). This limitation is expected to be resolved in a future release.
New epilogues have been added to support fusion in ML training. These include:
ReLuBias and GeluBias epilogues that produce an auxiliary output which is used on backward propagation to compute the corresponding gradients.
DReLuBGrad and DGeluBGrad epilogues that compute the backpropagation of the corresponding activation function on matrix C and produce the bias gradient as a separate output. These epilogues require the auxiliary input mentioned in the bullet above.
Resolved Issues
Some tensor-core-accelerated strided batched GEMM routines would result in misaligned memory access exceptions when the batch stride wasn't a multiple of 8.
Tensor-core-accelerated cublasGemmBatchedEx (pointer-array) routines would use slower variants of kernels, assuming bad alignment of the pointers in the pointer array. Now it assumes that pointers are well aligned, as noted in the documentation.
Known Issues
To be able to access the fastest possible kernels through cublasLtMatmulAlgoGetHeuristic(), you need to set CUBLASLT_MATMUL_PREF_POINTER_MODE_MASK in the search preferences to CUBLASLT_POINTER_MODE_MASK_HOST or CUBLASLT_POINTER_MODE_MASK_NO_FILTERING. By default, the heuristics query assumes the pointer mode may change later and only returns algo configurations that support both _HOST and _DEVICE modes. Without this, newly added kernels will be excluded, and it will likely lead to a performance penalty on some problem sizes.
Deprecated Features
Linking with the static cuBLAS and cuBLASLt libraries on Linux now requires gcc 5.2 and compatible or higher due to C++11 requirements in these libraries.
2.1.6. cuBLAS: Release 11.3
Known Issues
The planar complex matrix descriptor for batched matmul has an inconsistent interpretation of the batch offset.
Mixed-precision operations with the reduction scheme CUBLASLT_REDUCTION_SCHEME_OUTPUT_TYPE (which might be automatically selected based on problem size by cublasSgemmEx() or cublasGemmEx() too, unless the CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION math mode bit is set) not only store intermediate results in the output type but also accumulate them internally in the same precision, which may result in lower-than-expected accuracy. Please use CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK or CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION if this results in numerical precision issues in your application.
2.1.7. cuBLAS: Release 11.2
Known Issues
cublasGemm() with very large n and m=k=1 may fail on Pascal devices.
2.1.8. cuBLAS: Release 11.1 Update 1
New Features
cuBLASLt logging is officially stable and no longer experimental. The cuBLASLt logging APIs are still experimental and may change in future releases.
Resolved Issues
cublasLtMatmul failed on Volta architecture GPUs with CUBLAS_STATUS_EXECUTION_FAILED when the n dimension > 262,137 and the epilogue bias feature was being used. This issue exists in the 11.0 and 11.1 releases but has been corrected in 11.1 Update 1.
2.1.9. cuBLAS: Release 11.1
Resolved Issues
A performance regression in the cublasCgetrfBatched and cublasCgetriBatched routines has been fixed.
The IMMA kernels do not support padding in matrix C and may corrupt the data when matrix C with padding is supplied to cublasLtMatmul. A suggested workaround is to supply matrix C with a leading dimension equal to 32 times the number of rows when targeting the IMMA kernels: computeType = CUDA_R_32I and CUBLASLT_ORDER_COL32 for matrices A, C, D, and CUBLASLT_ORDER_COL4_4R2_8C (on NVIDIA Ampere GPU architecture or Turing architecture) or CUBLASLT_ORDER_COL32_2R_4R4 (on NVIDIA Ampere GPU architecture) for matrix B. The matmul descriptor must specify CUBLAS_OP_T on matrix B and CUBLAS_OP_N (default) on matrices A and C. The data corruption behavior was fixed so that CUBLAS_STATUS_NOT_SUPPORTED is returned instead.
Fixed an issue that caused an Address out of bounds error when calling cublasSgemm().
2.1.10. cuBLAS: Release 11.0 Update 1
New Features
The cuBLAS API was extended with a new function, cublasSetWorkspace(), which allows the user to set the cuBLAS library workspace to a user-owned device buffer, which will be used by cuBLAS to execute all subsequent calls to the library on the currently set stream.
The cuBLASLt experimental logging mechanism can be enabled in two ways:
By setting the following environment variables before launching the target application:
CUBLASLT_LOG_LEVEL=<level> -- where level is one of the following levels:
"0" - Off - logging is disabled (default)
"1" - Error - only errors will be logged
"2" - Trace - API calls that launch CUDA kernels will log their parameters and important information
"3" - Hints - hints that can potentially improve the application's performance
"4" - Heuristics - heuristics log that may help users to tune their parameters
"5" - API Trace - API calls will log their parameters and important information
CUBLASLT_LOG_MASK=<mask> -- where mask is a combination of the following masks:
"0" - Off
"1" - Error
"2" - Trace
"4" - Hints
"8" - Heuristics
"16" - API Trace
CUBLASLT_LOG_FILE=<value> -- where value is a file name in the format of "<file_name>.%i"; %i will be replaced with the process id. If CUBLASLT_LOG_FILE is not defined, the log messages are printed to stdout.
By using the runtime API functions defined in the cublasLt header:
typedef void(*cublasLtLoggerCallback_t)(int logLevel, const char* functionName, const char* message) -- a type of callback function pointer.
cublasStatus_t cublasLtLoggerSetCallback(cublasLtLoggerCallback_t callback) -- allows setting a callback function that will be called for every message that is logged by the library.
cublasStatus_t cublasLtLoggerSetFile(FILE* file) -- allows setting the output file for the logger. The file must be open and have write permissions.
cublasStatus_t cublasLtLoggerOpenFile(const char* logFile) -- allows giving a path in which the logger should create the log file.
cublasStatus_t cublasLtLoggerSetLevel(int level) -- allows setting the log level to one of the above-mentioned levels.
cublasStatus_t cublasLtLoggerSetMask(int mask) -- allows setting the log mask to a combination of the above-mentioned masks.
cublasStatus_t cublasLtLoggerForceDisable() -- allows disabling the logger for the entire session. Once this API has been called, the logger cannot be reactivated in the current session.
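For example, logging can be enabled for a single run with CUBLASLT_LOG_LEVEL=5 ./app, or programmatically; a minimal sketch of the callback route follows (the callback body is illustrative):

#include <cublasLt.h>
#include <stdio.h>

/* Hypothetical callback: forward every cuBLASLt log message to stderr. */
static void my_logger(int logLevel, const char *functionName, const char *message) {
    fprintf(stderr, "[cublasLt:%d] %s: %s\n", logLevel, functionName, message);
}

void enable_lt_logging(void) {
    cublasLtLoggerSetCallback(my_logger);
    cublasLtLoggerSetLevel(5);   /* 5 = API Trace, per the levels above */
}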
2.1.11. cuBLAS: Release 11.0
New Features
cuBLASLt Matrix Multiplication adds support for fused ReLU and bias operations for all floating-point types except double precision (FP64).
Improved batched TRSM performance for matrices larger than 256.
2.1.12. cuBLAS: Release 11.0 RC
New Features
Many performance improvements have been implemented for NVIDIA Ampere, Volta, and Turing architecture based GPUs.
The cuBLASLt logging mechanism can be enabled by setting the following environment variables before launching the target application:
CUBLASLT_LOG_LEVEL=<level> -- where level is one of the following levels:
"0" - Off - logging is disabled (default)
"1" - Error - only errors will be logged
"2" - Trace - API calls will be logged with their parameters and important information
CUBLASLT_LOG_FILE=<value> -- where value is a file name in the format of "<file_name>.%i"; %i will be replaced with the process id. If CUBLASLT_LOG_FILE is not defined, the log messages are printed to stdout.
For matrix multiplication APIs cublasGemmEx, cublasGemmBatchedEx, cublasGemmStridedBatchedEx, and cublasLtMatmul: added new data type support for __nv_bfloat16 (CUDA_R_16BF).
A new compute type, TensorFloat32 (TF32), has been added to provide tensor core acceleration for FP32 matrix multiplication routines with full dynamic range and increased precision compared to BFLOAT16.
New compute modes Default, Pedantic, and Fast have been introduced to offer more control over the compute precision used.
Tensor cores are now enabled by default for half- and mixed-precision matrix multiplications.
Double-precision tensor cores (DMMA) are used automatically.
Tensor cores can now be used for all sizes and data alignments and for all GPU architectures:
Selection of these kernels through cuBLAS heuristics is automatic and will depend on factors such as the math mode setting as well as whether it will run faster than the non-tensor-core kernels.
Users should note that these new kernels, which use tensor cores for all unaligned cases, are expected to perform faster than non-tensor-core based kernels, but slower than kernels that can be run when all buffers are well aligned.
Deprecated Features
Algorithm selection in the cublasGemmEx APIs (including batched variants) is non-functional for NVIDIA Ampere architecture GPUs. Regardless of the selection, it will default to a heuristic selection. Users are encouraged to use the cublasLt APIs for algorithm selection functionality.
The matrix multiply math mode CUBLAS_TENSOR_OP_MATH is being deprecated and will be removed in a future release. Users are encouraged to use the new cublasComputeType_t enumeration to define the compute precision.
2.2. cuFFT Library
2.2.1. cuFFT: Release 11.5
Known Issues
FFTs of certain sizes in single and double precision (multiples of size 6) could fail on future devices. This issue will be fixed in an upcoming release.
2.2.2. cuFFT: Release 11.4 Update 2
Resolved Issues
Since cuFFT 10.3.0 (CUDA Toolkit 11.1), cuFFT may require the user to make sure that all operations on input and output buffers are complete before calling cufft[Xt]Exec* if:
sm70 or later, 3D FFT, batch > 1, and the total size of the transform is greater than 4.5 MB
the FFT size for all dimensions is in the set of the following sizes: {2, 4, 8, 16, 32, 64, 128, 3, 9, 81, 243, 729, 2187, 6561, 5, 25, 125, 625, 3125, 6, 36, 216, 1296, 7776, 7, 49, 343, 2401, 11, 121}
Some V100 FFTs were slower than expected. This issue is resolved.
Known Issues
Some T4 FFTs are slower than expected.
Plans for FFTs of certain sizes in single precision (including some multiples of 1024, and some large prime numbers) could fail on future devices with less than 64 kB of shared memory. This issue will be fixed in an upcoming release.
2.2.3. cuFFT: Release 11.4 Update 1
Resolved Issues
Some cuFFT multi-GPU plans exhibited very long creation times.
cuFFT sometimes produced incorrect results for real-to-complex and complex-to-real transforms when the total number of elements across all batches in a single execution exceeded 2147483647.
Known Issues
Some V100 FFTs are slower than expected.
Some T4 FFTs are slower than expected.
2.2.4. cuFFT: Release 11.4
New Features
Performance improvements.
Known Issues
Some T4 FFTs are slower than expected.
cuFFT may produce incorrect results for real-to-complex and complex-to-real transforms when the total number of elements across all batches in a single execution exceeds 2147483647.
Some cuFFT multi-GPU plans may exhibit very long creation times. This issue will be fixed in the next update.
cuFFT may produce incorrect results for transforms with strides when the index of the last element across all batches exceeds 2147483647 (see Advanced Data Layout).
Deprecated Features
Support for callback functionality using separately compiled device code is deprecated on all GPU architectures. Callback functionality will continue to be supported for all GPU architectures.
2.2.5. cuFFT: Release 11.3
New Features
cuFFT shared libraries are now linked statically against libstdc++ on Linux platforms.
Improved performance of certain sizes (multiples of large powers of 3, powers of 11) on SM86.
Known Issues
cuFFT planning and plan estimation functions may not restore the correct context, affecting CUDA driver API applications.
Plans with strides, primes larger than 127 in the FFT size decomposition, and a total transform size (including strides) bigger than 32 GB produce incorrect results.
2.2.6. cuFFT: Release 11.2 Update 2
Known Issues
cuFFT planning and plan estimation functions may not restore the correct context, affecting CUDA driver API applications.
Plans with strides, primes larger than 127 in the FFT size decomposition, and a total transform size (including strides) bigger than 32 GB produce incorrect results.
2.2.7. cuFFT: Release 11.2 Update 1
Resolved Issues
Previously, reduced performance of power-of-2 single-precision FFTs was observed on GPUs with the sm_86 architecture. This issue has been resolved.
Large prime factors in the size decomposition and real-to-complex or complex-to-real FFT types no longer cause cuFFT plan functions to fail.
Known Issues
cuFFT planning and plan estimation functions may not restore the correct context, affecting CUDA driver API applications.
Plans with strides, primes larger than 127 in the FFT size decomposition, and a total transform size (including strides) bigger than 32 GB produce incorrect results.
2.2.8. cuFFT: Release 11.2
New Features
Multi-GPU plans can be associated with a stream using the cufftSetStream API function call.
Performance improvements for R2C/C2C/C2R transforms.
Performance improvements for multi-GPU systems.
Resolved Issues
cuFFT is no longer stuck in a bad state if previous plan creation fails with CUFFT_ALLOC_FAILED.
Previously, single-dimensional multi-GPU FFT plans ignored user input in the whichGPUs argument of cufftXtSetGPUs and assumed that GPU IDs are always numbered from 0 to N-1. This issue has been resolved.
Plans with primes larger than 127 in the FFT size decomposition, or an FFT size that is a prime number bigger than 4093, do not perform calculations on second and subsequent cufftExecute* calls. The regression was introduced in cuFFT 11.1.
Known Issues
cuFFT planning and plan estimation functions may not restore the correct context, affecting CUDA driver API applications.
2.2.9. cuFFT: Release 11.1
New Features
cuFFT is now L2-cache aware and uses the L2 cache for GPUs with more than 4.5 MB of L2 cache. Performance may improve in certain single-GPU 3D C2C FFT cases.
After successfully creating a plan, cuFFT now enforces a lock on the cufftHandle. Subsequent calls to any planning function with the same cufftHandle will fail.
Added support for very large sizes (3k cube) to multi-GPU cuFFT on DGX-2.
Improved performance of multi-GPU cuFFT for certain sizes (1k cube).
Resolved Issues
Resolved an issue that caused cuFFT to crash when reusing a handle after clearing a callback.
Fixed an error which produced incorrect results/NaN values when running a real-to-complex FFT in half precision.
Known Issues
cuFFT will always overwrite the input for out-of-place C2R transforms.
Single-dimensional multi-GPU FFT plans ignore user input in the whichGPUs parameter of cufftXtSetGPUs() and assume that GPU IDs are always numbered from 0 to N-1.
2.2.10. cuFFT: Release 11.0 RC
New Features
cuFFT now accepts the __nv_bfloat16 input and output data type for power-of-two sizes, with single-precision computations within the kernels.
Reoptimized power-of-2 FFT kernels on Volta and Turing architectures.
Resolved Issues
Reduced R2C/C2R plan memory usage to previous levels.
Resolved a bug introduced in 10.1 update 1 that caused incorrect results when using custom strides, batched 2D plans, and certain sizes on Volta and later.
Known Issues
cuFFT modifies the C2R input buffer for some non-strided FFT plans.
There is a known issue with certain cuFFT plans that causes an assertion in the execution phase of certain plans. This applies to plans with all of the following characteristics: real input to complex output (R2C), in-place, native compatibility mode, certain even transform sizes, and more than one batch.
2.3. cuRAND Library
2.3.1. cuRAND: Release 11.5 Update 1
New Features
Improved performance of the CURAND_RNG_PSEUDO_MRG32K3A pseudorandom number generator when using the ordering CURAND_ORDERING_PSEUDO_BEST or CURAND_ORDERING_PSEUDO_DEFAULT.
Added a new type of order parameter: CURAND_ORDERING_PSEUDO_DYNAMIC.
Supported PRNGs:
CURAND_RNG_PSEUDO_XORWOW
CURAND_RNG_PSEUDO_MRG32K3A
CURAND_RNG_PSEUDO_MTGP32
CURAND_RNG_PSEUDO_PHILOX4_32_10
Improved performance compared to CURAND_ORDERING_PSEUDO_DEFAULT, especially on NVIDIA Ampere architecture GPUs.
The output ordering of generated random numbers for CURAND_ORDERING_PSEUDO_DYNAMIC depends on the number of SMs on a GPU, and thus can be different on different GPUs.
The CURAND_ORDERING_PSEUDO_DYNAMIC ordering can't be used with a host generator created using curandCreateGeneratorHost().
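A minimal sketch of requesting the new ordering on a supported device generator (error handling reduced to status returns):

#include <curand.h>

/* Create a device XORWOW generator with the new dynamic ordering
   (not usable with curandCreateGeneratorHost()). */
curandStatus_t make_dynamic_xorwow(curandGenerator_t *gen) {
    curandStatus_t st = curandCreateGenerator(gen, CURAND_RNG_PSEUDO_XORWOW);
    if (st != CURAND_STATUS_SUCCESS) return st;
    return curandSetGeneratorOrdering(*gen, CURAND_ORDERING_PSEUDO_DYNAMIC);
    /* ... then, e.g., curandGenerateUniform(*gen, d_out, n); ... */
}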
Resolved Issues
Added information about cuRAND thread safety.
Known Issues
CURAND_RNG_PSEUDO_XORWOW with ordering CURAND_ORDERING_PSEUDO_DYNAMIC can produce incorrect results on architectures newer than SM86.
2.3.2. cuRAND: Release 11.3
Resolved Issues
Fixed an inconsistency between random numbers generated by GPU and host generators when the CURAND_ORDERING_PSEUDO_LEGACY ordering is selected for certain generator types.
2.3.3. cuRAND: Release 11.0 Update 1
Resolved Issues
Fixed an issue that caused linker errors about multiple definitions of mtgp32dc_params_fast_11213 and mtgpdc_params_11213_num when including curand_mtgp32dc_p_11213.h in different compilation units.
2.3.4. cuRAND: Release 11.0
Resolved Issues
Fixed an issue that caused linker errors about multiple definitions of mtgp32dc_params_fast_11213 and mtgpdc_params_11213_num when including curand_mtgp32dc_p_11213.h in different compilation units.
2.3.5. cuRAND: Release 11.0 RC
Resolved Issues
Introduced the CURAND_ORDERING_PSEUDO_LEGACY ordering. Starting with CUDA 10.0, the ordering of random numbers returned by the MTGP32 and MRG32k3a generators is no longer the same as in previous releases, despite being guaranteed by the documentation for the CURAND_ORDERING_PSEUDO_DEFAULT setting. CURAND_ORDERING_PSEUDO_LEGACY provides pre-CUDA 10.0 ordering for the MTGP32 and MRG32k3a generators.
Starting with CUDA 11.0, CURAND_ORDERING_PSEUDO_DEFAULT is the same as CURAND_ORDERING_PSEUDO_BEST for all generators except MT19937. Only CURAND_ORDERING_PSEUDO_LEGACY is guaranteed to provide the same ordering for all future cuRAND releases.
2.4. cuSOLVER Library
2.4.1. cuSOLVER: Release 11.4
New Features
Introducing cusolverDnXtrtri, a new generic API for triangular matrix inversion (trtri).
Introducing cusolverDnXsytrs, a new generic API for solving systems of linear equations using a given factorized symmetric matrix from SYTRF.
2.4.2. cuSOLVER: Release 11.3
Known Issues
For values N <= 16, cusolverDn[S|D|C|Z]syevjBatched hits an out-of-bounds access and may deliver the wrong result. The workaround is to pad the matrix A with a diagonal matrix D such that the dimension of [A 0; 0 D] is bigger than 16. The diagonal entry D(j,j) must be bigger than the maximum eigenvalue of A, for example, norm(A, 'fro'). After the syevj, W(0:n-1) contains the eigenvalues and A(0:n-1, 0:n-1) contains the eigenvectors.
2.4.3. cuSOLVER: Release 11.2 Update 2
New Features
A new singular value decomposition (GESVDR) is added. GESVDR computes a partial spectrum with random sampling, an order of magnitude faster than GESVD.
libcusolver.so no longer links libcublas_static.a; instead, it depends on libcublas.so. This reduces the binary size of libcusolver.so. However, it breaks backward compatibility. The user has to link libcusolver.so with the correct version of libcublas.so.
2.4.4. cuSOLVER: Release 11.2
Resolved Issues
cusolverDnIRSXgels sometimes returned CUSOLVER_STATUS_INTERNAL_ERROR when the precision was 'z'. This issue has been fixed in CUDA 11.2; now cusolverDnIRSXgels works for all precisions.
ZSYTRF sometimes returned CUSOLVER_STATUS_INTERNAL_ERROR due to insufficient resources to launch the kernel. This issue has been fixed in CUDA 11.2.
GETRF returned early without finishing the whole factorization when the matrix was singular. This issue has been fixed in CUDA 11.2.
2.4.5. cuSOLVER: Release 11.1 Update 1
Resolved Issues
cusolverDnDDgels reported IRS_NOT_SUPPORTED when m > n. The issue has been fixed in release 11.1 U1, so cusolverDnDDgels will support m > n.
cusolverMgDeviceSelect could consume over 1 GB of device memory. The issue has been fixed in release 11.1 U1. The hidden memory allocation inside the cusolverMG handle is about 30 MB per device.
Known Issues
cusolverDnIRSXgels may return CUSOLVER_STATUS_INTERNAL_ERROR when the precision is 'z', due to insufficient workspace, which causes illegal memory access.
cusolverDnIRSXgels_bufferSize() does not report the correct size of the workspace. To work around the issue, the user has to allocate more workspace than what is reported by cusolverDnIRSXgels_bufferSize(). For example, if x is the size of workspace returned by cusolverDnIRSXgels_bufferSize(), then the user has to allocate (x + min(m,n)*sizeof(cuDoubleComplex)) bytes.
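A minimal sketch of the padding arithmetic from this workaround (the function and variable names are illustrative):

#include <cuda_runtime.h>
#include <cuComplex.h>
#include <stdint.h>

/* Pad the workspace reported by cusolverDnIRSXgels_bufferSize() by
   min(m,n) cuDoubleComplex elements, as the workaround requires. */
void *alloc_irs_workspace(size_t reported, int64_t m, int64_t n) {
    size_t mn = (size_t)((m < n) ? m : n);
    void *d_work = NULL;
    cudaMalloc(&d_work, reported + mn * sizeof(cuDoubleComplex));
    return d_work;
}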
2.4.6. cuSOLVER: Release 11.1
New Features
Added new 64-bit APIs:
cusolverDnXpotrf_bufferSize
cusolverDnXpotrf
cusolverDnXpotrs
cusolverDnXgeqrf_bufferSize
cusolverDnXgeqrf
cusolverDnXgetrf_bufferSize
cusolverDnXgetrf
cusolverDnXgetrs
cusolverDnXsyevd_bufferSize
cusolverDnXsyevd
cusolverDnXsyevdx_bufferSize
cusolverDnXsyevdx
cusolverDnXgesvd_bufferSize
cusolverDnXgesvd
Added a new SVD algorithm based on polar decomposition, called GESVDP, which uses the new 64-bit API, including cusolverDnXgesvdp_bufferSize and cusolverDnXgesvdp.
Deprecated Features
The following 64-bit APIs are deprecated:
cusolverDnPotrf_bufferSize
cusolverDnPotrf
cusolverDnPotrs
cusolverDnGeqrf_bufferSize
cusolverDnGeqrf
cusolverDnGetrf_bufferSize
cusolverDnGetrf
cusolverDnGetrs
cusolverDnSyevd_bufferSize
cusolverDnSyevd
cusolverDnSyevdx_bufferSize
cusolverDnSyevdx
cusolverDnGesvd_bufferSize
cusolverDnGesvd
2.4.7. cuSOLVER: Release 11.0
New Features
Added a 64-bit API for GESVD. The new routine cusolverDnGesvd_bufferSize() fills the missing parameters in the 32-bit API cusolverDn[S|D|C|Z]gesvd_bufferSize() such that it can estimate the size of the workspace accurately.
Added the single-process multi-GPU Cholesky factorization capabilities POTRF, POTRS, and POTRI to the cusolverMG library.
Resolved Issues
Fixed an issue where SYEVD/SYGVD would fail and return error code 7 if the matrix is zero and the dimension is bigger than 25.
2.5. cuSPARSE Library
2.5.1. cuSPARSE: Release 11.6 Update 1
New Features
Improved CSR cusparseSpMM Alg1 for column-major layout:
Better performance
Support for batched computation, custom row/column-major layout for B/C, and mixed-precision computation
Improved COO cusparseSpMM Alg3 with support for batched computation, custom row/column-major layout for B/C, and mixed-precision computation.
Improved mixed-precision computation of CSR/COO cusparseSpMV.
Added CSC format support for cusparseSpMV and cusparseSpMM.
Better error handling for JIT LTO cusparseSpMMOp.
cusparseSpMM now supports batches of sparse matrices.
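A minimal sketch of the newly supported CSC path through the generic SpMV API, computing y = A*x (error handling omitted; the device buffers are assumed to be allocated and populated):

#include <cuda_runtime.h>
#include <cusparse.h>

void spmv_csc(cusparseHandle_t h, int64_t rows, int64_t cols, int64_t nnz,
              int *dColOffsets, int *dRowInd, float *dVals,
              float *dX, float *dY) {
    cusparseSpMatDescr_t A;
    cusparseDnVecDescr_t x, y;
    float alpha = 1.0f, beta = 0.0f;

    // Describe A in the CSC layout that cusparseSpMV now accepts.
    cusparseCreateCsc(&A, rows, cols, nnz, dColOffsets, dRowInd, dVals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&x, cols, dX, CUDA_R_32F);
    cusparseCreateDnVec(&y, rows, dY, CUDA_R_32F);

    size_t bufSize = 0;
    void *dBuf = NULL;
    cusparseSpMV_bufferSize(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, A, x,
                            &beta, y, CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT,
                            &bufSize);
    cudaMalloc(&dBuf, bufSize);
    cusparseSpMV(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, A, x,
                 &beta, y, CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, dBuf);

    cudaFree(dBuf);
    cusparseDestroySpMat(A);
    cusparseDestroyDnVec(x);
    cusparseDestroyDnVec(y);
}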
Resolved Issues
cusparseDenseToSparse produced wrong results when the input matrix contained the floating-point value -0.0.
std::locale is no longer modified by cuSPARSE during initialization.
Added a note in the documentation of cusparseSpMMOp to report that the routine is not compatible with older CUDA driver versions and Android platforms.
Known Issues
cusparseSpSV and cusparseSpSM could produce wrong results if the output vector/matrix is not zero-initialized.
2.5.2. cuSPARSE: Release 11.6
New Features
Better performance for the cusparseSpGEMM, cusparseSpGEMMreuse, cusparseCsr2cscEx2, and cusparseDenseToSparse routines.
Resolved Issues
Fixed forward compatibility issues with axpby, rot, spvv, scatter, gather.
Fixed incorrect results in COO SpMM Alg1 which occurred in some rare cases.
2.5.3. cuSPARSE: Release 11.5 Update 1
New Features
New routine cusparseSpMMOp that exploits Just-In-Time Link-Time Optimization (JIT LTO) for providing sparse matrix-dense matrix multiplication with custom (user-defined) operators. See https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm-op.
cuSPARSE now supports logging functionalities. See https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-logging.
Resolved Issues
Added memory requirements, graph capture, and asynchronous notes for cusparseXcsrsm2_analysis.
CSR, CSC, and COO format descriptions wrongly reported a sorted column indices requirement. All routines support unsorted column indices, except where strictly indicated.
Clarified cusparseSpSV and cusparseSpSM memory management.
cusparseSpSM produced wrong results in some cases when the matB operation is CUSPARSE_OPERATION_NON_TRANSPOSE or CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE.
cusparseSpSM produced wrong results in some cases when the matrix layout is row-major.
2.5.4. cuSPARSE: Release 11.4 Update 1
Resolved Issues
cusparseSpSV and cusparseSpSM could produce wrong results.
cusparseSpSV and cusparseSpSM did not work correctly when vecX == vecY or matB == matC.
2.5.5. cuSPARSE: Release 11.4
Known Issues
cusparseSpSV and cusparseSpSM could produce wrong results.
cusparseSpSV and cusparseSpSM do not work correctly when vecX == vecY or matB == matC.
2.5.6. cuSPARSE: Release 11.3 Update 1
New Features
Introduced a new routine for sparse matrix-sparse matrix multiplication (cusparseSpGEMMreuse) where the output matrix structure is reused for multiple computations. The new routine supports the CSR storage format and mixed-precision computation.
The sparse triangular solver adds support for the COO format.
Introduced a new routine for a sparse triangular solver with multiple right-hand sides, cusparseSpSM().
The cusparseDenseToSparse() routine adds conversion from dense matrices (row-major/column-major) to the Blocked-ELL format.
The Blocked-ELL format now supports empty blocks.
Better performance for Blocked-ELL SpMM with block size > 64, double data type, and alignments smaller than 128 bytes on NVIDIA Ampere sm80.
All cuSPARSE APIs are now asynchronous on platforms that support stream-ordered memory allocators (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-ordered-querying-memory-support).
Improved NVTX trace with a distinction between light calls and kernel routines.
Resolved Issues
cusparseCnnz_compress produced wrong results when the number of rows is greater than 128 * resident CTAs.
cusparseSnnz produced wrong results for some particular sparsity patterns.
Deprecated Features
cusparseXcsrsm2_zeroPivot, cusparseXcsrsm2_solve, cusparseXcsrsm2_analysis, and cusparseScsrsm2_bufferSizeExt have been deprecated in favor of the cusparseSpSM generic APIs.
2.5.7. cuSPARSE: Release 11.3
New Features
Added a new routine, cusparseSpSV, for a sparse triangular solver with better performance. The new generic API supports:
CSR storage format
Non-transpose, transpose, and transpose-conjugate operations
Upper and lower fill modes
Unit and non-unit diagonal types
32-bit and 64-bit indices
Uniform data type computation
Deprecated Features
cusparseScsrsv2_analysis, cusparseScsrsv2_solve, cusparseXcsrsv2_zeroPivot, and cusparseScsrsv2_bufferSize have been deprecated in favor of cusparseSpSV.
2.5.8. cuSPARSE: Release 11.2 Update 2
Resolved Issues
cusparseDestroy(NULL) no longer crashes on Windows.
Known Issues
cusparseDestroySpVec, cusparseDestroyDnVec, cusparseDestroySpMat, cusparseDestroyDnMat, and cusparseDestroy with a NULL argument could cause a segmentation fault on Windows.
2.5.9. cuSPARSE: Release 11.2 Update 1
New Features
New Tensor-Core-accelerated Block Sparse Matrix-Matrix Multiplication (cusparseSpMM) and introduction of the Blocked-Ellpack storage format.
New algorithms for CSR/COO Sparse Matrix-Vector Multiplication (cusparseSpMV) with better performance.
Extended functionalities for cusparseSpMV:
Support for the CSC format.
Support for regular/complex bfloat16 data types for both uniform and mixed-precision computation.
Support for mixed regular-complex data type computation.
Support for deterministic and non-deterministic computation.
New algorithm (CUSPARSE_SPMM_CSR_ALG3) for Sparse Matrix-Matrix Multiplication (cusparseSpMM) with better performance, especially for small matrices.
New routine for Sampled Dense Matrix-Dense Matrix Multiplication (cusparseSDDMM), which deprecates cusparseConstrainedGeMM and provides better performance.
Better accuracy of cusparseAxpby, cusparseRot, and cusparseSpVV for bfloat16 and half regular/complex data types.
All routines support NVTX annotation for enhancing the profiler timeline on complex applications.
Resolved Issues
cusparseAxpby, cusparseGather, cusparseScatter, cusparseRot, cusparseSpVV, and cusparseSpMV now support zero-size matrices.
cusparseCsr2cscEx2 now correctly handles empty matrices (nnz = 0).
cusparseXcsr2csr_compress now uses the 2-norm for the comparison of complex values instead of only the real part.
Known Issues
cusparseDestroySpVec, cusparseDestroyDnVec, cusparseDestroySpMat, cusparseDestroyDnMat, and cusparseDestroy with a NULL argument could cause a segmentation fault on Windows.
Deprecated Features
cusparseConstrainedGeMM has been deprecated in favor of cusparseSDDMM.
cusparseCsrmvEx has been deprecated in favor of cusparseSpMV.
The COO Array of Structures (CooAoS) format has been deprecated, including cusparseCreateCooAoS, cusparseCooAoSGet, and its support for cusparseSpMV.
2.5.10. cuSPARSE: Release 11.2
Known Issues
cusparseXdense2csr provides incorrect results for some matrix sizes.
2.5.11. cuSPARSE: Release 11.1 Update 1
New Features
cusparseSparseToDense
Conversion from CSR, CSC, or COO to a dense representation
Supports row-major and column-major layouts
Supports all data types
Supports 32-bit and 64-bit indices
Provides performance 3x higher than cusparseXcsc2dense, cusparseXcsr2dense
cusparseDenseToSparse
Conversion from a dense representation to CSR, CSC, or COO
Supports row-major and column-major layouts
Supports all data types
Supports 32-bit and 64-bit indices
Provides performance 3x higher than cusparseXdense2csc, cusparseXdense2csr
Known Issues
cusparseXdense2csr provides incorrect results for some matrix sizes.
Deprecated Features
Legacy conversion routines: cusparseXcsc2dense, cusparseXcsr2dense, cusparseXdense2csc, cusparseXdense2csr
2.5.12. cuSPARSE: Release 11.0
New Features
Added new Generic APIs for Axpby (cusparseAxpby), Scatter (cusparseScatter), Gather (cusparseGather), and Givens rotation (cusparseRot). The __nv_bfloat16/__nv_bfloat162 data types and 64-bit indices are also supported.
This release adds the following features for cusparseSpMM:
Support for row-major layout for cusparseSpMM for both CSR and COO formats
Support for 64-bit indices
Support for __nv_bfloat16 and __nv_bfloat162 data types
Support for the following strided batch modes:
C_i = A ⋅ B_i
C_i = A_i ⋅ B
C_i = A_i ⋅ B_i
2.5.13. cuSPARSE: Release 11.0 RC
New Features
Added new Generic APIs for Axpby (cusparseAxpby), Scatter (cusparseScatter), Gather (cusparseGather), and Givens rotation (cusparseRot). The __nv_bfloat16/__nv_bfloat162 data types and 64-bit indices are also supported.
This release adds the following features for cusparseSpMM:
Support for row-major layout for cusparseSpMM for both CSR and COO formats
Support for 64-bit indices
Support for __nv_bfloat16 and __nv_bfloat162 data types
Support for the following strided batch modes:
C_i = A ⋅ B_i
C_i = A_i ⋅ B
C_i = A_i ⋅ B_i
Added new generic APIs and improved performance for sparse matrix-sparse matrix multiplication (SpGEMM): cusparseSpGEMM_workEstimation, cusparseSpGEMM_compute, and cusparseSpGEMM_copy.
SpVV: added support for __nv_bfloat16.
Deprecated Features
The following functions have been removed:
cusparsegemmi()
cusparseXaxpyi, cusparseXgthr, cusparseXgthrz, cusparseXroti, cusparseXsctr
Hybrid format enums and helper functions: cusparseHybPartition_t, cusparseCreateHybMat, cusparseDestroyHybMat
Triangular solver enums and helper functions: cusparseSolveAnalysisInfo_t, cusparseCreateSolveAnalysisInfo, cusparseDestroySolveAnalysisInfo
Sparse dot product: cusparseXdoti, cusparseXdotci
Sparse matrix-vector multiplication: cusparseXcsrmv, cusparseXcsrmv_mp
Sparse matrix-matrix multiplication: cusparseXcsrmm, cusparseXcsrmm2
Sparse triangular-single vector solver: cusparseXcsrsv_analysis, cusparseCsrsv_analysisEx, cusparseXcsrsv_solve, cusparseCsrsv_solveEx
Sparse triangular-multiple vectors solver: cusparseXcsrsm_analysis, cusparseXcsrsm_solve
Sparse hybrid format solver: cusparseXhybsv_analysis, cusparseShybsv_solve
Extra functions: cusparseXcsrgeamNnz, cusparseScsrgeam, cusparseXcsrgemmNnz, cusparseXcsrgemm
Incomplete Cholesky Factorization, level 0: cusparseXcsric0
Incomplete LU Factorization, level 0: cusparseXcsrilu0, cusparseCsrilu0Ex
Tridiagonal Solver: cusparseXgtsv, cusparseXgtsv_nopivot
Batched Tridiagonal Solver: cusparseXgtsvStridedBatch
Reordering: cusparseXcsc2hyb, cusparseXcsr2hyb, cusparseXdense2hyb, cusparseXhyb2csc, cusparseXhyb2csr, cusparseXhyb2dense
The following functions have been deprecated:
SpGEMM: cusparseXcsrgemm2_bufferSizeExt, cusparseXcsrgemm2Nnz, cusparseXcsrgemm2
2.6. Math Library
2.6.1. CUDA Math: Release 11.6
New Features
New half and bfloat16 APIs for addition/multiplication in round-to-nearest-even mode that do not get contracted into an fma instruction. Please see __hadd_rn, __hsub_rn, __hmul_rn, __hadd2_rn, __hsub2_rn, and __hmul2_rn in https://docs.nvidia.com/cuda/cuda-math-api/index.html.
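A minimal device-code sketch of the non-contracting variants (the kernel and values are illustrative):

#include <cuda_fp16.h>
#include <cstdio>

// The _rn variants stay as a separate multiply and a separate add in
// round-to-nearest-even mode; they are never contracted into a fused a*b+c.
__global__ void no_fma(__half a, __half b, __half c, __half *out) {
    __half p = __hmul_rn(a, b);
    *out = __hadd_rn(p, c);
}

int main() {
    __half *out;
    cudaMallocManaged(&out, sizeof(__half));
    no_fma<<<1, 1>>>(__float2half(1.5f), __float2half(2.0f),
                     __float2half(0.25f), out);
    cudaDeviceSynchronize();
    printf("%f\n", __half2float(*out));   // prints 3.25
    return 0;
}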
2.6.2. CUDA Math: Release 11.5
Deprecations
The following undocumented CUDA Math APIs are deprecated and will be removed in a future release. Please consider switching to the similar intrinsic APIs documented here: https://docs.nvidia.com/cuda/cuda-math-api/index.html
__device__ int mulhi(const int a, const int b)
__device__ unsigned int mulhi(const unsigned int a, const unsigned int b)
__device__ unsigned int mulhi(const int a, const unsigned int b)
__device__ unsigned int mulhi(const unsigned int a, const int b)
__device__ long long int mul64hi(const long long int a, const long long int b)
__device__ unsigned long long int mul64hi(const unsigned long long int a, const unsigned long long int b)
__device__ unsigned long long int mul64hi(const long long int a, const unsigned long long int b)
__device__ unsigned long long int mul64hi(const unsigned long long int a, const long long int b)
__device__ int float_as_int(const float a)
__device__ float int_as_float(const int a)
__device__ unsigned int float_as_uint(const float a)
__device__ float uint_as_float(const unsigned int a)
__device__ float saturate(const float a)
__device__ int mul24(const int a, const int b)
__device__ unsigned int umul24(const unsigned int a, const unsigned int b)
__device__ int float2int(const float a, const enum cudaRoundMode mode = cudaRoundZero)
__device__ unsigned int float2uint(const float a, const enum cudaRoundMode mode = cudaRoundZero)
__device__ float int2float(const int a, const enum cudaRoundMode mode = cudaRoundNearest)
__device__ float uint2float(const unsigned int a, const enum cudaRoundMode mode = cudaRoundNearest)
2.6.3. CUDA Math: Release 11.4
Beginning in 2022, the NVIDIA Math Libraries official hardware support will follow an N-2 policy, where N is an x100 series GPU.
2.6.4. CUDA Math: Release 11.3
Resolved Issues
Previous releases of CUDA were potentially delivering incorrect results in some Linux distributions for the following host Math APIs: sinpi, cospi, sincospi, sinpif, cospif, sincospif. If passed huge inputs like 7.3748776e+15 or 8258177.5, the results were not equal to 0 or 1. These have been corrected with this release.
2.6.5. CUDA Math: Release 11.1
New Features
Added host support for half and nv_bfloat16 conversions to/from integer types.
Added the __hcmadd() device-only API for fast half2- and nv_bfloat162-based complex multiply-accumulate.
2.6.6. CUDA Math: Release 11.0 Update 1
Resolved Issues
nv_bfloat16 comparison functions could trigger a fault with misaligned addresses.
Performance improvements in half and nv_bfloat16 basic arithmetic implementations.
2.6.7. CUDA Math: Release 11.0 RC
New Features
Added arithmetic support for the __nv_bfloat16 floating-point data type with 8 bits of exponent and 7 explicit bits of mantissa.
Performance and accuracy improvements in single-precision math functions: fmodf, expf, exp10f, sinhf, and coshf.
Resolved Issues
Corrected the documented maximum ulp error thresholds in erfcinvf and powf.
Improved cuda_fp16.h interoperability with the Visual Studio C++ compiler.
Updated the libdevice user guide and CUDA math API definitions for the j1, j1f, fmod, fmodf, ilogb, and ilogbf math functions.
2.7. NVIDIA Performance Primitives (NPP)
2.7.1. NPP: Release 11.5
New Features
New APIs added to compute the Signed Anti-aliased Distance Transform using PBA, the anti-aliased Euclidean distance between pixel sites in images. This will improve the accuracy of the distance transform.
nppiSignedDistanceTransformAbsPBA_xxxxx_C1R_Ctx() -- input and output combinations supported (xxxxx): 32f, 32f64f, 64f
New API for the Absolute Manhattan Distance Transform; another method to improve the accuracy of the distance transform, using the Manhattan distance transform between pixels.
nppiDistanceTransformAbsPBA_xxxxx_C1R_Ctx() -- input and output combinations supported (xxxxx): 8u16u, 8s16u, 16u16u, 16s16u, 8u32f, 8s32f, 16u32f, 16s32f, 8u64f, 8s64f, 16u64f, 16s64f, 32f64f, 64f
Resolved Issues
Fixed an issue in the FilterMedian() API by adding interpolation when the mask size is even.
Improved Contour function performance by parallelizing more of it and also improving quality.
Resolved an issue with Alpha composition being used to accumulate output buffers multiple times.
Resolved an issue with nppiLabelMarkersUF_8u32u_C1R producing incorrect results during column processing.
2.7.2. NPP: Release 11.4
New Features
New API FindContours. FindContours can be explained simply as a curve joining all the continuous points (along the boundary) having the same color or intensity. Contours are a useful tool for shape analysis and object detection and recognition.
2.7.3. NPP: Release 11.3
New Features
Added nppiDistanceTransformPBA functions.
2.7.4. NPP: Release 11.2 Update 2
New Features
Added nppiDistanceTransformPBA functions.
2.7.5. NPP: Release 11.2 Update 1
New Features
New APIs added to compute the Distance Transform using the Parallel Banding Algorithm (PBA):
nppiDistanceTransformPBA_xxxxx_C1R_Ctx() -- where xxxxx specifies the input and output combination: 8u16u, 8s16u, 16u16u, 16s16u, 8u32f, 8s32f, 16u32f, 16s32f
nppiSignedDistanceTransformPBA_32f_C1R_Ctx()
Resolved Issues
Fixed the issue in which LabelMarkers adds a zero pixel as an object region.
2.7.6. NPP: Release 11.0
New Features
Batched Image Label Markers Compression that removes sparseness between marker label IDs output from the LabelMarkers call.
Image Flood Fill functionality fills a connected region of an image with a specified new value.
Stability and performance fixes to Image Label Markers and Image Label Markers Compression.
2.7.7. NPP: Release 11.0 RC
New Features
Batched Image Label Markers Compression that removes sparseness between marker label IDs output from the LabelMarkers call.
Image Flood Fill functionality fills a connected region of an image with a specified new value.
Added batching support for nppiLabelMarkersUF functions.
Added the nppiCompressMarkerLabelsUF_32u_C1IR function.
Added nppiSegmentWatershed functions.
Added sample apps on GitHub demonstrating the use of NPP application managed stream contexts along with watershed segmentation and batched and compressed UF image label markers functions.
Added support for non-blocking streams.
Resolved Issues
Stability and performance fixes to Image Label Markers and Image Label Markers Compression.
Improved quality of nppiLabelMarkersUF functions.
nppiCompressMarkerLabelsUF_32u_C1IR can now handle a huge number of labels generated by the nppiLabelMarkersUF function.
Known Issues
The nppiCopy API is limited by the CUDA thread count for large image sizes. The maximum image limits are a minimum of 16 * 65,535 = 1,048,560 horizontal pixels of any data type and number of channels, and 8 * 65,535 = 524,280 vertical pixels, for a maximum total of 549,739,036,800 pixels.
2.8. nvJPEG Library
2.8.1. nvJPEG: Release 11.5 Update 1
Resolved Issues
Fixed the issue in which nvcuvid() released uncompressed frames, causing a memory leak.
2.8.2. nvJPEG: Release 11.4
Resolved Issues
Additional subsampling added to solve the NVJPEG_CSS_2x4 issue.
2.8.3. nvJPEG: Release 11.2 Update 1
New Features
The nvJPEG decoder added new APIs to support region-of-interest (ROI) based decoding for the batched hardware decoder:
nvjpegDecodeBatchedEx()
nvjpegDecodeBatchedSupportedEx()
2.8.4. nvJPEG: Release 11.1 Update 1
New Features
Added error handling capabilities for nonstandard JPEG images.
2.8.5. nvJPEG: Release 11.0 Update 1
Known Issues
NVJPEG_BACKEND_GPU_HYBRID has an issue when handling bit-streams which have corruption in the scan.
2.8.6. nvJPEG: Release 11.0
New Features
nvJPEG allows the user to allocate separate memory pools for each chroma subsampling format. This helps avoid memory re-allocation overhead. This can be controlled by passing the newly added flag NVJPEG_FLAGS_ENABLE_MEMORY_POOLS to the nvjpegCreateEx API.
The nvJPEG encoder now allows the compressed bitstream to reside in GPU memory.
2.8.7. nvJPEG: Release 11.0 RC
New Features
nvJPEG allows the user to allocate separate memory pools for each chroma subsampling format. This helps avoid memory re-allocation overhead. This can be controlled by passing the newly added flag NVJPEG_FLAGS_ENABLE_MEMORY_POOLS to the nvjpegCreateEx API.
The nvJPEG encoder now allows the compressed bitstream to reside in GPU memory.
Hardware-accelerated decode is now supported on NVIDIA A100.
The nvJPEG decode API (nvjpegDecodeJpeg()) now has the flexibility to select the backend when creating the nvjpegJpegDecoder_t object. The user has the option to call this API instead of making three separate calls to nvjpegDecodeJpegHost(), nvjpegDecodeJpegTransferToDevice(), and nvjpegDecodeJpegDevice().
Known Issues
NVJPEG_BACKEND_GPU_HYBRID has an issue when handling bit-streams which have corruption in the scan.
Deprecated Features
The following multiphase APIs have been removed:
nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseOne
nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseTwo
nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseThree
nvjpegStatus_t NVJPEGAPI nvjpegDecodeBatchedPhaseOne
nvjpegStatus_t NVJPEGAPI nvjpegDecodeBatchedPhaseTwo
Notices
Notice
This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation ("NVIDIA") makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer ("Terms of Sale"). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer's own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer's sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer's product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA's aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.
OpenCL
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.
Copyright
© 2007-2022 NVIDIA Corporation & affiliates. All rights reserved.
This product includes software developed by the Syncro Soft SRL (http://www.sync.ro/).
¹ Only available on select Linux distros.