1. CUDA 11.6 Release Notes
1.1. CUDA Toolkit Major Component Versions
1.2. General CUDA
1.3. CUDA Compilers
1.4. CUDA Developer Tools
1.5. Resolved Issues
1.5.1. CUDA Compilers
1.6. Deprecated Features
1.7. Known Issues
1.7.1. General CUDA
1.7.2. CUDA Compiler
2. CUDA Libraries
2.1. cuBLAS Library
2.1.1. cuBLAS: Release 11.6
2.1.2. cuBLAS: Release 11.4 Update 3
2.1.3. cuBLAS: Release 11.4 Update 2
2.1.4. cuBLAS: Release 11.4
2.1.5. cuBLAS: Release 11.3 Update 1
2.1.6. cuBLAS: Release 11.3
2.1.7. cuBLAS: Release 11.2
2.1.8. cuBLAS: Release 11.1 Update 1
2.1.9. cuBLAS: Release 11.1
2.1.10. cuBLAS: Release 11.0 Update 1
2.1.11. cuBLAS: Release 11.0
2.1.12. cuBLAS: Release 11.0 RC
2.2. cuFFT Library
2.2.1. cuFFT: Release 11.5
2.2.2. cuFFT: Release 11.4 Update 2
2.2.3. cuFFT: Release 11.4 Update 1
2.2.4. cuFFT: Release 11.4
2.2.5. cuFFT: Release 11.3
2.2.6. cuFFT: Release 11.2 Update 2
2.2.7. cuFFT: Release 11.2 Update 1
2.2.8. cuFFT: Release 11.2
2.2.9. cuFFT: Release 11.1
2.2.10. cuFFT: Release 11.0 RC
2.3. cuRAND Library
2.3.1. cuRAND: Release 11.5 Update 1
2.3.2. cuRAND: Release 11.3
2.3.3. cuRAND: Release 11.0 Update 1
2.3.4. cuRAND: Release 11.0
2.3.5. cuRAND: Release 11.0 RC
2.4. cuSOLVER Library
2.4.1. cuSOLVER: Release 11.4
2.4.2. cuSOLVER: Release 11.3
2.4.3. cuSOLVER: Release 11.2 Update 2
2.4.4. cuSOLVER: Release 11.2
2.4.5. cuSOLVER: Release 11.1 Update 1
2.4.6. cuSOLVER: Release 11.1
2.4.7. cuSOLVER: Release 11.0
2.5. cuSPARSE Library
2.5.1. cuSPARSE: Release 11.6 Update 1
2.5.2. cuSPARSE: Release 11.6
2.5.3. cuSPARSE: Release 11.5 Update 1
2.5.4. cuSPARSE: Release 11.4 Update 1
2.5.5. cuSPARSE: Release 11.4
2.5.6. cuSPARSE: Release 11.3 Update 1
2.5.7. cuSPARSE: Release 11.3
2.5.8. cuSPARSE: Release 11.2 Update 2
2.5.9. cuSPARSE: Release 11.2 Update 1
2.5.10. cuSPARSE: Release 11.2
2.5.11. cuSPARSE: Release 11.1 Update 1
2.5.12. cuSPARSE: Release 11.0
2.5.13. cuSPARSE: Release 11.0 RC
2.6. Math Library
2.6.1. CUDA Math: Release 11.6
2.6.2. CUDA Math: Release 11.5
2.6.3. CUDA Math: Release 11.4
2.6.4. CUDA Math: Release 11.3
2.6.5. CUDA Math: Release 11.1
2.6.6. CUDA Math: Release 11.0 Update 1
2.6.7. CUDA Math: Release 11.0 RC
2.7. NVIDIA Performance Primitives (NPP)
2.7.1. NPP: Release 11.5
2.7.2. NPP: Release 11.4
2.7.3. NPP: Release 11.3
2.7.4. NPP: Release 11.2 Update 2
2.7.5. NPP: Release 11.2 Update 1
2.7.6. NPP: Release 11.0
2.7.7. NPP: Release 11.0 RC
2.8. nvJPEG Library
2.8.1. nvJPEG: Release 11.5 Update 1
2.8.2. nvJPEG: Release 11.4
2.8.3. nvJPEG: Release 11.2 Update 1
2.8.4. nvJPEG: Release 11.1 Update 1
2.8.5. nvJPEG: Release 11.0 Update 1
2.8.6. nvJPEG: Release 11.0
2.8.7. nvJPEG: Release 11.0 RC
Last updated February 22, 2022
NVIDIA CUDA Toolkit Release Notes
The Release Notes for the CUDA Toolkit.
1. CUDA 11.6 Release Notes
The release notes for the NVIDIA® CUDA® Toolkit can be found online at http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html.
Note: The release notes have been reorganized into two major sections: the general CUDA release notes, and the CUDA libraries release notes, including historical information for 11.x releases.
1.1. CUDA Toolkit Major Component Versions
CUDA Components
Starting with CUDA 11, the various components in the toolkit are versioned independently. For CUDA 11.6, the table below indicates the versions:
Table 1. CUDA 11.6 Update 1 Component Versions

Component Name | Version Information | Supported Architectures
CUDA C++ Core Compute Libraries | 11.6.55 | x86_64, POWER, Arm64
CUDA Runtime (cudart) | 11.6.55 | x86_64, POWER, Arm64
cuobjdump | 11.6.112 | x86_64, POWER, Arm64
CUPTI | 11.6.112 | x86_64, POWER, Arm64
CUDA cuxxfilt (demangler) | 11.6.112 | x86_64, POWER, Arm64
CUDA Demo Suite | 11.6.55 | x86_64
CUDA GDB | 11.6.112 | x86_64, POWER, Arm64
CUDA Memcheck | 11.6.112 | x86_64, POWER
CUDA Nsight | 11.6.112 | x86_64, POWER
CUDA NVCC | 11.6.112 | x86_64, POWER, Arm64
CUDA nvdisasm | 11.6.104 | x86_64, POWER, Arm64
CUDA NVML Headers | 11.6.55 | x86_64, POWER, Arm64
CUDA nvprof | 11.6.112 | x86_64, POWER, Arm64
CUDA nvprune | 11.6.112 | x86_64, POWER, Arm64
CUDA NVRTC | 11.6.112 | x86_64, POWER, Arm64
CUDA NVTX | 11.6.112 | x86_64, POWER, Arm64
CUDA NVVP | 11.6.112 | x86_64, POWER
CUDA Samples | 11.6.101 | x86_64, POWER, Arm64
CUDA Compute Sanitizer API | 11.6.112 | x86_64, POWER, Arm64
CUDA cuBLAS | 11.8.1.74 | x86_64, POWER, Arm64
CUDA cuFFT | 10.7.1.112 | x86_64, POWER, Arm64
CUDA cuFile | 1.2.1.4 | x86_64
CUDA cuRAND | 10.2.9.55 | x86_64, POWER, Arm64
CUDA cuSOLVER | 11.3.3.112 | x86_64, POWER, Arm64
CUDA cuSPARSE | 11.7.2.112 | x86_64, POWER, Arm64
CUDA NPP | 11.6.2.112 | x86_64, POWER, Arm64
CUDA nvJPEG | 11.6.1.112 | x86_64, POWER, Arm64
Nsight Compute | 2022.1.1.2 | x86_64, POWER, Arm64 (CLI only)
NVTX | 1.21018621 | x86_64, POWER, Arm64
Nsight Systems | 2021.5.2.53 | x86_64, POWER, Arm64 (CLI only)
Nsight Visual Studio Edition (VSE) | 2022.1.1.22006 | x86_64 (Windows)
nvidia_fs¹ | 2.11.0 | x86_64
Visual Studio Integration | 11.6.112 | x86_64 (Windows)
NVIDIA Linux Driver | 510.47.03 | x86_64, POWER, Arm64
NVIDIA Windows Driver | 511.65 | x86_64 (Windows)
CUDA Driver
Running a CUDA application requires a system with at least one CUDA-capable GPU and a driver that is compatible with the CUDA Toolkit. See Table 3. For more information on various GPU products that are CUDA-capable, visit https://developer.nvidia.com/cuda-gpus.
Each release of the CUDA Toolkit requires a minimum version of the CUDA driver. The CUDA driver is backward compatible, meaning that applications compiled against a particular version of CUDA will continue to work on subsequent (later) driver releases.
More information on compatibility can be found at https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#cuda-compatibility-and-upgrades.
Note: Starting with CUDA 11.0, the toolkit components are individually versioned, and the toolkit itself is versioned as shown in the table below.
The minimum required driver version for CUDA minor version compatibility is shown below. CUDA minor version compatibility is described in detail in https://docs.nvidia.com/deploy/cuda-compatibility/index.html.
Table 2. CUDA Toolkit and Minimum Required Driver Version for CUDA Minor Version Compatibility

CUDA Toolkit | Linux x86_64 Minimum Required Driver Version* | Windows x86_64 Minimum Required Driver Version*
CUDA 11.6.x | >=450.80.02 | >=452.39
CUDA 11.5.x | >=450.80.02 | >=452.39
CUDA 11.4.x | >=450.80.02 | >=452.39
CUDA 11.3.x | >=450.80.02 | >=452.39
CUDA 11.2.x | >=450.80.02 | >=452.39
CUDA 11.1 (11.1.0) | >=450.80.02 | >=452.39
CUDA 11.0 (11.0.3) | >=450.36.06** | >=451.22**
* Using a Minimum Required Version that is different from the Toolkit Driver Version could be allowed in compatibility mode -- please read the CUDA Compatibility Guide for details.
** CUDA 11.0 was released with an earlier driver version, but by upgrading to Tesla Recommended Drivers 450.80.02 (Linux) / 452.39 (Windows), minor version compatibility is possible across the CUDA 11.x family of toolkits.
The version of the development NVIDIA GPU Driver packaged in each CUDA Toolkit release is shown below.
Table 3. CUDA Toolkit and Corresponding Driver Versions

CUDA Toolkit | Linux x86_64 Toolkit Driver Version | Windows x86_64 Toolkit Driver Version
CUDA 11.6 Update 1 | >=510.47.03 | >=511.65
CUDA 11.6 GA | >=510.39.01 | >=511.23
CUDA 11.5 Update 2 | >=495.29.05 | >=496.13
CUDA 11.5 Update 1 | >=495.29.05 | >=496.13
CUDA 11.5 GA | >=495.29.05 | >=496.04
CUDA 11.4 Update 4 | >=470.82.01 | >=472.50
CUDA 11.4 Update 3 | >=470.82.01 | >=472.50
CUDA 11.4 Update 2 | >=470.57.02 | >=471.41
CUDA 11.4 Update 1 | >=470.57.02 | >=471.41
CUDA 11.4.0 GA | >=470.42.01 | >=471.11
CUDA 11.3.1 Update 1 | >=465.19.01 | >=465.89
CUDA 11.3.0 GA | >=465.19.01 | >=465.89
CUDA 11.2.2 Update 2 | >=460.32.03 | >=461.33
CUDA 11.2.1 Update 1 | >=460.32.03 | >=461.09
CUDA 11.2.0 GA | >=460.27.03 | >=460.82
CUDA 11.1.1 Update 1 | >=455.32 | >=456.81
CUDA 11.1 GA | >=455.23 | >=456.38
CUDA 11.0.3 Update 1 | >=450.51.06 | >=451.82
CUDA 11.0.2 GA | >=450.51.05 | >=451.48
CUDA 11.0.1 RC | >=450.36.06 | >=451.22
CUDA 10.2.89 | >=440.33 | >=441.22
CUDA 10.1 (10.1.105 general release, and updates) | >=418.39 | >=418.96
CUDA 10.0.130 | >=410.48 | >=411.31
CUDA 9.2 (9.2.148 Update 1) | >=396.37 | >=398.26
CUDA 9.2 (9.2.88) | >=396.26 | >=397.44
CUDA 9.1 (9.1.85) | >=390.46 | >=391.29
CUDA 9.0 (9.0.76) | >=384.81 | >=385.54
CUDA 8.0 (8.0.61 GA2) | >=375.26 | >=376.51
CUDA 8.0 (8.0.44) | >=367.48 | >=369.30
CUDA 7.5 (7.5.16) | >=352.31 | >=353.66
CUDA 7.0 (7.0.28) | >=346.46 | >=347.62
For convenience, the NVIDIA driver is installed as part of the CUDA Toolkit installation. Note that this driver is for development purposes and is not recommended for use in production with Tesla GPUs.
For running CUDA applications in production with Tesla GPUs, it is recommended to download the latest driver for Tesla GPUs from the NVIDIA driver downloads site at https://www.nvidia.com/drivers.
During the installation of the CUDA Toolkit, the installation of the NVIDIA driver may be skipped on Windows (when using the interactive or silent installation) or on Linux (by using meta packages).
For more information on customizing the install process on Windows, see https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#install-cuda-software. For meta packages on Linux, see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#package-manager-metas.
1.2. General CUDA
11.6
Added a new API, cudaGraphNodeSetEnabled(), to allow disabling nodes in an instantiated graph. Support is limited to kernel nodes in this release. A corresponding API, cudaGraphNodeGetEnabled(), allows querying the enabled state of a node.
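A minimal sketch of the pair of APIs (error checking omitted; the kernel, launch configuration, and variable names are illustrative): a one-kernel graph is captured and instantiated, then the node is disabled so later launches skip it without re-instantiating the graph.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void work(int *x) { if (threadIdx.x == 0) atomicAdd(x, 1); }

int main() {
    int *x;
    cudaMallocManaged(&x, sizeof(int));
    *x = 0;
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Capture a one-kernel graph and instantiate it.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    work<<<1, 32, 0, s>>>(x);
    cudaStreamEndCapture(s, &graph);
    cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);

    // Disable the kernel node in the instantiated graph; it becomes a no-op.
    cudaGraphNode_t node;
    size_t n = 1;
    cudaGraphGetNodes(graph, &node, &n);
    cudaGraphNodeSetEnabled(exec, node, 0);

    unsigned int enabled;
    cudaGraphNodeGetEnabled(exec, node, &enabled);   // enabled == 0

    cudaGraphLaunch(exec, s);                        // the kernel is skipped
    cudaStreamSynchronize(s);
    printf("enabled=%u x=%d\n", enabled, *x);        // prints enabled=0 x=0
    return 0;
}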
Full release of the 128-bit integer (__int128) data type, including compiler and developer tools support. The host-side compiler must support the __int128 type to use this feature.
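A brief device-code sketch, assuming a host compiler with __int128 support such as gcc or clang on x86_64 (the kernel and input values are illustrative):

#include <cuda_runtime.h>
#include <cstdio>

// Compute the upper 64 bits of a full 128-bit product in device code.
__global__ void wide_mul(long long a, long long b, long long *hi) {
    __int128 p = (__int128)a * b;   // full 128-bit product
    *hi = (long long)(p >> 64);     // upper 64 bits
}

int main() {
    long long *hi;
    cudaMallocManaged(&hi, sizeof(long long));
    wide_mul<<<1, 1>>>(0x123456789abcdefLL, 0x0fedcba987654321LL, hi);
    cudaDeviceSynchronize();
    printf("high 64 bits: 0x%llx\n", (unsigned long long)*hi);
    return 0;
}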
The cooperative groups namespace is updated with new functions to improve consistency in naming, function scope, and unit dimension/size:

Implicit Group/Member | Threads | Blocks
thread_block:: | dim_threads, num_threads, thread_rank, thread_index | (not needed)
grid_group:: | num_threads, thread_rank | num_blocks, block_rank, block_index
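A minimal sketch using the renamed thread_block accessors (the kernel and launch configuration are illustrative):

#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

__global__ void who_am_i() {
    cg::thread_block blk = cg::this_thread_block();
    if (blk.thread_rank() == 0) {
        // num_threads() and dim_threads() replace the older size()/group_dim().
        printf("block of %u threads, dim (%u,%u,%u)\n",
               blk.num_threads(),
               blk.dim_threads().x, blk.dim_threads().y, blk.dim_threads().z);
    }
}

int main() {
    who_am_i<<<2, 64>>>();
    cudaDeviceSynchronize();
    return 0;
}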
Added the ability to disable NULL kernel graph node launches.
Added new NVML public APIs for querying functionality under Wayland.
Added L2 cache control descriptors for atomics.
Large CPU page support for UVM managed memory.
1.3. CUDA Compilers
11.6
VS 2022 Support: CUDA 11.6 officially supports the latest VS 2022 as host compiler. A separate Nsight Visual Studio installer, 2022.1.1, must be downloaded. A future CUDA release will have the Nsight Visual Studio installer with VS 2022 support integrated into it.
New instructions in public PTX: New instructions for bit mask creation (BMSK) and sign extension (SZEXT) are added to the public PTX ISA. You can find documentation for these instructions in the PTX ISA guide: BMSK and SZEXT.
Unused Kernel Optimization: In CUDA 11.5, unused kernel pruning was introduced, with the potential benefits of reducing binary size and improving performance through more efficient optimizations. This was an opt-in feature, but in 11.6 the feature is enabled by default. As mentioned in the CUDA 11.5 blog post, there is an opt-out flag that can be used in case it becomes necessary for debug purposes or for other special situations:
$ nvcc -rdc=true user.cu testlib.a -o user -Xnvlink -ignore-host-info
In addition to the -arch=all and -arch=all-major options added in CUDA 11.5, NVCC introduced -arch=native in CUDA 11.5 update 1. The -arch=native option is a convenient way for users to let NVCC determine the right target architecture to compile the CUDA device code for, based on the GPU installed on the system. This can be particularly helpful for testing when applications are run on the same system they are compiled on.
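For example (file names are illustrative):
$ nvcc -arch=native app.cu -o app      # target the GPU(s) present in this system
$ nvcc -arch=all app.cu -o app         # embed code for all supported architectures
$ nvcc -arch=all-major app.cu -o app   # embed code for all major architectures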
Generate PTX from nvlink: Using the following command line, the device linker, nvlink, will produce PTX as an output in addition to CUBIN:
nvcc -dlto -dlink -ptx
Device linking by nvlink is the final stage in the CUDA compilation process. Applications that have multiple source translation units have to be compiled in separate compilation mode. LTO (introduced in CUDA 11.4) allowed nvlink to perform optimizations at device link time instead of at compile time, so that separately compiled applications with several translation units can be optimized to the same level as whole-program compilations with a single translation unit. However, without the option to output PTX, applications that cared about forward compatibility of device code could not benefit from Link Time Optimization, or had to constrain the device code to a single source file.
With the option for nvlink that performs LTO to generate the output in PTX, customer applications that require forward compatibility across GPU architectures can span multiple files and can also take advantage of Link Time Optimization.
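A sketch of the workflow, assuming two translation units a.cu and b.cu (file names are illustrative):
$ nvcc -dc -dlto a.cu b.cu          # separate compilation with LTO intermediates
$ nvcc -dlto -dlink -ptx a.o b.o    # device link performing LTO, emitting PTX in addition to CUBIN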
Bullseye support: NVCC-compiled source code now works with the code coverage tool Bullseye. The code coverage is only for the CPU (host) functions; code coverage for device functions is not supported through Bullseye.
INT128 developer tools support: In 11.5, CUDA C++ support for 128-bit integers was added. In this release, developer tools support the data type as well. With the latest version of libcu++, the int128 data type is supported by math functions.
1.4. CUDA Developer Tools
For changes to nvprof and Visual Profiler, see the changelog.
For new features, improvements, and bug fixes in CUPTI, see the changelog.
For new features, improvements, and bug fixes in Nsight Compute, see the changelog.
1.5. Resolved Issues
1.5.1. CUDA Compilers
11.5 Update 1
When using the --fmad=false compiler option, even explicitly requested fused multiply-add instructions were decomposed into separate multiply and add, leading to loss of the algorithmic semantics intended by the programmer. One of the consequences was that CUDA Math APIs could not be trusted to deliver correct results; worst-case errors became unbounded. This issue was introduced in 11.5 and is now resolved.
Fixed a compiler optimization bug that could move memory access instructions across memory barriers, which could lead to incorrect runtime results with certain synchronization dependencies.
An issue in the PTX optimizer sometimes produced incorrect results. This issue is resolved.
11.5
Linking with cubins larger than 2 GB is now supported.
Certain C++17 features that were backported to C++14 in MSVC are now supported.
An issue with the use of a lambda function when an object is passed by value is resolved. https://github.com/Ahdhn/nvcc_bug_maybe
1.6. Deprecated Features
The following features are deprecated in the current release of the CUDA software. The features still work in the current release, but their documentation may have been removed, and they will become officially unsupported in a future release. We recommend that developers employ alternative solutions to these features in their software.
General CUDA
The cudaDeviceSynchronize() function used for on-device fork/join parallelism is deprecated in preparation for a replacement programming model with higher performance. These functions continue to work in this release, but the tools will emit a warning about the upcoming change.
CentOS Linux 8 reached End-of-Life on December 31, 2021, and support for this OS is now deprecated in the CUDA Toolkit. CentOS Linux 8 support will be completely removed in a future release.
1.7. Known Issues
1.7.1. General CUDA
Intermittent crashes were seen when CUDA binaries were running on a system with a GLIBC version older than 2.17-106.el7_2.1. This is due to a known bug in older versions of GLIBC (bug reference: https://bugzilla.redhat.com/show_bug.cgi?id=1293976) and has been fixed in later versions (>= glibc-2.17-107.el7).
1.7.2. CUDA Compiler
11.6 Update 1
Clang 13/PowerPC is not yet supported.
NVCC doesn't support the OpenMP 5.0 "#pragma begin/end declare variant ..." construct; any host compiler that supports OpenMP 5.0, such as clang 13, will not be supported with the -fopenmp option.
2. CUDA Libraries
This section covers CUDA Libraries release notes for 11.x releases.
The CUDA Math Libraries toolchain uses C++11 features, and a C++11-compatible standard library (libstdc++ >= 20150422) is required on the host.
CUDA Math libraries are no longer shipped for SM30 and SM32.
Support for the following compute capabilities is deprecated for all libraries:
sm_35 (Kepler)
sm_37 (Kepler)
sm_50 (Maxwell)
2.1. cuBLAS Library
2.1.1. cuBLAS: Release 11.6
New Features
New epilogue options have been added to support fusion in DL training: CUBLASLT_EPILOGUE_{DRELU,DGELU}, which are similar to CUBLASLT_EPILOGUE_{DRELU,DGELU}_BGRAD but don't compute the bias gradient.
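A minimal sketch of selecting the new epilogue on a matmul descriptor (error handling omitted; reluMask and maskLd are hypothetical names for the auxiliary ReLU bitmask produced by the forward pass and its leading dimension):

#include <cublasLt.h>

void set_drelu_epilogue(cublasLtMatmulDesc_t desc, void *reluMask, int64_t maskLd) {
    // Request the backward-ReLU epilogue without bias gradient computation.
    cublasLtEpilogue_t epi = CUBLASLT_EPILOGUE_DRELU;
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                   &epi, sizeof(epi));
    // DRELU consumes the auxiliary output saved during the forward pass.
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER,
                                   &reluMask, sizeof(reluMask));
    cublasLtMatmulDescSetAttribute(desc, CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD,
                                   &maskLd, sizeof(maskLd));
    // ... then call cublasLtMatmul() with this descriptor as usual ...
}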
Resolved Issues
Some syrk-related functions (cublas{D,Z}syrk, cublas{D,Z}syr2k, cublas{D,Z}syrkx) could fail for matrices whose size is greater than 2^31.
2.1.2. cuBLAS: Release 11.4 Update 3
Resolved Issues
Some cuBLAS and cuBLASLt functions sometimes returned CUBLAS_STATUS_EXECUTION_FAILED if the dynamic library was loaded and unloaded several times during the application lifetime within the same CUDA context. This issue has been resolved.
2.1.3. cuBLAS: Release 11.4 Update 2
New Features
Vector (and batched) alpha support for per-row scaling in TN int32 math Matmul with int8 output. See CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST and CUBLASLT_MATMUL_DESC_ALPHA_VECTOR_BATCH_STRIDE.
New epilogue options have been added to support fusion in DL training: CUBLASLT_EPILOGUE_BGRADA and CUBLASLT_EPILOGUE_BGRADB, which compute bias gradients based on matrices A and B, respectively.
New auxiliary functions cublasGetStatusName() and cublasGetStatusString() have been added to cuBLAS; they return the string representation and the description of the cuBLAS status (cublasStatus_t), respectively. Similarly, cublasLtGetStatusName() and cublasLtGetStatusString() have been added to cuBLASLt.
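A minimal sketch of the new status helpers:

#include <cublas_v2.h>
#include <stdio.h>

/* Turn a cuBLAS status code into readable diagnostics. */
void report(cublasStatus_t st) {
    if (st != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "cuBLAS error %s: %s\n",
                cublasGetStatusName(st), cublasGetStatusString(st));
}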
Known Issues
cublasGemmBatchedEx() and cublas<t>gemmBatched() check the alignment of the arrays of input/output pointers as if they were pointers to the actual matrices. These checks are irrelevant and will be disabled in future releases. This mostly affects half-precision input GEMMs, which might require 16-byte alignment, while an array of pointers can only be aligned to an 8-byte boundary.
Resolved Issues
cublasLtMatrixTransform can now operate on matrices with dimensions greater than 65535.
Fixed out-of-bounds access in GEMM and Matmul functions when split-K or a non-default epilogue is used and the leading dimension of the output matrix exceeds the int32_t limit.
NVBLAS now uses lazy loading of the CPU BLAS library on Linux to avoid issues caused by preloading libnvblas.so in complex applications that use fork and similar APIs.
Resolved a symbol name conflict when using the cuBLASLt static library with static TensorRT or cuDNN libraries.
2.1.4. cuBLAS: Release 11.4
Resolved Issues
Some gemv cases were producing incorrect results if the matrix dimension (n or m) was large, for example 2^20.
2.1.5. cuBLAS: Release 11.3 Update 1
New Features
Some new kernels have been added for improved performance, but they have the limitation that only host pointers are supported for scalars (for example, the alpha and beta parameters). This limitation is expected to be resolved in a future release.
New epilogues have been added to support fusion in ML training. These include:
ReLuBias and GeluBias epilogues that produce an auxiliary output which is used on backward propagation to compute the corresponding gradients.
DReLuBGrad and DGeluBGrad epilogues that compute the backpropagation of the corresponding activation function on matrix C and produce the bias gradient as a separate output. These epilogues require the auxiliary input mentioned in the bullet above.
Resolved Issues
Some tensor-core-accelerated strided batched GEMM routines would result in misaligned memory access exceptions when the batch stride wasn't a multiple of 8.
Tensor-core-accelerated cublasGemmBatchedEx (pointer-array) routines would use slower variants of kernels, assuming bad alignment of the pointers in the pointer array. Now it assumes that pointers are well aligned, as noted in the documentation.
Known Issues
To be able to access the fastest possible kernels through cublasLtMatmulAlgoGetHeuristic(), you need to set CUBLASLT_MATMUL_PREF_POINTER_MODE_MASK in the search preferences to CUBLASLT_POINTER_MODE_MASK_HOST or CUBLASLT_POINTER_MODE_MASK_NO_FILTERING. By default, the heuristics query assumes the pointer mode may change later and only returns algo configurations that support both _HOST and _DEVICE modes. Without this, newly added kernels will be excluded, and it will likely lead to a performance penalty on some problem sizes.
Deprecated Features
Linking with the static cuBLAS and cuBLASLt libraries on Linux now requires gcc 5.2 and compatible or higher due to C++11 requirements in these libraries.
2.1.6. cuBLAS: Release 11.3
Known Issues
The planar complex matrix descriptor for batched matmul has an inconsistent interpretation of the batch offset.
Mixed-precision operations with the reduction scheme CUBLASLT_REDUCTION_SCHEME_OUTPUT_TYPE (which might be automatically selected based on problem size by cublasSgemmEx() or cublasGemmEx() too, unless the CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION math mode bit is set) not only store intermediate results in the output type but also accumulate them internally in the same precision, which may result in lower-than-expected accuracy. Please use CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK or CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION if this results in numerical precision issues in your application.
2.1.7. cuBLAS: Release 11.2
Known Issues
cublasGemm() with very large n and m=k=1 may fail on Pascal devices.
2.1.8. cuBLAS: Release 11.1 Update 1
New Features
cuBLASLt logging is officially stable and no longer experimental. The cuBLASLt logging APIs are still experimental and may change in future releases.
Resolved Issues
cublasLtMatmul failed on Volta architecture GPUs with CUBLAS_STATUS_EXECUTION_FAILED when the n dimension > 262,137 and the epilogue bias feature was being used. This issue exists in the 11.0 and 11.1 releases but has been corrected in 11.1 Update 1.
2.1.9. cuBLAS: Release 11.1
Resolved Issues
A performance regression in the cublasCgetrfBatched and cublasCgetriBatched routines has been fixed.
The IMMA kernels do not support padding in matrix C and may corrupt the data when matrix C with padding is supplied to cublasLtMatmul. A suggested workaround is to supply matrix C with a leading dimension equal to 32 times the number of rows when targeting the IMMA kernels: computeType = CUDA_R_32I and CUBLASLT_ORDER_COL32 for matrices A, C, D, and CUBLASLT_ORDER_COL4_4R2_8C (on NVIDIA Ampere GPU architecture or Turing architecture) or CUBLASLT_ORDER_COL32_2R_4R4 (on NVIDIA Ampere GPU architecture) for matrix B. The matmul descriptor must specify CUBLAS_OP_T on matrix B and CUBLAS_OP_N (default) on matrices A and C. The data corruption behavior was fixed so that CUBLAS_STATUS_NOT_SUPPORTED is returned instead.
Fixed an issue that caused an Address out of bounds error when calling cublasSgemm().
2.1.10. cuBLAS: Release 11.0 Update 1
New Features
The cuBLAS API was extended with a new function, cublasSetWorkspace(), which allows the user to set the cuBLAS library workspace to a user-owned device buffer, which will be used by cuBLAS to execute all subsequent calls to the library on the currently set stream.
The cuBLASLt experimental logging mechanism can be enabled in two ways:
By setting the following environment variables before launching the target application:
CUBLASLT_LOG_LEVEL=<level> -- where level is one of the following levels:
"0" - Off - logging is disabled (default)
"1" - Error - only errors will be logged
"2" - Trace - API calls that launch CUDA kernels will log their parameters and important information
"3" - Hints - hints that can potentially improve the application's performance
"4" - Heuristics - heuristics log that may help users to tune their parameters
"5" - API Trace - API calls will log their parameters and important information
CUBLASLT_LOG_MASK=<mask> -- where mask is a combination of the following masks:
"0" - Off
"1" - Error
"2" - Trace
"4" - Hints
"8" - Heuristics
"16" - API Trace
CUBLASLT_LOG_FILE=<value> -- where value is a file name in the format of "<file_name>.%i"; %i will be replaced with the process id. If CUBLASLT_LOG_FILE is not defined, the log messages are printed to stdout.
By using the runtime API functions defined in the cublasLt header:
typedef void(*cublasLtLoggerCallback_t)(int logLevel, const char* functionName, const char* message) -- a type of callback function pointer.
cublasStatus_t cublasLtLoggerSetCallback(cublasLtLoggerCallback_t callback) -- allows setting a callback function that will be called for every message that is logged by the library.
cublasStatus_t cublasLtLoggerSetFile(FILE* file) -- allows setting the output file for the logger. The file must be open and have write permissions.
cublasStatus_t cublasLtLoggerOpenFile(const char* logFile) -- allows giving a path in which the logger should create the log file.
cublasStatus_t cublasLtLoggerSetLevel(int level) -- allows setting the log level to one of the above-mentioned levels.
cublasStatus_t cublasLtLoggerSetMask(int mask) -- allows setting the log mask to a combination of the above-mentioned masks.
cublasStatus_t cublasLtLoggerForceDisable() -- allows disabling the logger for the entire session. Once this API has been called, the logger cannot be reactivated in the current session.
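For example, logging can be enabled for a single run with CUBLASLT_LOG_LEVEL=5 ./app, or programmatically; a minimal sketch of the callback route follows (the callback body is illustrative):

#include <cublasLt.h>
#include <stdio.h>

/* Hypothetical callback: forward every cuBLASLt log message to stderr. */
static void my_logger(int logLevel, const char *functionName, const char *message) {
    fprintf(stderr, "[cublasLt:%d] %s: %s\n", logLevel, functionName, message);
}

void enable_lt_logging(void) {
    cublasLtLoggerSetCallback(my_logger);
    cublasLtLoggerSetLevel(5);   /* 5 = API Trace, per the levels above */
}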
2.1.11. cuBLAS: Release 11.0
New Features
cuBLASLt Matrix Multiplication adds support for fused ReLU and bias operations for all floating-point types except double precision (FP64).
Improved batched TRSM performance for matrices larger than 256.
2.1.12. cuBLAS: Release 11.0 RC
New Features
Many performance improvements have been implemented for NVIDIA Ampere, Volta, and Turing architecture based GPUs.
The cuBLASLt logging mechanism can be enabled by setting the following environment variables before launching the target application:
CUBLASLT_LOG_LEVEL=<level> -- where level is one of the following levels:
"0" - Off - logging is disabled (default)
"1" - Error - only errors will be logged
"2" - Trace - API calls will be logged with their parameters and important information
CUBLASLT_LOG_FILE=<value> -- where value is a file name in the format of "<file_name>.%i"; %i will be replaced with the process id. If CUBLASLT_LOG_FILE is not defined, the log messages are printed to stdout.
For matrix multiplication APIs cublasGemmEx, cublasGemmBatchedEx, cublasGemmStridedBatchedEx, and cublasLtMatmul: added new data type support for __nv_bfloat16 (CUDA_R_16BF).
A new compute type, TensorFloat32 (TF32), has been added to provide tensor core acceleration for FP32 matrix multiplication routines with full dynamic range and increased precision compared to BFLOAT16.
New compute modes Default, Pedantic, and Fast have been introduced to offer more control over the compute precision used.
Tensor cores are now enabled by default for half- and mixed-precision matrix multiplications.
Double-precision tensor cores (DMMA) are used automatically.
Tensor cores can now be used for all sizes and data alignments and for all GPU architectures:
Selection of these kernels through cuBLAS heuristics is automatic and will depend on factors such as the math mode setting as well as whether it will run faster than the non-tensor-core kernels.
Users should note that these new kernels, which use tensor cores for all unaligned cases, are expected to perform faster than non-tensor-core based kernels, but slower than kernels that can be run when all buffers are well aligned.
Deprecated Features
Algorithm selection in the cublasGemmEx APIs (including batched variants) is non-functional for NVIDIA Ampere architecture GPUs. Regardless of the selection, it will default to a heuristic selection. Users are encouraged to use the cublasLt APIs for algorithm selection functionality.
The matrix multiply math mode CUBLAS_TENSOR_OP_MATH is being deprecated and will be removed in a future release. Users are encouraged to use the new cublasComputeType_t enumeration to define the compute precision.
2.2. cuFFT Library
2.2.1. cuFFT: Release 11.5
Known Issues
FFTs of certain sizes in single and double precision (multiples of size 6) could fail on future devices. This issue will be fixed in an upcoming release.
2.2.2. cuFFT: Release 11.4 Update 2
Resolved Issues
Since cuFFT 10.3.0 (CUDA Toolkit 11.1), cuFFT may require the user to make sure that all operations on input and output buffers are complete before calling cufft[Xt]Exec* if:
sm70 or later, 3D FFT, batch > 1, and the total size of the transform is greater than 4.5 MB
the FFT size for all dimensions is in the set of the following sizes: {2, 4, 8, 16, 32, 64, 128, 3, 9, 81, 243, 729, 2187, 6561, 5, 25, 125, 625, 3125, 6, 36, 216, 1296, 7776, 7, 49, 343, 2401, 11, 121}
Some V100 FFTs were slower than expected. This issue is resolved.
Known Issues
Some T4 FFTs are slower than expected.
Plans for FFTs of certain sizes in single precision (including some multiples of 1024, and some large prime numbers) could fail on future devices with less than 64 kB of shared memory. This issue will be fixed in an upcoming release.
2.2.3. cuFFT: Release 11.4 Update 1
Resolved Issues
Some cuFFT multi-GPU plans exhibited very long creation times.
cuFFT sometimes produced incorrect results for real-to-complex and complex-to-real transforms when the total number of elements across all batches in a single execution exceeded 2147483647.
Known Issues
Some V100 FFTs are slower than expected.
Some T4 FFTs are slower than expected.
2.2.4. cuFFT: Release 11.4
New Features
Performance improvements.
Known Issues
Some T4 FFTs are slower than expected.
cuFFT may produce incorrect results for real-to-complex and complex-to-real transforms when the total number of elements across all batches in a single execution exceeds 2147483647.
Some cuFFT multi-GPU plans may exhibit very long creation times. This issue will be fixed in the next update.
cuFFT may produce incorrect results for transforms with strides when the index of the last element across all batches exceeds 2147483647 (see Advanced Data Layout).
Deprecated Features
Support for callback functionality using separately compiled device code is deprecated on all GPU architectures. Callback functionality will continue to be supported for all GPU architectures.
2.2.5. cuFFT: Release 11.3
New Features
cuFFT shared libraries are now linked statically against libstdc++ on Linux platforms.
Improved performance of certain sizes (multiples of large powers of 3, powers of 11) on SM86.
Known Issues
cuFFT planning and plan estimation functions may not restore the correct context, affecting CUDA driver API applications.
Plans with strides, primes larger than 127 in the FFT size decomposition, and a total transform size (including strides) bigger than 32 GB produce incorrect results.
2.2.6. cuFFT: Release 11.2 Update 2
Known Issues
cuFFT planning and plan estimation functions may not restore the correct context, affecting CUDA driver API applications.
Plans with strides, primes larger than 127 in the FFT size decomposition, and a total transform size (including strides) bigger than 32 GB produce incorrect results.
2.2.7. cuFFT: Release 11.2 Update 1
Resolved Issues
Previously, reduced performance of power-of-2 single-precision FFTs was observed on GPUs with the sm_86 architecture. This issue has been resolved.
Large prime factors in the size decomposition and real-to-complex or complex-to-real FFT types no longer cause cuFFT plan functions to fail.
Known Issues
cuFFT planning and plan estimation functions may not restore the correct context, affecting CUDA driver API applications.
Plans with strides, primes larger than 127 in the FFT size decomposition, and a total transform size (including strides) bigger than 32 GB produce incorrect results.
2.2.8. cuFFT: Release 11.2
New Features
Multi-GPU plans can be associated with a stream using the cufftSetStream API function call.
Performance improvements for R2C/C2C/C2R transforms.
Performance improvements for multi-GPU systems.
Resolved Issues
cuFFT is no longer stuck in a bad state if previous plan creation fails with CUFFT_ALLOC_FAILED.
Previously, single-dimensional multi-GPU FFT plans ignored user input in the whichGPUs argument of cufftXtSetGPUs and assumed that GPU IDs are always numbered from 0 to N-1. This issue has been resolved.
Plans with primes larger than 127 in the FFT size decomposition, or an FFT size that is a prime number bigger than 4093, do not perform calculations on second and subsequent cufftExecute* calls. The regression was introduced in cuFFT 11.1.
Known Issues
cuFFT planning and plan estimation functions may not restore the correct context, affecting CUDA driver API applications.
2.2.9. cuFFT: Release 11.1
New Features
cuFFT is now L2-cache aware and uses the L2 cache for GPUs with more than 4.5 MB of L2 cache. Performance may improve in certain single-GPU 3D C2C FFT cases.
After successfully creating a plan, cuFFT now enforces a lock on the cufftHandle. Subsequent calls to any planning function with the same cufftHandle will fail.
Added support for very large sizes (3k cube) to multi-GPU cuFFT on DGX-2.
Improved performance of multi-GPU cuFFT for certain sizes (1k cube).
Resolved Issues
Resolved an issue that caused cuFFT to crash when reusing a handle after clearing a callback.
Fixed an error which produced incorrect results/NaN values when running a real-to-complex FFT in half precision.
Known Issues
cuFFT will always overwrite the input for out-of-place C2R transforms.
Single-dimensional multi-GPU FFT plans ignore user input in the whichGPUs parameter of cufftXtSetGPUs() and assume that GPU IDs are always numbered from 0 to N-1.
2.2.10. cuFFT: Release 11.0 RC
New Features
cuFFT now accepts the __nv_bfloat16 input and output data type for power-of-two sizes, with single-precision computations within the kernels.
Reoptimized power-of-2 FFT kernels on Volta and Turing architectures.
Resolved Issues
Reduced R2C/C2R plan memory usage to previous levels.
Resolved a bug introduced in 10.1 update 1 that caused incorrect results when using custom strides, batched 2D plans, and certain sizes on Volta and later.
Known Issues
cuFFT modifies the C2R input buffer for some non-strided FFT plans.
There is a known issue with certain cuFFT plans that causes an assertion in the execution phase of certain plans. This applies to plans with all of the following characteristics: real input to complex output (R2C), in-place, native compatibility mode, certain even transform sizes, and more than one batch.
2.3. cuRAND Library
2.3.1. cuRAND: Release 11.5 Update 1
New Features
Improved performance of the CURAND_RNG_PSEUDO_MRG32K3A pseudorandom number generator when using the ordering CURAND_ORDERING_PSEUDO_BEST or CURAND_ORDERING_PSEUDO_DEFAULT.
Added a new type of order parameter: CURAND_ORDERING_PSEUDO_DYNAMIC.
Supported PRNGs:
CURAND_RNG_PSEUDO_XORWOW
CURAND_RNG_PSEUDO_MRG32K3A
CURAND_RNG_PSEUDO_MTGP32
CURAND_RNG_PSEUDO_PHILOX4_32_10
Improved performance compared to CURAND_ORDERING_PSEUDO_DEFAULT, especially on NVIDIA Ampere architecture GPUs.
The output ordering of generated random numbers for CURAND_ORDERING_PSEUDO_DYNAMIC depends on the number of SMs on a GPU, and thus can be different on different GPUs.
The CURAND_ORDERING_PSEUDO_DYNAMIC ordering can't be used with a host generator created using curandCreateGeneratorHost().
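A minimal sketch of requesting the new ordering on a supported device generator (error handling reduced to status returns):

#include <curand.h>

/* Create a device XORWOW generator with the new dynamic ordering
   (not usable with curandCreateGeneratorHost()). */
curandStatus_t make_dynamic_xorwow(curandGenerator_t *gen) {
    curandStatus_t st = curandCreateGenerator(gen, CURAND_RNG_PSEUDO_XORWOW);
    if (st != CURAND_STATUS_SUCCESS) return st;
    return curandSetGeneratorOrdering(*gen, CURAND_ORDERING_PSEUDO_DYNAMIC);
    /* ... then, e.g., curandGenerateUniform(*gen, d_out, n); ... */
}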
Resolved Issues
Added information about cuRAND thread safety.
Known Issues
CURAND_RNG_PSEUDO_XORWOW with ordering CURAND_ORDERING_PSEUDO_DYNAMIC can produce incorrect results on architectures newer than SM86.
2.3.2. cuRAND: Release 11.3
Resolved Issues
Fixed an inconsistency between random numbers generated by GPU and host generators when the CURAND_ORDERING_PSEUDO_LEGACY ordering is selected for certain generator types.
2.3.3. cuRAND: Release 11.0 Update 1
Resolved Issues
Fixed an issue that caused linker errors about multiple definitions of mtgp32dc_params_fast_11213 and mtgpdc_params_11213_num when including curand_mtgp32dc_p_11213.h in different compilation units.
2.3.4. cuRAND: Release 11.0
Resolved Issues
Fixed an issue that caused linker errors about multiple definitions of mtgp32dc_params_fast_11213 and mtgpdc_params_11213_num when including curand_mtgp32dc_p_11213.h in different compilation units.
2.3.5. cuRAND: Release 11.0 RC
Resolved Issues
Introduced the CURAND_ORDERING_PSEUDO_LEGACY ordering. Starting with CUDA 10.0, the ordering of random numbers returned by the MTGP32 and MRG32k3a generators is no longer the same as in previous releases, despite being guaranteed by the documentation for the CURAND_ORDERING_PSEUDO_DEFAULT setting. CURAND_ORDERING_PSEUDO_LEGACY provides pre-CUDA 10.0 ordering for the MTGP32 and MRG32k3a generators.
Starting with CUDA 11.0, CURAND_ORDERING_PSEUDO_DEFAULT is the same as CURAND_ORDERING_PSEUDO_BEST for all generators except MT19937. Only CURAND_ORDERING_PSEUDO_LEGACY is guaranteed to provide the same ordering for all future cuRAND releases.
2.4. cuSOLVER Library
2.4.1. cuSOLVER: Release 11.4
New Features
Introducing cusolverDnXtrtri, a new generic API for triangular matrix inversion (trtri).
Introducing cusolverDnXsytrs, a new generic API for solving systems of linear equations using a given factorized symmetric matrix from SYTRF.
2.4.2. cuSOLVER: Release 11.3
Known Issues
For values N <= 16, cusolverDn[S|D|C|Z]syevjBatched hits an out-of-bounds access and may deliver the wrong result. The workaround is to pad the matrix A with a diagonal matrix D such that the dimension of [A 0; 0 D] is bigger than 16. The diagonal entry D(j,j) must be bigger than the maximum eigenvalue of A, for example, norm(A, 'fro'). After the syevj, W(0:n-1) contains the eigenvalues and A(0:n-1, 0:n-1) contains the eigenvectors.
2.4.3. cuSOLVER: Release 11.2 Update 2
New Features
A new singular value decomposition (GESVDR) is added. GESVDR computes a partial spectrum with random sampling, an order of magnitude faster than GESVD.
libcusolver.so no longer links libcublas_static.a; instead, it depends on libcublas.so. This reduces the binary size of libcusolver.so. However, it breaks backward compatibility. The user has to link libcusolver.so with the correct version of libcublas.so.
2.4.4. cuSOLVER: Release 11.2
Resolved Issues
cusolverDnIRSXgels sometimes returned CUSOLVER_STATUS_INTERNAL_ERROR when the precision was 'z'. This issue has been fixed in CUDA 11.2; now cusolverDnIRSXgels works for all precisions.
ZSYTRF sometimes returned CUSOLVER_STATUS_INTERNAL_ERROR due to insufficient resources to launch the kernel. This issue has been fixed in CUDA 11.2.
GETRF returned early without finishing the whole factorization when the matrix was singular. This issue has been fixed in CUDA 11.2.
2.4.5. cuSOLVER: Release 11.1 Update 1
Resolved Issues
cusolverDnDDgels reported IRS_NOT_SUPPORTED when m > n. The issue has been fixed in release 11.1 U1, so cusolverDnDDgels will support m > n.
cusolverMgDeviceSelect could consume over 1 GB of device memory. The issue has been fixed in release 11.1 U1. The hidden memory allocation inside the cusolverMG handle is about 30 MB per device.
Known Issues
cusolverDnIRSXgels may return CUSOLVER_STATUS_INTERNAL_ERROR when the precision is 'z', due to insufficient workspace, which causes illegal memory access.
cusolverDnIRSXgels_bufferSize() does not report the correct size of the workspace. To work around the issue, the user has to allocate more workspace than what is reported by cusolverDnIRSXgels_bufferSize(). For example, if x is the size of workspace returned by cusolverDnIRSXgels_bufferSize(), then the user has to allocate (x + min(m,n)*sizeof(cuDoubleComplex)) bytes.
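A minimal sketch of the padding arithmetic from this workaround (the function and variable names are illustrative):

#include <cuda_runtime.h>
#include <cuComplex.h>
#include <stdint.h>

/* Pad the workspace reported by cusolverDnIRSXgels_bufferSize() by
   min(m,n) cuDoubleComplex elements, as the workaround requires. */
void *alloc_irs_workspace(size_t reported, int64_t m, int64_t n) {
    size_t mn = (size_t)((m < n) ? m : n);
    void *d_work = NULL;
    cudaMalloc(&d_work, reported + mn * sizeof(cuDoubleComplex));
    return d_work;
}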
2.4.6. cuSOLVER: Release 11.1
New Features
Added new 64-bit APIs:
cusolverDnXpotrf_bufferSize
cusolverDnXpotrf
cusolverDnXpotrs
cusolverDnXgeqrf_bufferSize
cusolverDnXgeqrf
cusolverDnXgetrf_bufferSize
cusolverDnXgetrf
cusolverDnXgetrs
cusolverDnXsyevd_bufferSize
cusolverDnXsyevd
cusolverDnXsyevdx_bufferSize
cusolverDnXsyevdx
cusolverDnXgesvd_bufferSize
cusolverDnXgesvd
Added a new SVD algorithm based on polar decomposition, called GESVDP, which uses the new 64-bit API, including cusolverDnXgesvdp_bufferSize and cusolverDnXgesvdp.
Deprecated Features
The following 64-bit APIs are deprecated:
cusolverDnPotrf_bufferSize
cusolverDnPotrf
cusolverDnPotrs
cusolverDnGeqrf_bufferSize
cusolverDnGeqrf
cusolverDnGetrf_bufferSize
cusolverDnGetrf
cusolverDnGetrs
cusolverDnSyevd_bufferSize
cusolverDnSyevd
cusolverDnSyevdx_bufferSize
cusolverDnSyevdx
cusolverDnGesvd_bufferSize
cusolverDnGesvd
2.4.7. cuSOLVER: Release 11.0
New Features
Added a 64-bit API for GESVD. The new routine cusolverDnGesvd_bufferSize() fills the missing parameters in the 32-bit API cusolverDn[S|D|C|Z]gesvd_bufferSize() such that it can estimate the size of the workspace accurately.
Added the single-process multi-GPU Cholesky factorization capabilities POTRF, POTRS, and POTRI to the cusolverMG library.
Resolved Issues
Fixed an issue where SYEVD/SYGVD would fail and return error code 7 if the matrix is zero and the dimension is bigger than 25.
2.5. cuSPARSE Library
2.5.1. cuSPARSE: Release 11.6 Update 1
New Features
Improved CSR cusparseSpMM Alg1 for column-major layout:
Better performance
Support for batched computation, custom row/column-major layout for B/C, and mixed-precision computation
Improved COO cusparseSpMM Alg3 with support for batched computation, custom row/column-major layout for B/C, and mixed-precision computation.
Improved mixed-precision computation of CSR/COO cusparseSpMV.
Added CSC format support for cusparseSpMV and cusparseSpMM.
Better error handling for JIT LTO cusparseSpMMOp.
cusparseSpMM now supports batches of sparse matrices.
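A minimal sketch of the newly supported CSC path through the generic SpMV API, computing y = A*x (error handling omitted; the device buffers are assumed to be allocated and populated):

#include <cuda_runtime.h>
#include <cusparse.h>

void spmv_csc(cusparseHandle_t h, int64_t rows, int64_t cols, int64_t nnz,
              int *dColOffsets, int *dRowInd, float *dVals,
              float *dX, float *dY) {
    cusparseSpMatDescr_t A;
    cusparseDnVecDescr_t x, y;
    float alpha = 1.0f, beta = 0.0f;

    // Describe A in the CSC layout that cusparseSpMV now accepts.
    cusparseCreateCsc(&A, rows, cols, nnz, dColOffsets, dRowInd, dVals,
                      CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                      CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
    cusparseCreateDnVec(&x, cols, dX, CUDA_R_32F);
    cusparseCreateDnVec(&y, rows, dY, CUDA_R_32F);

    size_t bufSize = 0;
    void *dBuf = NULL;
    cusparseSpMV_bufferSize(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, A, x,
                            &beta, y, CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT,
                            &bufSize);
    cudaMalloc(&dBuf, bufSize);
    cusparseSpMV(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, A, x,
                 &beta, y, CUDA_R_32F, CUSPARSE_SPMV_ALG_DEFAULT, dBuf);

    cudaFree(dBuf);
    cusparseDestroySpMat(A);
    cusparseDestroyDnVec(x);
    cusparseDestroyDnVec(y);
}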
Resolved Issues
cusparseDenseToSparse produced wrong results when the input matrix contained the floating-point value -0.0.
std::locale is no longer modified by cuSPARSE during initialization.
Added a note in the documentation of cusparseSpMMOp to report that the routine is not compatible with older CUDA driver versions and Android platforms.
Known Issues
cusparseSpSV and cusparseSpSM could produce wrong results if the output vector/matrix is not zero-initialized.
2.5.2. cuSPARSE: Release 11.6
New Features
Better performance for the cusparseSpGEMM, cusparseSpGEMMreuse, cusparseCsr2cscEx2, and cusparseDenseToSparse routines.
Resolved Issues
Fixed forward compatibility issues with axpby, rot, spvv, scatter, gather.
Fixed incorrect results in COO SpMM Alg1 which occurred in some rare cases.
2.5.3. cuSPARSE: Release 11.5 Update 1
New Features
New routine cusparseSpMMOp that exploits Just-In-Time Link-Time Optimization (JIT LTO) for providing sparse matrix-dense matrix multiplication with custom (user-defined) operators. See https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-generic-function-spmm-op.
cuSPARSE now supports logging functionalities. See https://docs.nvidia.com/cuda/cusparse/index.html#cusparse-logging.
Resolved Issues
Added memory requirements, graph capture, and asynchronous notes for cusparseXcsrsm2_analysis.
CSR, CSC, and COO format descriptions wrongly reported a sorted column indices requirement. All routines support unsorted column indices, except where strictly indicated.
Clarified cusparseSpSV and cusparseSpSM memory management.
cusparseSpSM produced wrong results in some cases when the matB operation is CUSPARSE_OPERATION_NON_TRANSPOSE or CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE.
cusparseSpSM produced wrong results in some cases when the matrix layout is row-major.
2.5.4. cuSPARSE: Release 11.4 Update 1
Resolved Issues
cusparseSpSV and cusparseSpSM could produce wrong results.
cusparseSpSV and cusparseSpSM did not work correctly when vecX == vecY or matB == matC.
2.5.5. cuSPARSE: Release 11.4
Known Issues
cusparseSpSV and cusparseSpSM could produce wrong results.
cusparseSpSV and cusparseSpSM do not work correctly when vecX == vecY or matB == matC.
2.5.6. cuSPARSE: Release 11.3 Update 1
New Features
Introduced a new routine for sparse matrix-sparse matrix multiplication (cusparseSpGEMMreuse) where the output matrix structure is reused for multiple computations. The new routine supports the CSR storage format and mixed-precision computation.
The sparse triangular solver adds support for the COO format.
Introduced a new routine for a sparse triangular solver with multiple right-hand sides, cusparseSpSM().
The cusparseDenseToSparse() routine adds conversion from dense matrices (row-major/column-major) to the Blocked-ELL format.
The Blocked-ELL format now supports empty blocks.
Better performance for Blocked-ELL SpMM with block size > 64, double data type, and alignments smaller than 128 bytes on NVIDIA Ampere sm80.
All cuSPARSE APIs are now asynchronous on platforms that support stream-ordered memory allocators (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#stream-ordered-querying-memory-support).
Improved NVTX trace with a distinction between light calls and kernel routines.
Resolved Issues
cusparseCnnz_compress produced wrong results when the number of rows is greater than 128 * resident CTAs.
cusparseSnnz produced wrong results for some particular sparsity patterns.
Deprecated Features
cusparseXcsrsm2_zeroPivot, cusparseXcsrsm2_solve, cusparseXcsrsm2_analysis, and cusparseScsrsm2_bufferSizeExt have been deprecated in favor of the cusparseSpSM generic APIs.
2.5.7. cuSPARSE: Release 11.3
New Features
Added a new routine, cusparseSpSV, for a sparse triangular solver with better performance. The new generic API supports:
CSR storage format
Non-transpose, transpose, and transpose-conjugate operations
Upper and lower fill modes
Unit and non-unit diagonal types
32-bit and 64-bit indices
Uniform data type computation
Deprecated Features
cusparseScsrsv2_analysis, cusparseScsrsv2_solve, cusparseXcsrsv2_zeroPivot, and cusparseScsrsv2_bufferSize have been deprecated in favor of cusparseSpSV.
2.5.8. cuSPARSE: Release 11.2 Update 2
Resolved Issues
cusparseDestroy(NULL) no longer crashes on Windows.
Known Issues
cusparseDestroySpVec, cusparseDestroyDnVec, cusparseDestroySpMat, cusparseDestroyDnMat, and cusparseDestroy with a NULL argument could cause a segmentation fault on Windows.
2.5.9. cuSPARSE: Release 11.2 Update 1
New Features
New Tensor-Core-accelerated Block Sparse Matrix-Matrix Multiplication (cusparseSpMM) and introduction of the Blocked-Ellpack storage format.
New algorithms for CSR/COO Sparse Matrix-Vector Multiplication (cusparseSpMV) with better performance.
Extended functionalities for cusparseSpMV:
Support for the CSC format.
Support for regular/complex bfloat16 data types for both uniform and mixed-precision computation.
Support for mixed regular-complex data type computation.
Support for deterministic and non-deterministic computation.
New algorithm (CUSPARSE_SPMM_CSR_ALG3) for Sparse Matrix-Matrix Multiplication (cusparseSpMM) with better performance, especially for small matrices.
New routine for Sampled Dense Matrix-Dense Matrix Multiplication (cusparseSDDMM), which deprecates cusparseConstrainedGeMM and provides better performance.
Better accuracy of cusparseAxpby, cusparseRot, and cusparseSpVV for bfloat16 and half regular/complex data types.
All routines support NVTX annotation for enhancing the profiler timeline on complex applications.
Resolved Issues
cusparseAxpby, cusparseGather, cusparseScatter, cusparseRot, cusparseSpVV, and cusparseSpMV now support zero-size matrices.
cusparseCsr2cscEx2 now correctly handles empty matrices (nnz = 0).
cusparseXcsr2csr_compress now uses the 2-norm for the comparison of complex values instead of only the real part.
Known Issues
cusparseDestroySpVec, cusparseDestroyDnVec, cusparseDestroySpMat, cusparseDestroyDnMat, and cusparseDestroy with a NULL argument could cause a segmentation fault on Windows.
Deprecated Features
cusparseConstrainedGeMM has been deprecated in favor of cusparseSDDMM.
cusparseCsrmvEx has been deprecated in favor of cusparseSpMV.
The COO Array of Structures (CooAoS) format has been deprecated, including cusparseCreateCooAoS, cusparseCooAoSGet, and its support for cusparseSpMV.
2.5.10. cuSPARSE: Release 11.2
Known Issues
cusparseXdense2csr provides incorrect results for some matrix sizes.
2.5.11. cuSPARSE: Release 11.1 Update 1
New Features
cusparseSparseToDense
Conversion from CSR, CSC, or COO to a dense representation
Supports row-major and column-major layouts
Supports all data types
Supports 32-bit and 64-bit indices
Provides performance 3x higher than cusparseXcsc2dense, cusparseXcsr2dense
cusparseDenseToSparse
Conversion from a dense representation to CSR, CSC, or COO
Supports row-major and column-major layouts
Supports all data types
Supports 32-bit and 64-bit indices
Provides performance 3x higher than cusparseXdense2csc, cusparseXdense2csr
Known Issues
cusparseXdense2csr provides incorrect results for some matrix sizes.
Deprecated Features
Legacy conversion routines: cusparseXcsc2dense, cusparseXcsr2dense, cusparseXdense2csc, cusparseXdense2csr
2.5.12. cuSPARSE: Release 11.0
New Features
Added new Generic APIs for Axpby (cusparseAxpby), Scatter (cusparseScatter), Gather (cusparseGather), and Givens rotation (cusparseRot). The __nv_bfloat16/__nv_bfloat162 data types and 64-bit indices are also supported.
This release adds the following features for cusparseSpMM:
Support for row-major layout for cusparseSpMM for both CSR and COO formats
Support for 64-bit indices
Support for __nv_bfloat16 and __nv_bfloat162 data types
Support for the following strided batch modes:
C_i = A ⋅ B_i
C_i = A_i ⋅ B
C_i = A_i ⋅ B_i
2.5.13. cuSPARSE: Release 11.0 RC
New Features
Added new Generic APIs for Axpby (cusparseAxpby), Scatter (cusparseScatter), Gather (cusparseGather), and Givens rotation (cusparseRot). The __nv_bfloat16/__nv_bfloat162 data types and 64-bit indices are also supported.
This release adds the following features for cusparseSpMM:
Support for row-major layout for cusparseSpMM for both CSR and COO formats
Support for 64-bit indices
Support for __nv_bfloat16 and __nv_bfloat162 data types
Support for the following strided batch modes:
C_i = A ⋅ B_i
C_i = A_i ⋅ B
C_i = A_i ⋅ B_i
Added new generic APIs and improved performance for sparse matrix-sparse matrix multiplication (SpGEMM): cusparseSpGEMM_workEstimation, cusparseSpGEMM_compute, and cusparseSpGEMM_copy.
SpVV: added support for __nv_bfloat16.
Deprecated Features
The following functions have been removed:
cusparsegemmi()
cusparseXaxpyi, cusparseXgthr, cusparseXgthrz, cusparseXroti, cusparseXsctr
Hybrid format enums and helper functions: cusparseHybPartition_t, cusparseCreateHybMat, cusparseDestroyHybMat
Triangular solver enums and helper functions: cusparseSolveAnalysisInfo_t, cusparseCreateSolveAnalysisInfo, cusparseDestroySolveAnalysisInfo
Sparse dot product: cusparseXdoti, cusparseXdotci
Sparse matrix-vector multiplication: cusparseXcsrmv, cusparseXcsrmv_mp
Sparse matrix-matrix multiplication: cusparseXcsrmm, cusparseXcsrmm2
Sparse triangular-single vector solver: cusparseXcsrsv_analysis, cusparseCsrsv_analysisEx, cusparseXcsrsv_solve, cusparseCsrsv_solveEx
Sparse triangular-multiple vectors solver: cusparseXcsrsm_analysis, cusparseXcsrsm_solve
Sparse hybrid format solver: cusparseXhybsv_analysis, cusparseShybsv_solve
Extra functions: cusparseXcsrgeamNnz, cusparseScsrgeam, cusparseXcsrgemmNnz, cusparseXcsrgemm
Incomplete Cholesky Factorization, level 0: cusparseXcsric0
Incomplete LU Factorization, level 0: cusparseXcsrilu0, cusparseCsrilu0Ex
Tridiagonal Solver: cusparseXgtsv, cusparseXgtsv_nopivot
Batched Tridiagonal Solver: cusparseXgtsvStridedBatch
Reordering: cusparseXcsc2hyb, cusparseXcsr2hyb, cusparseXdense2hyb, cusparseXhyb2csc, cusparseXhyb2csr, cusparseXhyb2dense
The following functions have been deprecated:
SpGEMM: cusparseXcsrgemm2_bufferSizeExt, cusparseXcsrgemm2Nnz, cusparseXcsrgemm2
2.6. Math Library
2.6.1. CUDA Math: Release 11.6
New Features
New half and bfloat16 APIs for addition/multiplication in round-to-nearest-even mode that do not get contracted into an fma instruction. Please see __hadd_rn, __hsub_rn, __hmul_rn, __hadd2_rn, __hsub2_rn, and __hmul2_rn in https://docs.nvidia.com/cuda/cuda-math-api/index.html.
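A minimal device-code sketch of the non-contracting variants (the kernel and values are illustrative):

#include <cuda_fp16.h>
#include <cstdio>

// The _rn variants stay as a separate multiply and a separate add in
// round-to-nearest-even mode; they are never contracted into a fused a*b+c.
__global__ void no_fma(__half a, __half b, __half c, __half *out) {
    __half p = __hmul_rn(a, b);
    *out = __hadd_rn(p, c);
}

int main() {
    __half *out;
    cudaMallocManaged(&out, sizeof(__half));
    no_fma<<<1, 1>>>(__float2half(1.5f), __float2half(2.0f),
                     __float2half(0.25f), out);
    cudaDeviceSynchronize();
    printf("%f\n", __half2float(*out));   // prints 3.25
    return 0;
}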
2.6.2. CUDA Math: Release 11.5
Deprecations
The following undocumented CUDA Math APIs are deprecated and will be removed in a future release. Please consider switching to the similar intrinsic APIs documented here: https://docs.nvidia.com/cuda/cuda-math-api/index.html
__device__ int mulhi(const int a, const int b)
__device__ unsigned int mulhi(const unsigned int a, const unsigned int b)
__device__ unsigned int mulhi(const int a, const unsigned int b)
__device__ unsigned int mulhi(const unsigned int a, const int b)
__device__ long long int mul64hi(const long long int a, const long long int b)
__device__ unsigned long long int mul64hi(const unsigned long long int a, const unsigned long long int b)
__device__ unsigned long long int mul64hi(const long long int a, const unsigned long long int b)
__device__ unsigned long long int mul64hi(const unsigned long long int a, const long long int b)
__device__ int float_as_int(const float a)
__device__ float int_as_float(const int a)
__device__ unsigned int float_as_uint(const float a)
__device__ float uint_as_float(const unsigned int a)
__device__ float saturate(const float a)
__device__ int mul24(const int a, const int b)
__device__ unsigned int umul24(const unsigned int a, const unsigned int b)
__device__ int float2int(const float a, const enum cudaRoundMode mode = cudaRoundZero)
__device__ unsigned int float2uint(const float a, const enum cudaRoundMode mode = cudaRoundZero)
__device__ float int2float(const int a, const enum cudaRoundMode mode = cudaRoundNearest)
__device__ float uint2float(const unsigned int a, const enum cudaRoundMode mode = cudaRoundNearest)
2.6.3. CUDA Math: Release 11.4
Beginning in 2022, the NVIDIA Math Libraries official hardware support will follow an N-2 policy, where N is an x100 series GPU.
2.6.4. CUDA Math: Release 11.3
Resolved Issues
Previous releases of CUDA were potentially delivering incorrect results in some Linux distributions for the following host Math APIs: sinpi, cospi, sincospi, sinpif, cospif, sincospif. If passed huge inputs like 7.3748776e+15 or 8258177.5, the results were not equal to 0 or 1. These have been corrected with this release.
2.6.5. CUDA Math: Release 11.1
New Features
Added host support for half and nv_bfloat16 conversions to/from integer types.
Added the __hcmadd() device-only API for fast half2- and nv_bfloat162-based complex multiply-accumulate.
2.6.6. CUDA Math: Release 11.0 Update 1
Resolved Issues
nv_bfloat16 comparison functions could trigger a fault with misaligned addresses.
Performance improvements in half and nv_bfloat16 basic arithmetic implementations.
2.6.7. CUDA Math: Release 11.0 RC
New Features
Added arithmetic support for the __nv_bfloat16 floating-point data type with 8 bits of exponent and 7 explicit bits of mantissa.
Performance and accuracy improvements in single-precision math functions: fmodf, expf, exp10f, sinhf, and coshf.
Resolved Issues
Corrected the documented maximum ulp error thresholds in erfcinvf and powf.
Improved cuda_fp16.h interoperability with the Visual Studio C++ compiler.
Updated the libdevice user guide and CUDA math API definitions for the j1, j1f, fmod, fmodf, ilogb, and ilogbf math functions.
2.7. NVIDIA Performance Primitives (NPP)
2.7.1. NPP: Release 11.5
New Features
New APIs added to compute the Signed Anti-aliased Distance Transform using PBA, the anti-aliased Euclidean distance between pixel sites in images. This will improve the accuracy of the distance transform.
nppiSignedDistanceTransformAbsPBA_xxxxx_C1R_Ctx() -- input and output combinations supported (xxxxx): 32f, 32f64f, 64f
New API for the Absolute Manhattan Distance Transform; another method to improve the accuracy of the distance transform, using the Manhattan distance transform between pixels.
nppiDistanceTransformAbsPBA_xxxxx_C1R_Ctx() -- input and output combinations supported (xxxxx): 8u16u, 8s16u, 16u16u, 16s16u, 8u32f, 8s32f, 16u32f, 16s32f, 8u64f, 8s64f, 16u64f, 16s64f, 32f64f, 64f
Resolved Issues
Fixed an issue in the FilterMedian() API by adding interpolation when the mask size is even.
Improved Contour function performance by parallelizing more of it and also improving quality.
Resolved an issue with Alpha composition being used to accumulate output buffers multiple times.
Resolved an issue with nppiLabelMarkersUF_8u32u_C1R producing incorrect results during column processing.
2.7.2. NPP: Release 11.4
New Features
New API FindContours. FindContours can be explained simply as a curve joining all the continuous points (along the boundary) having the same color or intensity. Contours are a useful tool for shape analysis and object detection and recognition.
2.7.3. NPP: Release 11.3
New Features
Added nppiDistanceTransformPBA functions.
2.7.4. NPP: Release 11.2 Update 2
New Features
Added nppiDistanceTransformPBA functions.
2.7.5. NPP: Release 11.2 Update 1
New Features
New APIs added to compute the Distance Transform using the Parallel Banding Algorithm (PBA):
nppiDistanceTransformPBA_xxxxx_C1R_Ctx() -- where xxxxx specifies the input and output combination: 8u16u, 8s16u, 16u16u, 16s16u, 8u32f, 8s32f, 16u32f, 16s32f
nppiSignedDistanceTransformPBA_32f_C1R_Ctx()
Resolved Issues
Fixed the issue in which LabelMarkers adds a zero pixel as an object region.
2.7.6. NPP: Release 11.0
New Features
Batched Image Label Markers Compression that removes sparseness between marker label IDs output from the LabelMarkers call.
Image Flood Fill functionality fills a connected region of an image with a specified new value.
Stability and performance fixes to Image Label Markers and Image Label Markers Compression.
2.7.7. NPP: Release 11.0 RC
New Features
Batched Image Label Markers Compression that removes sparseness between marker label IDs output from the LabelMarkers call.
Image Flood Fill functionality fills a connected region of an image with a specified new value.
Added batching support for nppiLabelMarkersUF functions.
Added the nppiCompressMarkerLabelsUF_32u_C1IR function.
Added nppiSegmentWatershed functions.
Added sample apps on GitHub demonstrating the use of NPP application managed stream contexts along with watershed segmentation and batched and compressed UF image label markers functions.
Added support for non-blocking streams.
Resolved Issues
Stability and performance fixes to Image Label Markers and Image Label Markers Compression.
Improved quality of nppiLabelMarkersUF functions.
nppiCompressMarkerLabelsUF_32u_C1IR can now handle a huge number of labels generated by the nppiLabelMarkersUF function.
Known Issues
The nppiCopy API is limited by the CUDA thread count for large image sizes. The maximum image limits are a minimum of 16 * 65,535 = 1,048,560 horizontal pixels of any data type and number of channels, and 8 * 65,535 = 524,280 vertical pixels, for a maximum total of 549,739,036,800 pixels.
2.8. nvJPEG Library
2.8.1. nvJPEG: Release 11.5 Update 1
Resolved Issues
Fixed the issue in which nvcuvid() released uncompressed frames, causing a memory leak.
2.8.2. nvJPEG: Release 11.4
Resolved Issues
Additional subsampling added to solve the NVJPEG_CSS_2x4 issue.
2.8.3. nvJPEG: Release 11.2 Update 1
New Features
The nvJPEG decoder added new APIs to support region-of-interest (ROI) based decoding for the batched hardware decoder:
nvjpegDecodeBatchedEx()
nvjpegDecodeBatchedSupportedEx()
2.8.4. nvJPEG: Release 11.1 Update 1
New Features
Added error handling capabilities for nonstandard JPEG images.
2.8.5. nvJPEG: Release 11.0 Update 1
Known Issues
NVJPEG_BACKEND_GPU_HYBRID has an issue when handling bit-streams which have corruption in the scan.
2.8.6. nvJPEG: Release 11.0
New Features
nvJPEG allows the user to allocate separate memory pools for each chroma subsampling format. This helps avoid memory re-allocation overhead. This can be controlled by passing the newly added flag NVJPEG_FLAGS_ENABLE_MEMORY_POOLS to the nvjpegCreateEx API.
The nvJPEG encoder now allows the compressed bitstream to reside in GPU memory.
2.8.7. nvJPEG: Release 11.0 RC
New Features
nvJPEG allows the user to allocate separate memory pools for each chroma subsampling format. This helps avoid memory re-allocation overhead. This can be controlled by passing the newly added flag NVJPEG_FLAGS_ENABLE_MEMORY_POOLS to the nvjpegCreateEx API.
The nvJPEG encoder now allows the compressed bitstream to reside in GPU memory.
Hardware-accelerated decode is now supported on NVIDIA A100.
The nvJPEG decode API (nvjpegDecodeJpeg()) now has the flexibility to select the backend when creating the nvjpegJpegDecoder_t object. The user has the option to call this API instead of making three separate calls to nvjpegDecodeJpegHost(), nvjpegDecodeJpegTransferToDevice(), and nvjpegDecodeJpegDevice().
Known Issues
NVJPEG_BACKEND_GPU_HYBRID has an issue when handling bit-streams which have corruption in the scan.
Deprecated Features
The following multiphase APIs have been removed:
nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseOne
nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseTwo
nvjpegStatus_t NVJPEGAPI nvjpegDecodePhaseThree
nvjpegStatus_t NVJPEGAPI nvjpegDecodeBatchedPhaseOne
nvjpegStatus_t NVJPEGAPI nvjpegDecodeBatchedPhaseTwo
Notices
Notice
This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation ("NVIDIA") makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer ("Terms of Sale"). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer's own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer's sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer's product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, "MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA's aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.
OpenCL
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.
Copyright
© 2007-2022 NVIDIA Corporation & affiliates. All rights reserved.
This product includes software developed by the Syncro Soft SRL (http://www.sync.ro/).
¹ Only available on select Linux distros.