Re: [petsc-users] 32-bit vs 64-bit GPU support
On Fri, 11 Aug 2023, Jed Brown wrote:

> Jacob Faibussowitsch writes:
>
> > More generally, it would be interesting to know the breakdown of installed CUDA versions for users. Unlike compilers etc, I suspect that cluster admins (and those running on local machines) are much more likely to be updating their CUDA toolkits to the latest versions as they often contain critical performance improvements.
>
> One difference is that some sites (not looking at you at all, ALCF) still run pretty ancient drivers and/or have broken GPU-aware MPI with all but a specific ancient version of CUDA (OLCF, LLNL). With a normal compiler, you can choose to use the latest version, but with CUDA, people are firmly stuck on old versions.

Well, Nvidia keeps phasing out support for older GPUs in newer CUDA releases, so unless GPUs are upgraded, they can't really upgrade (to latest) CUDA versions.

[This is in addition to the usual reasons admins don't do software upgrades. Ignore clusters: our CUDA CI machine has random stability issues, so we had to downgrade/freeze CUDA/driver versions to keep the machine functional.]

Satish
Re: [petsc-users] 32-bit vs 64-bit GPU support
Jacob Faibussowitsch writes:

> More generally, it would be interesting to know the breakdown of installed CUDA versions for users. Unlike compilers etc, I suspect that cluster admins (and those running on local machines) are much more likely to be updating their CUDA toolkits to the latest versions as they often contain critical performance improvements.

One difference is that some sites (not looking at you at all, ALCF) still run pretty ancient drivers and/or have broken GPU-aware MPI with all but a specific ancient version of CUDA (OLCF, LLNL). With a normal compiler, you can choose to use the latest version, but with CUDA, people are firmly stuck on old versions.
Re: [petsc-users] 32-bit vs 64-bit GPU support
> We should support it, but it still seems hypothetical and not urgent.

FWIW, cuBLAS only just added 64-bit int support with CUDA 12 (naturally, with a completely separate API).

More generally, it would be interesting to know the breakdown of installed CUDA versions for users. Unlike compilers etc, I suspect that cluster admins (and those running on local machines) are much more likely to be updating their CUDA toolkits to the latest versions as they often contain critical performance improvements.

It would help us decide on a minimum version to support. We don’t have any real idea of the current minimum version; last time it was estimated to be CUDA 7, IIRC?

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On Aug 11, 2023, at 15:38, Jed Brown wrote:
>
> Rohan Yadav writes:
>
>> With modern GPU sizes, for example A100's with 80GB of memory, a vector of length 2^31 is not that much memory -- one could conceivably run a CG solve with local vectors > 2^31.
>
> Yeah, each vector would be 8 GB (single precision) or 16 GB (double). You can't store a matrix of this size, and probably not a "mesh", but it's possible to create such a problem if everything is matrix-free (possibly with matrix-free geometric multigrid). This is more likely to show up in a benchmark than any real science or engineering problem. We should support it, but it still seems hypothetical and not urgent.
>
>> Thanks Junchao, I might look into that. However, I currently am not trying to solve such a large problem -- these questions just came from wondering why the cuSPARSE kernel PETSc was calling was running faster than mine.
>
> Hah, bandwidth doesn't lie. ;-)
Re: [petsc-users] 32-bit vs 64-bit GPU support
Rohan Yadav writes:

> With modern GPU sizes, for example A100's with 80GB of memory, a vector of length 2^31 is not that much memory -- one could conceivably run a CG solve with local vectors > 2^31.

Yeah, each vector would be 8 GB (single precision) or 16 GB (double). You can't store a matrix of this size, and probably not a "mesh", but it's possible to create such a problem if everything is matrix-free (possibly with matrix-free geometric multigrid). This is more likely to show up in a benchmark than any real science or engineering problem. We should support it, but it still seems hypothetical and not urgent.

> Thanks Junchao, I might look into that. However, I currently am not trying to solve such a large problem -- these questions just came from wondering why the cuSPARSE kernel PETSc was calling was running faster than mine.

Hah, bandwidth doesn't lie. ;-)
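The 8 GB and 16 GB figures in this message follow directly from the entry count times the entry width. A minimal sketch of that arithmetic, assuming 4-byte `float` and 8-byte `double`, and treating the A100's 80 GB as 80 GiB for simplicity (the names `vector_bytes`, `vectors_in_80GiB`, `kGiB`, and `kLength` are illustrative, not PETSc API):

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a vector of `length` entries, each `bytes_per_entry` bytes wide.
constexpr std::uint64_t vector_bytes(std::uint64_t length, std::uint64_t bytes_per_entry) {
  return length * bytes_per_entry;
}

constexpr std::uint64_t kGiB    = std::uint64_t{1} << 30;
constexpr std::uint64_t kLength = std::uint64_t{1} << 31; // 2^31 entries

// How many such vectors fit in 80 GiB of device memory, ignoring every other
// allocation (an assumption made only to keep the estimate simple).
constexpr std::uint64_t vectors_in_80GiB(std::uint64_t bytes_per_entry) {
  return (80 * kGiB) / vector_bytes(kLength, bytes_per_entry);
}

// Checked at compile time: 2^31 * 4 B = 8 GiB, 2^31 * 8 B = 16 GiB.
static_assert(vector_bytes(kLength, sizeof(float)) == 8 * kGiB, "single precision");
static_assert(vector_bytes(kLength, sizeof(double)) == 16 * kGiB, "double precision");
```

Under these assumptions, five double-precision vectors of length 2^31 would fill the card, which is consistent with the point that only a matrix-free solve could reach this size.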
Re: [petsc-users] 32-bit vs 64-bit GPU support
> We do not currently have any code for using 64 bit integer sizes on the GPUs.

Thank you, just wanted confirmation.

> Given the current memory available on GPUs, is 64 bit integer support needed? I think even a single vector of length 2^31 will use up most of the GPU's memory? Are there practical, not synthetic, situations that require 64 bit integer support on GPUs immediately? For example, is the vector length of the entire parallel vector across all GPUs limited to 32 bits?

With modern GPU sizes, for example A100's with 80GB of memory, a vector of length 2^31 is not that much memory -- one could conceivably run a CG solve with local vectors > 2^31.

Thanks Junchao, I might look into that. However, I currently am not trying to solve such a large problem -- these questions just came from wondering why the cuSPARSE kernel PETSc was calling was running faster than mine.

Rohan
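For readers following the thread, the failure under discussion is a 64-bit size that does not fit the 32-bit integer a GPU library interface accepts. The following is only a sketch of such a checked narrowing, not PETSc's actual `checkCupmBlasIntCast()` implementation; `fits_in_int32`, `checked_int32_cast`, and `kNumCols` are hypothetical names introduced here for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: true when a 64-bit size is representable as a
// non-negative 32-bit signed integer.
constexpr bool fits_in_int32(std::int64_t n) {
  return n >= 0 && n <= std::numeric_limits<std::int32_t>::max();
}

// Hypothetical helper: narrow `n` into `*out` only when it fits. On failure
// `*out` is left untouched and false is returned, so the caller can report an
// "argument out of range" style error instead of silently truncating.
inline bool checked_int32_cast(std::int64_t n, std::int32_t *out) {
  if (!fits_in_int32(n)) return false;
  *out = static_cast<std::int32_t>(n);
  return true;
}

// The column count used by the reproducer in this thread: 2 * INT_MAX.
constexpr std::int64_t kNumCols =
    std::int64_t{std::numeric_limits<std::int32_t>::max()} * 2;
static_assert(kNumCols == 4294967294LL, "same value as in the reported error");
static_assert(!fits_in_int32(kNumCols), "exceeds the 32-bit index range");
```

A size of 2^31 is itself already one past the largest 32-bit signed value, so local vectors of that length hit the same limit.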
Re: [petsc-users] 32-bit vs 64-bit GPU support
Rohan,

You could try the petsc/kokkos backend. I have not tested it, but I guess it should handle 64 bit CUDA index types.

I guess the petsc/cuda 32-bit limit came from old CUDA versions where only 32-bit indices were supported, such that the original developers hardwired the type to THRUSTINTARRAY32. We try to support generations of CUDA toolkits and thus have the current code.

Anyway, this should be fixed.

--Junchao Zhang

On Fri, Aug 11, 2023 at 1:07 PM Barry Smith wrote:

> We do not currently have any code for using 64 bit integer sizes on the GPUs.
>
> Given the current memory available on GPUs, is 64 bit integer support needed? I think even a single vector of length 2^31 will use up most of the GPU's memory? Are there practical, not synthetic, situations that require 64 bit integer support on GPUs immediately? For example, is the vector length of the entire parallel vector across all GPUs limited to 32 bits?
>
> We will certainly add such support, but it is a question of priorities; there are many things we need to do to improve PETSc GPU support, and they take time. Unless we have practical use cases, 64 bit integer support for integer sizes on the GPU is not at the top of the list. Of course, we would be very happy with a merge request that would provide this support at any time.
>
> Barry
>
> On Aug 11, 2023, at 1:23 PM, Rohan Yadav wrote:
>
> Hi,
>
> I was wondering what the official status of 64-bit integer support in the PETSc GPU backend is (specifically CUDA). This question comes from the result of benchmarking some PETSc code and looking at some sources. In particular, I found that PETSc's call to cuSPARSE SpMV seems to always be using the 32-bit integer call, even if I compile PETSc with `--with-64-bit-indices`. After digging around more, I see that PETSc always only creates 32-bit cuSPARSE matrices as well: https://gitlab.com/petsc/petsc/-/blob/v3.19.4/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu?ref_type=tags#L2501. I was looking around for a switch somewhere to 64 bit integers inside this code, but everything seems to be pretty hardcoded with `THRUSTINTARRAY32`.
>
> As expected, this all works when the range of coordinates in each sparse matrix partition is less than INT_MAX, but PETSc GPU code breaks in different ways (calling cuBLAS and cuSPARSE) when trying a (synthetic) problem that needs 64 bit integers:
>
> ```
> #include "petscmat.h"
> #include "petscvec.h"
> #include "petsc.h"
>
> int main(int argc, char** argv) {
>   PetscInt ierr;
>   PetscInitialize(&argc, &argv, (char *)0, "GPU bug");
>
>   PetscInt numRows = 1;
>   PetscInt numCols = PetscInt(INT_MAX) * 2;
>
>   Mat A;
>   PetscInt rowStart, rowEnd;
>   ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
>   MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, numRows, numCols);
>   MatSetType(A, MATMPIAIJ);
>   MatSetFromOptions(A);
>
>   MatSetValue(A, 0, 0, 1.0, INSERT_VALUES);
>   MatSetValue(A, 0, numCols - 1, 1.0, INSERT_VALUES);
>   MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>   MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>
>   Vec b;
>   ierr = VecCreate(PETSC_COMM_WORLD, &b); CHKERRQ(ierr);
>   VecSetSizes(b, PETSC_DECIDE, numCols);
>   VecSetFromOptions(b);
>   VecSet(b, 0.0);
>   VecSetValue(b, 0, 42.0, INSERT_VALUES);
>   VecSetValue(b, numCols - 1, 58.0, INSERT_VALUES);
>   VecAssemblyBegin(b);
>   VecAssemblyEnd(b);
>
>   Vec x;
>   ierr = VecCreate(PETSC_COMM_WORLD, &x); CHKERRQ(ierr);
>   VecSetSizes(x, PETSC_DECIDE, numRows);
>   VecSetFromOptions(x);
>   VecSet(x, 0.0);
>
>   MatMult(A, b, x);
>   PetscScalar result;
>   VecSum(x, &result);
>   PetscPrintf(PETSC_COMM_WORLD, "Result of mult: %f\n", result);
>   PetscFinalize();
> }
> ```
>
> When this program is run on CPUs, it outputs 100.0, as expected.
>
> When run on a single GPU with `-vec_type cuda -mat_type aijcusparse -use_gpu_aware_mpi 0` it fails with
> ```
> [0]PETSC ERROR: - Error Message --
> [0]PETSC ERROR: Argument out of range
> [0]PETSC ERROR: 4294967294 is too big for cuBLAS, which may be restricted to 32-bit integers
> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> [0]PETSC ERROR: Petsc Release Version 3.19.4, unknown
> [0]PETSC ERROR: ./gpu-bug on a named sean-dgx2 by rohany Fri Aug 11 09:34:10 2023
> [0]PETSC ERROR: Configure options --with-cuda=1 --prefix=/local/home/rohany/petsc/petsc-install/ --with-cuda-dir=/usr/local/cuda-11.7/ CXXFLAGS=-O3 COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 --download-fblaslapack=1 --with-debugging=0 --with-64-bit-indices
> [0]PETSC ERROR: #1 checkCupmBlasIntCast() at /local/home/rohany/petsc/include/petsc/private/cupmblasinterface.hpp:435
> [0]PETSC ERROR: #2 VecAllocateCheck_() at /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:335
> [0]PETSC ERROR: #3 VecCUPMAllocateCheck_() at
Re: [petsc-users] 32-bit vs 64-bit GPU support
We do not currently have any code for using 64 bit integer sizes on the GPUs.

Given the current memory available on GPUs, is 64 bit integer support needed? I think even a single vector of length 2^31 will use up most of the GPU's memory? Are there practical, not synthetic, situations that require 64 bit integer support on GPUs immediately? For example, is the vector length of the entire parallel vector across all GPUs limited to 32 bits?

We will certainly add such support, but it is a question of priorities; there are many things we need to do to improve PETSc GPU support, and they take time. Unless we have practical use cases, 64 bit integer support for integer sizes on the GPU is not at the top of the list. Of course, we would be very happy with a merge request that would provide this support at any time.

Barry

> On Aug 11, 2023, at 1:23 PM, Rohan Yadav wrote:
>
> Hi,
>
> I was wondering what the official status of 64-bit integer support in the PETSc GPU backend is (specifically CUDA). This question comes from the result of benchmarking some PETSc code and looking at some sources. In particular, I found that PETSc's call to cuSPARSE SpMV seems to always be using the 32-bit integer call, even if I compile PETSc with `--with-64-bit-indices`. After digging around more, I see that PETSc always only creates 32-bit cuSPARSE matrices as well: https://gitlab.com/petsc/petsc/-/blob/v3.19.4/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu?ref_type=tags#L2501. I was looking around for a switch somewhere to 64 bit integers inside this code, but everything seems to be pretty hardcoded with `THRUSTINTARRAY32`.
>
> As expected, this all works when the range of coordinates in each sparse matrix partition is less than INT_MAX, but PETSc GPU code breaks in different ways (calling cuBLAS and cuSPARSE) when trying a (synthetic) problem that needs 64 bit integers:
>
> ```
> #include "petscmat.h"
> #include "petscvec.h"
> #include "petsc.h"
>
> int main(int argc, char** argv) {
>   PetscInt ierr;
>   PetscInitialize(&argc, &argv, (char *)0, "GPU bug");
>
>   PetscInt numRows = 1;
>   PetscInt numCols = PetscInt(INT_MAX) * 2;
>
>   Mat A;
>   PetscInt rowStart, rowEnd;
>   ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
>   MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, numRows, numCols);
>   MatSetType(A, MATMPIAIJ);
>   MatSetFromOptions(A);
>
>   MatSetValue(A, 0, 0, 1.0, INSERT_VALUES);
>   MatSetValue(A, 0, numCols - 1, 1.0, INSERT_VALUES);
>   MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>   MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>
>   Vec b;
>   ierr = VecCreate(PETSC_COMM_WORLD, &b); CHKERRQ(ierr);
>   VecSetSizes(b, PETSC_DECIDE, numCols);
>   VecSetFromOptions(b);
>   VecSet(b, 0.0);
>   VecSetValue(b, 0, 42.0, INSERT_VALUES);
>   VecSetValue(b, numCols - 1, 58.0, INSERT_VALUES);
>   VecAssemblyBegin(b);
>   VecAssemblyEnd(b);
>
>   Vec x;
>   ierr = VecCreate(PETSC_COMM_WORLD, &x); CHKERRQ(ierr);
>   VecSetSizes(x, PETSC_DECIDE, numRows);
>   VecSetFromOptions(x);
>   VecSet(x, 0.0);
>
>   MatMult(A, b, x);
>   PetscScalar result;
>   VecSum(x, &result);
>   PetscPrintf(PETSC_COMM_WORLD, "Result of mult: %f\n", result);
>   PetscFinalize();
> }
> ```
>
> When this program is run on CPUs, it outputs 100.0, as expected.
>
> When run on a single GPU with `-vec_type cuda -mat_type aijcusparse -use_gpu_aware_mpi 0` it fails with
> ```
> [0]PETSC ERROR: - Error Message --
> [0]PETSC ERROR: Argument out of range
> [0]PETSC ERROR: 4294967294 is too big for cuBLAS, which may be restricted to 32-bit integers
> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> [0]PETSC ERROR: Petsc Release Version 3.19.4, unknown
> [0]PETSC ERROR: ./gpu-bug on a named sean-dgx2 by rohany Fri Aug 11 09:34:10 2023
> [0]PETSC ERROR: Configure options --with-cuda=1 --prefix=/local/home/rohany/petsc/petsc-install/ --with-cuda-dir=/usr/local/cuda-11.7/ CXXFLAGS=-O3 COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 --download-fblaslapack=1 --with-debugging=0 --with-64-bit-indices
> [0]PETSC ERROR: #1 checkCupmBlasIntCast() at /local/home/rohany/petsc/include/petsc/private/cupmblasinterface.hpp:435
> [0]PETSC ERROR: #2 VecAllocateCheck_() at /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:335
> [0]PETSC ERROR: #3 VecCUPMAllocateCheck_() at /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:360
> [0]PETSC ERROR: #4 DeviceAllocateCheck_() at /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:389
> [0]PETSC ERROR: #5 GetArray() at /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:545
> [0]PETSC ERROR: #6 VectorArray() at /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:273
> --
> MPI_ABORT was invoked on rank 0 in communicator
[petsc-users] 32-bit vs 64-bit GPU support
Hi,

I was wondering what the official status of 64-bit integer support in the PETSc GPU backend is (specifically CUDA). This question comes from the result of benchmarking some PETSc code and looking at some sources. In particular, I found that PETSc's call to cuSPARSE SpMV seems to always be using the 32-bit integer call, even if I compile PETSc with `--with-64-bit-indices`. After digging around more, I see that PETSc always only creates 32-bit cuSPARSE matrices as well: https://gitlab.com/petsc/petsc/-/blob/v3.19.4/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu?ref_type=tags#L2501. I was looking around for a switch somewhere to 64 bit integers inside this code, but everything seems to be pretty hardcoded with `THRUSTINTARRAY32`.

As expected, this all works when the range of coordinates in each sparse matrix partition is less than INT_MAX, but PETSc GPU code breaks in different ways (calling cuBLAS and cuSPARSE) when trying a (synthetic) problem that needs 64 bit integers:

```
#include "petscmat.h"
#include "petscvec.h"
#include "petsc.h"

int main(int argc, char** argv) {
  PetscInt ierr;
  PetscInitialize(&argc, &argv, (char *)0, "GPU bug");

  PetscInt numRows = 1;
  PetscInt numCols = PetscInt(INT_MAX) * 2;

  Mat A;
  PetscInt rowStart, rowEnd;
  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, numRows, numCols);
  MatSetType(A, MATMPIAIJ);
  MatSetFromOptions(A);

  MatSetValue(A, 0, 0, 1.0, INSERT_VALUES);
  MatSetValue(A, 0, numCols - 1, 1.0, INSERT_VALUES);
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  Vec b;
  ierr = VecCreate(PETSC_COMM_WORLD, &b); CHKERRQ(ierr);
  VecSetSizes(b, PETSC_DECIDE, numCols);
  VecSetFromOptions(b);
  VecSet(b, 0.0);
  VecSetValue(b, 0, 42.0, INSERT_VALUES);
  VecSetValue(b, numCols - 1, 58.0, INSERT_VALUES);
  VecAssemblyBegin(b);
  VecAssemblyEnd(b);

  Vec x;
  ierr = VecCreate(PETSC_COMM_WORLD, &x); CHKERRQ(ierr);
  VecSetSizes(x, PETSC_DECIDE, numRows);
  VecSetFromOptions(x);
  VecSet(x, 0.0);

  MatMult(A, b, x);
  PetscScalar result;
  VecSum(x, &result);
  PetscPrintf(PETSC_COMM_WORLD, "Result of mult: %f\n", result);
  PetscFinalize();
}
```

When this program is run on CPUs, it outputs 100.0, as expected.

When run on a single GPU with `-vec_type cuda -mat_type aijcusparse -use_gpu_aware_mpi 0` it fails with
```
[0]PETSC ERROR: - Error Message --
[0]PETSC ERROR: Argument out of range
[0]PETSC ERROR: 4294967294 is too big for cuBLAS, which may be restricted to 32-bit integers
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.19.4, unknown
[0]PETSC ERROR: ./gpu-bug on a named sean-dgx2 by rohany Fri Aug 11 09:34:10 2023
[0]PETSC ERROR: Configure options --with-cuda=1 --prefix=/local/home/rohany/petsc/petsc-install/ --with-cuda-dir=/usr/local/cuda-11.7/ CXXFLAGS=-O3 COPTFLAGS=-O3 CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 --download-fblaslapack=1 --with-debugging=0 --with-64-bit-indices
[0]PETSC ERROR: #1 checkCupmBlasIntCast() at /local/home/rohany/petsc/include/petsc/private/cupmblasinterface.hpp:435
[0]PETSC ERROR: #2 VecAllocateCheck_() at /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:335
[0]PETSC ERROR: #3 VecCUPMAllocateCheck_() at /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:360
[0]PETSC ERROR: #4 DeviceAllocateCheck_() at /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:389
[0]PETSC ERROR: #5 GetArray() at /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:545
[0]PETSC ERROR: #6 VectorArray() at /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:273
--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_SELF with errorcode 63.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on exactly when Open MPI kills them.
--
```

and when run with just `-mat_type aijcusparse -use_gpu_aware_mpi 0` it fails with
```
** On entry to cusparseCreateCsr(): dimension mismatch for CUSPARSE_INDEX_32I, cols (4294967294) + base (0) > INT32_MAX (2147483647)
[0]PETSC ERROR: - Error Message --
[0]PETSC ERROR: GPU error
[0]PETSC ERROR: cuSPARSE errorcode 3 (CUSPARSE_STATUS_INVALID_VALUE) : invalid value
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.19.4, unknown
[0]PETSC ERROR: ./gpu-bug on a named sean-dgx2 by rohany Fri Aug 11 09:43:07 2023
[0]PETSC ERROR: Configure options --with-cuda=1 --prefix=/local/home/rohany/petsc/petsc-install/ --with-cuda-dir=/usr/local/cuda-11.7/ CXXFLAGS=-O3 COPTFLAGS=-O3