Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Satish Balay via petsc-users
On Fri, 11 Aug 2023, Jed Brown wrote:

> Jacob Faibussowitsch writes:
> 
> > More generally, it would be interesting to know the breakdown of installed 
> > CUDA versions for users. Unlike compilers etc, I suspect that cluster 
> > admins (and those running on local machines) are much more likely to be 
> > updating their CUDA toolkits to the latest versions as they often contain 
> > critical performance improvements.
> 
> One difference is that some sites (not looking at you at all, ALCF) still run 
> pretty ancient drivers and/or have broken GPU-aware MPI with all but a 
> specific ancient version of CUDA (OLCF, LLNL). With a normal compiler, you 
> can choose to use the latest version, but with CUDA, people are firmly stuck 
> on old versions.
> 

Well, NVIDIA keeps phasing out support for older GPUs in newer CUDA releases, 
so unless the GPUs themselves are upgraded, sites can't really upgrade to the 
latest CUDA versions.

[This is in addition to the usual reasons admins don't do software upgrades... 
Never mind clusters: even our CUDA CI machine has random stability issues, so 
we had to downgrade and freeze its CUDA/driver versions to keep the machine 
functional.]

Satish



Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Jed Brown
Jacob Faibussowitsch writes:

> More generally, it would be interesting to know the breakdown of installed 
> CUDA versions for users. Unlike compilers etc, I suspect that cluster admins 
> (and those running on local machines) are much more likely to be updating 
> their CUDA toolkits to the latest versions as they often contain critical 
> performance improvements.

One difference is that some sites (not looking at you at all, ALCF) still run 
pretty ancient drivers and/or have broken GPU-aware MPI with all but a specific 
ancient version of CUDA (OLCF, LLNL). With a normal compiler, you can choose to 
use the latest version, but with CUDA, people are firmly stuck on old versions.


Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Jacob Faibussowitsch
> We should support it, but it still seems hypothetical and not urgent.

FWIW, cuBLAS only just added 64-bit int support with CUDA 12 (naturally, with a 
completely separate API). 
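
For reference, a minimal sketch of the difference, assuming I have the CUDA 12 docs right that the new entry points are the old names with a `_64` suffix taking `int64_t` counts (handle setup is the usual cuBLAS boilerplate and is omitted):

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdint.h>

/* Sketch only: the same axpy through the classic 32-bit-int cuBLAS API and,
   when built against CUDA 12+, the 64-bit-int variant. */
static cublasStatus_t axpy_sketch(cublasHandle_t h, int64_t n,
                                  const double *alpha,
                                  const double *x, double *y)
{
#if CUDART_VERSION >= 12000
  /* 64-bit interface: n, incx, incy are int64_t, so n may exceed INT32_MAX */
  return cublasDaxpy_64(h, n, alpha, x, 1, y, 1);
#else
  /* classic interface: n must fit in a 32-bit int */
  if (n > INT32_MAX) return CUBLAS_STATUS_INVALID_VALUE;
  return cublasDaxpy(h, (int)n, alpha, x, 1, y, 1);
#endif
}
```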

More generally, it would be interesting to know the breakdown of installed CUDA 
versions for users. Unlike compilers etc, I suspect that cluster admins (and 
those running on local machines) are much more likely to be updating their CUDA 
toolkits to the latest versions as they often contain critical performance 
improvements.

It would help us decide on the minimum version to support. We don’t have any 
real idea of the current minimum version; last time it was estimated to be 
CUDA 7, IIRC.
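
If we wanted to collect that breakdown, something as small as the sketch below, using the standard `cudaRuntimeGetVersion`/`cudaDriverGetVersion` calls (which report 1000*major + 10*minor), would be enough for users to paste back:

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
  int runtime = 0, driver = 0;
  cudaRuntimeGetVersion(&runtime); /* toolkit the binary was built against */
  cudaDriverGetVersion(&driver);   /* newest CUDA the installed driver supports */
  printf("CUDA runtime %d.%d, driver supports up to %d.%d\n",
         runtime / 1000, (runtime % 1000) / 10,
         driver / 1000, (driver % 1000) / 10);
  return 0;
}
```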

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On Aug 11, 2023, at 15:38, Jed Brown wrote:
> 
> Rohan Yadav writes:
> 
>> With modern GPU sizes, for example A100's with 80GB of memory, a vector of
>> length 2^31 is not that much memory -- one could conceivably run a CG solve
>> with local vectors > 2^31.
> 
> Yeah, each vector would be 8 GB (single precision) or 16 GB (double). You 
> can't store a matrix of this size, and probably not a "mesh", but it's 
> possible to create such a problem if everything is matrix-free (possibly with 
> matrix-free geometric multigrid). This is more likely to show up in a 
> benchmark than any real science or engineering problem. We should support it,
> but it still seems hypothetical and not urgent.
> 
>> Thanks Junchao, I might look into that. However, I currently am not trying
>> to solve such a large problem -- these questions just came from wondering
>> why the cuSPARSE kernel PETSc was calling was running faster than mine.
> 
> Hah, bandwidth doesn't lie. ;-)



Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Jed Brown
Rohan Yadav writes:

> With modern GPU sizes, for example A100's with 80GB of memory, a vector of
> length 2^31 is not that much memory -- one could conceivably run a CG solve
> with local vectors > 2^31.

Yeah, each vector would be 8 GB (single precision) or 16 GB (double). You can't 
store a matrix of this size, and probably not a "mesh", but it's possible to 
create such a problem if everything is matrix-free (possibly with matrix-free 
geometric multigrid). This is more likely to show up in a benchmark than any 
real science or engineering problem. We should support it, but it still seems 
hypothetical and not urgent.
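
(For concreteness, a quick check of that arithmetic, assuming 2^31 local entries:)

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
  const int64_t n = INT64_C(1) << 31;                           /* 2^31 local entries */
  printf("single precision: %.0f GiB\n", n * 4.0 / (1 << 30)); /*  8 GiB */
  printf("double precision: %.0f GiB\n", n * 8.0 / (1 << 30)); /* 16 GiB */
  return 0;
}
```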

> Thanks Junchao, I might look into that. However, I currently am not trying
> to solve such a large problem -- these questions just came from wondering
> why the cuSPARSE kernel PETSc was calling was running faster than mine.

Hah, bandwidth doesn't lie. ;-)


Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Rohan Yadav
> We do not currently have any code for using 64 bit integer sizes on
> the GPUs.

Thank you, just wanted confirmation.

> Given the current memory available on GPUs, is 64 bit integer support
> needed? I think even a single vector of length 2^31 will use up most of the
> GPU's memory. Are there practical, not synthetic, situations that require 64
> bit integer support on GPUs immediately? For example, is the vector length
> of the entire parallel vector across all GPUs limited to 32 bits?

With modern GPU sizes, for example A100's with 80GB of memory, a vector of
length 2^31 is not that much memory -- one could conceivably run a CG solve
with local vectors > 2^31.

Thanks Junchao, I might look into that. However, I currently am not trying
to solve such a large problem -- these questions just came from wondering
why the cuSPARSE kernel PETSc was calling was running faster than mine.

Rohan


Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Junchao Zhang
Rohan,
  You could try the petsc/kokkos backend.  I have not tested it, but I
guess it should handle 64 bit CUDA index types.
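
  A rough, untested sketch of what I mean (it assumes a PETSc build configured
with --download-kokkos --download-kokkos-kernels; the helper name is just for
illustration, and the same selection can be made at runtime with -vec_type
kokkos -mat_type aijkokkos):

```c
#include "petscmat.h"
#include "petscvec.h"

/* Illustrative helper (hypothetical name): switch the objects from the
   CUDA-specific types to the Kokkos back end, which is what
   -mat_type aijkokkos -vec_type kokkos would select from the command line. */
static PetscErrorCode UseKokkosTypes(Mat A, Vec b, Vec x)
{
  PetscFunctionBeginUser;
  PetscCall(MatSetType(A, MATAIJKOKKOS));
  PetscCall(VecSetType(b, VECKOKKOS));
  PetscCall(VecSetType(x, VECKOKKOS));
  PetscFunctionReturn(PETSC_SUCCESS);
}
```
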
  I guess the petsc/cuda 32-bit limit came from old CUDA versions where
only 32-bit indices were supported, so the original developers hardwired the
type to THRUSTINTARRAY32.  We try to support multiple generations of CUDA
toolkits, hence the current code.

  Anyway, this should be fixed.
--Junchao Zhang


On Fri, Aug 11, 2023 at 1:07 PM Barry Smith wrote:

>
> We do not currently have any code for using 64 bit integer sizes on the
> GPUs.
>
> Given the current memory available on GPUs, is 64 bit integer support
> needed? I think even a single vector of length 2^31 will use up most of the
> GPU's memory. Are there practical, not synthetic, situations that require 64
> bit integer support on GPUs immediately? For example, is the vector length
> of the entire parallel vector across all GPUs limited to 32 bits?
>
> We will certainly add such support, but it is a question of priorities;
> there are many things we need to do to improve PETSc GPU support, and they
> take time. Unless we have practical use cases, 64 bit integer support on the
> GPU is not at the top of the list. Of course, we would be very happy with a
> merge request that would provide this support at any time.
>
>   Barry
>
>
>
> On Aug 11, 2023, at 1:23 PM, Rohan Yadav wrote:
>
> Hi,
>
> I was wondering what the official status of 64-bit integer support in the
> PETSc GPU backend is (specifically CUDA). This question comes from the
> result of benchmarking some PETSc code and looking at some sources. In
> particular, I found that PETSc's call to cuSPARSE SpMV seems to always be
> using the 32-bit integer call, even if I compile PETSc with
> `--with-64-bit-indices`. After digging around more, I see that PETSc always
> only creates 32-bit cuSPARSE matrices as well:
> https://gitlab.com/petsc/petsc/-/blob/v3.19.4/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu?ref_type=tags#L2501.
> I was looking around for a switch somewhere to 64 bit integers inside this
> code, but everything seems to be pretty hardcoded with `THRUSTINTARRAY32`.
>
> As expected, this all works when the range of coordinates in each sparse
> matrix partition is less than INT_MAX, but PETSc GPU code breaks in
> different ways (calling cuBLAS and cuSPARSE) when trying a (synthetic)
> problem that needs 64 bit integers:
>
> ```
> #include "petscmat.h"
> #include "petscvec.h"
> #include "petsc.h"
>
> int main(int argc, char** argv) {
>   PetscInt ierr;
>   PetscInitialize(&argc, &argv, (char *)0, "GPU bug");
>
>   PetscInt numRows = 1;
>   PetscInt numCols = PetscInt(INT_MAX) * 2;
>
>   Mat A;
>   PetscInt rowStart, rowEnd;
>   ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
>   MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, numRows, numCols);
>   MatSetType(A, MATMPIAIJ);
>   MatSetFromOptions(A);
>
>   MatSetValue(A, 0, 0, 1.0, INSERT_VALUES);
>   MatSetValue(A, 0, numCols - 1, 1.0, INSERT_VALUES);
>   MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>   MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>
>   Vec b;
>   ierr = VecCreate(PETSC_COMM_WORLD, &b); CHKERRQ(ierr);
>   VecSetSizes(b, PETSC_DECIDE, numCols);
>   VecSetFromOptions(b);
>   VecSet(b, 0.0);
>   VecSetValue(b, 0, 42.0, INSERT_VALUES);
>   VecSetValue(b, numCols - 1, 58.0, INSERT_VALUES);
>   VecAssemblyBegin(b);
>   VecAssemblyEnd(b);
>
>   Vec x;
>   ierr = VecCreate(PETSC_COMM_WORLD, &x); CHKERRQ(ierr);
>   VecSetSizes(x, PETSC_DECIDE, numRows);
>   VecSetFromOptions(x);
>   VecSet(x, 0.0);
>
>   MatMult(A, b, x);
>   PetscScalar result;
>   VecSum(x, &result);
>   PetscPrintf(PETSC_COMM_WORLD, "Result of mult: %f\n", result);
>   PetscFinalize();
> }
> ```
>
> When this program is run on CPUs, it outputs 100.0, as expected.
>
> When run on a single GPU with `-vec_type cuda -mat_type aijcusparse
> -use_gpu_aware_mpi 0` it fails with
> ```
> [0]PETSC ERROR: - Error Message
> --
> [0]PETSC ERROR: Argument out of range
> [0]PETSC ERROR: 4294967294 is too big for cuBLAS, which may be restricted
> to 32-bit integers
> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> [0]PETSC ERROR: Petsc Release Version 3.19.4, unknown
> [0]PETSC ERROR: ./gpu-bug on a  named sean-dgx2 by rohany Fri Aug 11
> 09:34:10 2023
> [0]PETSC ERROR: Configure options --with-cuda=1
> --prefix=/local/home/rohany/petsc/petsc-install/
> --with-cuda-dir=/usr/local/cuda-11.7/ CXXFLAGS=-O3 COPTFLAGS=-O3
> CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 --download-fblaslapack=1 --with-debugging=0
> --with-64-bit-indices
> [0]PETSC ERROR: #1 checkCupmBlasIntCast() at
> /local/home/rohany/petsc/include/petsc/private/cupmblasinterface.hpp:435
> [0]PETSC ERROR: #2 VecAllocateCheck_() at
> /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:335
> [0]PETSC ERROR: #3 VecCUPMAllocateCheck_() at
> 

Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Barry Smith

   We do not currently have any code for using 64 bit integer sizes on the 
GPUs. 

   Given the current memory available on GPUs, is 64 bit integer support needed? 
I think even a single vector of length 2^31 will use up most of the GPU's 
memory. Are there practical, not synthetic, situations that require 64 bit 
integer support on GPUs immediately? For example, is the vector length of the 
entire parallel vector across all GPUs limited to 32 bits?

   We will certainly add such support, but it is a question of priorities; 
there are many things we need to do to improve PETSc GPU support, and they take 
time. Unless we have practical use cases, 64 bit integer support on the GPU is 
not at the top of the list. Of course, we would be very 
happy with a merge request that would provide this support at any time.

  Barry



> On Aug 11, 2023, at 1:23 PM, Rohan Yadav wrote:
> 
> Hi,
> 
> I was wondering what the official status of 64-bit integer support in the 
> PETSc GPU backend is (specifically CUDA). This question comes from the result 
> of benchmarking some PETSc code and looking at some sources. In particular, I 
> found that PETSc's call to cuSPARSE SpMV seems to always be using the 32-bit 
> integer call, even if I compile PETSc with `--with-64-bit-indices`. After 
> digging around more, I see that PETSc always only creates 32-bit cuSPARSE 
> matrices as well: 
> https://gitlab.com/petsc/petsc/-/blob/v3.19.4/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu?ref_type=tags#L2501.
>  I was looking around for a switch somewhere to 64 bit integers inside this 
> code, but everything seems to be pretty hardcoded with `THRUSTINTARRAY32`.
> 
> As expected, this all works when the range of coordinates in each sparse 
> matrix partition is less than INT_MAX, but PETSc GPU code breaks in different 
> ways (calling cuBLAS and cuSPARSE) when trying a (synthetic) problem that 
> needs 64 bit integers:
> 
> ```
> #include "petscmat.h"
> #include "petscvec.h"
> #include "petsc.h"
> 
> int main(int argc, char** argv) {
>   PetscInt ierr;
>   PetscInitialize(&argc, &argv, (char *)0, "GPU bug");
> 
>   PetscInt numRows = 1;
>   PetscInt numCols = PetscInt(INT_MAX) * 2;
> 
>   Mat A;
>   PetscInt rowStart, rowEnd;
>   ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
>   MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, numRows, numCols);
>   MatSetType(A, MATMPIAIJ);
>   MatSetFromOptions(A);
> 
>   MatSetValue(A, 0, 0, 1.0, INSERT_VALUES);
>   MatSetValue(A, 0, numCols - 1, 1.0, INSERT_VALUES);
>   MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>   MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
> 
>   Vec b;
>   ierr = VecCreate(PETSC_COMM_WORLD, &b); CHKERRQ(ierr);
>   VecSetSizes(b, PETSC_DECIDE, numCols);
>   VecSetFromOptions(b);
>   VecSet(b, 0.0);
>   VecSetValue(b, 0, 42.0, INSERT_VALUES);
>   VecSetValue(b, numCols - 1, 58.0, INSERT_VALUES);
>   VecAssemblyBegin(b);
>   VecAssemblyEnd(b);
> 
>   Vec x;
>   ierr = VecCreate(PETSC_COMM_WORLD, &x); CHKERRQ(ierr);
>   VecSetSizes(x, PETSC_DECIDE, numRows);
>   VecSetFromOptions(x);
>   VecSet(x, 0.0);
> 
>   MatMult(A, b, x);
>   PetscScalar result;
>   VecSum(x, &result);
>   PetscPrintf(PETSC_COMM_WORLD, "Result of mult: %f\n", result);
>   PetscFinalize();
> }
> ```
> 
> When this program is run on CPUs, it outputs 100.0, as expected.
> 
> When run on a single GPU with `-vec_type cuda -mat_type aijcusparse 
> -use_gpu_aware_mpi 0` it fails with
> ```
> [0]PETSC ERROR: - Error Message 
> --
> [0]PETSC ERROR: Argument out of range
> [0]PETSC ERROR: 4294967294 is too big for cuBLAS, which may be restricted to 
> 32-bit integers
> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> [0]PETSC ERROR: Petsc Release Version 3.19.4, unknown
> [0]PETSC ERROR: ./gpu-bug on a  named sean-dgx2 by rohany Fri Aug 11 09:34:10 
> 2023
> [0]PETSC ERROR: Configure options --with-cuda=1 
> --prefix=/local/home/rohany/petsc/petsc-install/ 
> --with-cuda-dir=/usr/local/cuda-11.7/ CXXFLAGS=-O3 COPTFLAGS=-O3 
> CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 --download-fblaslapack=1 --with-debugging=0 
> --with-64-bit-indices
> [0]PETSC ERROR: #1 checkCupmBlasIntCast() at 
> /local/home/rohany/petsc/include/petsc/private/cupmblasinterface.hpp:435
> [0]PETSC ERROR: #2 VecAllocateCheck_() at 
> /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:335
> [0]PETSC ERROR: #3 VecCUPMAllocateCheck_() at 
> /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:360
> [0]PETSC ERROR: #4 DeviceAllocateCheck_() at 
> /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:389
> [0]PETSC ERROR: #5 GetArray() at 
> /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:545
> [0]PETSC ERROR: #6 VectorArray() at 
> /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:273
> --
> MPI_ABORT was invoked on rank 0 in communicator 

[petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Rohan Yadav
Hi,

I was wondering what the official status of 64-bit integer support in the
PETSc GPU backend is (specifically CUDA). This question comes from the
result of benchmarking some PETSc code and looking at some sources. In
particular, I found that PETSc's call to cuSPARSE SpMV seems to always be
using the 32-bit integer call, even if I compile PETSc with
`--with-64-bit-indices`. After digging around more, I see that PETSc always
only creates 32-bit cuSPARSE matrices as well:
https://gitlab.com/petsc/petsc/-/blob/v3.19.4/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu?ref_type=tags#L2501.
I was looking around for a switch somewhere to 64 bit integers inside this
code, but everything seems to be pretty hardcoded with `THRUSTINTARRAY32`.
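
For context, the switch I was hoping to find is essentially the index-type
arguments of the generic cuSPARSE API. A rough sketch of what the 64-bit
variant of that call would look like (hypothetical device buffers, not PETSc's
actual code):

```c
#include <cusparse.h>
#include <stdint.h>

/* Sketch only: build a CSR descriptor with 64-bit row offsets and column
   indices, which is what a --with-64-bit-indices build would need once the
   local dimensions exceed INT32_MAX; the aijcusparse code appears to pass
   CUSPARSE_INDEX_32I here instead. */
cusparseStatus_t create_csr_64(cusparseSpMatDescr_t *mat,
                               int64_t rows, int64_t cols, int64_t nnz,
                               int64_t *d_rowptr, int64_t *d_colind,
                               double *d_vals)
{
  return cusparseCreateCsr(mat, rows, cols, nnz,
                           d_rowptr, d_colind, d_vals,
                           CUSPARSE_INDEX_64I,        /* row offsets type   */
                           CUSPARSE_INDEX_64I,        /* column index type  */
                           CUSPARSE_INDEX_BASE_ZERO,
                           CUDA_R_64F);               /* double-precision values */
}
```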

As expected, this all works when the range of coordinates in each sparse
matrix partition is less than INT_MAX, but PETSc GPU code breaks in
different ways (calling cuBLAS and cuSPARSE) when trying a (synthetic)
problem that needs 64 bit integers:

```
#include "petscmat.h"
#include "petscvec.h"
#include "petsc.h"

int main(int argc, char** argv) {
  PetscInt ierr;
  PetscInitialize(&argc, &argv, (char *)0, "GPU bug");

  PetscInt numRows = 1;
  PetscInt numCols = PetscInt(INT_MAX) * 2;

  Mat A;
  PetscInt rowStart, rowEnd;
  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, numRows, numCols);
  MatSetType(A, MATMPIAIJ);
  MatSetFromOptions(A);

  MatSetValue(A, 0, 0, 1.0, INSERT_VALUES);
  MatSetValue(A, 0, numCols - 1, 1.0, INSERT_VALUES);
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  Vec b;
  ierr = VecCreate(PETSC_COMM_WORLD, &b); CHKERRQ(ierr);
  VecSetSizes(b, PETSC_DECIDE, numCols);
  VecSetFromOptions(b);
  VecSet(b, 0.0);
  VecSetValue(b, 0, 42.0, INSERT_VALUES);
  VecSetValue(b, numCols - 1, 58.0, INSERT_VALUES);
  VecAssemblyBegin(b);
  VecAssemblyEnd(b);

  Vec x;
  ierr = VecCreate(PETSC_COMM_WORLD, &x); CHKERRQ(ierr);
  VecSetSizes(x, PETSC_DECIDE, numRows);
  VecSetFromOptions(x);
  VecSet(x, 0.0);

  MatMult(A, b, x);
  PetscScalar result;
  VecSum(x, &result);
  PetscPrintf(PETSC_COMM_WORLD, "Result of mult: %f\n", result);
  PetscFinalize();
}
```

When this program is run on CPUs, it outputs 100.0, as expected.

When run on a single GPU with `-vec_type cuda -mat_type aijcusparse
-use_gpu_aware_mpi 0` it fails with
```

[0]PETSC ERROR: - Error Message
--

[0]PETSC ERROR: Argument out of range

[0]PETSC ERROR: 4294967294 is too big for cuBLAS, which may be restricted
to 32-bit integers

[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.

[0]PETSC ERROR: Petsc Release Version 3.19.4, unknown

[0]PETSC ERROR: ./gpu-bug on a  named sean-dgx2 by rohany Fri Aug 11
09:34:10 2023

[0]PETSC ERROR: Configure options --with-cuda=1
--prefix=/local/home/rohany/petsc/petsc-install/
--with-cuda-dir=/usr/local/cuda-11.7/ CXXFLAGS=-O3 COPTFLAGS=-O3
CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 --download-fblaslapack=1 --with-debugging=0
--with-64-bit-indices

[0]PETSC ERROR: #1 checkCupmBlasIntCast() at
/local/home/rohany/petsc/include/petsc/private/cupmblasinterface.hpp:435

[0]PETSC ERROR: #2 VecAllocateCheck_() at
/local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:335

[0]PETSC ERROR: #3 VecCUPMAllocateCheck_() at
/local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:360

[0]PETSC ERROR: #4 DeviceAllocateCheck_() at
/local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:389

[0]PETSC ERROR: #5 GetArray() at
/local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:545

[0]PETSC ERROR: #6 VectorArray() at
/local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:273

--

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_SELF

with errorcode 63.


NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

You may or may not see output from other processes, depending on

exactly when Open MPI kills them.

--

```


and when run with just `-mat_type aijcusparse -use_gpu_aware_mpi 0` it
fails with

```

 ** On entry to cusparseCreateCsr(): dimension mismatch for
CUSPARSE_INDEX_32I, cols (4294967294) + base (0) > INT32_MAX (2147483647)


[0]PETSC ERROR: - Error Message
--

[0]PETSC ERROR: GPU error

[0]PETSC ERROR: cuSPARSE errorcode 3 (CUSPARSE_STATUS_INVALID_VALUE) :
invalid value

[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.

[0]PETSC ERROR: Petsc Release Version 3.19.4, unknown

[0]PETSC ERROR: ./gpu-bug on a  named sean-dgx2 by rohany Fri Aug 11
09:43:07 2023

[0]PETSC ERROR: Configure options --with-cuda=1
--prefix=/local/home/rohany/petsc/petsc-install/
--with-cuda-dir=/usr/local/cuda-11.7/ CXXFLAGS=-O3 COPTFLAGS=-O3