Re: [petsc-users] Using PETSc GPU backend

2023-08-11 Thread Ng, Cho-Kuen via petsc-users
Barry,

I tried again today on Perlmutter and running on multiple GPU nodes worked. 
Likely, I had messed up something the other day. Also, I was able to have 
multiple MPI tasks on a GPU using Nvidia MPS. The petsc output shows the number 
of MPI tasks:

KSP Object: 32 MPI processes

Can petsc show the number of GPUs used?

Thanks,
Cho
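[Outside of PETSc, running nvidia-smi on the compute node lists the processes attached to each GPU, which shows how many GPUs a job is actually using; newer PETSc versions may also be able to print the devices they initialized via the -device_view option, if your version supports it.]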


From: Barry Smith 
Sent: Wednesday, August 9, 2023 4:09 PM
To: Ng, Cho-Kuen 
Cc: petsc-users@mcs.anl.gov 
Subject: Re: [petsc-users] Using PETSc GPU backend


  We would need more information about "hanging". Do PETSc examples and tiny 
problems "hang" on multiple nodes? If you run with -info what are the last 
messages printed? Can you run with a debugger to see where it is "hanging"?



On Aug 9, 2023, at 5:59 PM, Ng, Cho-Kuen  wrote:

Barry and Matt,

Thanks for your help. Now I can use the petsc GPU backend on Perlmutter: 1 node, 4 
MPI tasks and 4 GPUs. However, I ran into problems with multiple nodes: 2 
nodes, 8 MPI tasks and 8 GPUs. The run hung on KSPSolve. How can I fix this?

Best,
Cho


From: Barry Smith <bsm...@petsc.dev>
Sent: Monday, July 17, 2023 6:58 AM
To: Ng, Cho-Kuen <c...@slac.stanford.edu>
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] Using PETSc GPU backend


 The examples that use DM, in particular DMDA, all trivially support using the 
GPU with -dm_mat_type aijcusparse -dm_vec_type cuda.
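A minimal sketch of that pattern with a 1D DMDA (the grid size, dof, and stencil 
width are illustrative, not from this message):

```c
#include <petscdmda.h>

int main(int argc, char **argv)
{
  DM             da;
  Mat            A;
  Vec            x;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  /* 1D distributed grid: 128 points, 1 dof per point, stencil width 1 */
  ierr = DMDACreate1d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, 128, 1, 1, NULL, &da);CHKERRQ(ierr);
  ierr = DMSetFromOptions(da);CHKERRQ(ierr);          /* honors -dm_mat_type / -dm_vec_type */
  ierr = DMSetUp(da);CHKERRQ(ierr);
  ierr = DMCreateMatrix(da, &A);CHKERRQ(ierr);        /* aijcusparse with -dm_mat_type aijcusparse */
  ierr = DMCreateGlobalVector(da, &x);CHKERRQ(ierr);  /* cuda with -dm_vec_type cuda */
  /* ... assemble and solve as usual ... */
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = DMDestroy(&da);CHKERRQ(ierr);
  return PetscFinalize();
}
```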



On Jul 17, 2023, at 1:45 AM, Ng, Cho-Kuen <c...@slac.stanford.edu> wrote:

Barry,

Thank you so much for the clarification.

I see that ex104.c and ex300.c use  MatXAIJSetPreallocation(). Are there other 
tutorials available?

Cho

From: Barry Smith <bsm...@petsc.dev>
Sent: Saturday, July 15, 2023 8:36 AM
To: Ng, Cho-Kuen <c...@slac.stanford.edu>
Cc: petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] Using PETSc GPU backend



   Cho,

We currently have a crappy API for turning on GPU support, and our 
documentation is misleading in places.

People constantly say "to use GPUs with PETSc you only need to use 
-mat_type aijcusparse (for example)." This is incorrect.

 This does not work with code that uses the convenience Mat constructors such 
as MatCreateAIJ(), MatCreateAIJWithArrays(), etc. It only works if you use the 
constructor approach of MatCreate(), MatSetSizes(), MatSetFromOptions(), 
MatXXXSetPreallocation(), ... Similarly, you need to use VecCreate(), 
VecSetSizes(), VecSetFromOptions() and -vec_type cuda.
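A sketch of that constructor sequence (using the mlocal/m/n/d_nz/o_nz variables 
from the code quoted further down; the two preallocation calls cover the MPI and 
sequential cases, and only the matching one takes effect):

```c
ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
ierr = MatSetSizes(A, mlocal, mlocal, m, n);CHKERRQ(ierr);
ierr = MatSetFromOptions(A);CHKERRQ(ierr);  /* picks up -mat_type aijcusparse */
ierr = MatMPIAIJSetPreallocation(A, d_nz, NULL, o_nz, NULL);CHKERRQ(ierr);
ierr = MatSeqAIJSetPreallocation(A, d_nz, NULL);CHKERRQ(ierr);

ierr = VecCreate(PETSC_COMM_WORLD, &b);CHKERRQ(ierr);
ierr = VecSetSizes(b, mlocal, m);CHKERRQ(ierr);
ierr = VecSetFromOptions(b);CHKERRQ(ierr);  /* picks up -vec_type cuda */
```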

   If you use DM to create the matrices and vectors then you can use 
-dm_mat_type aijcusparse -dm_vec_type cuda

   Sorry for the confusion.

   Barry




On Jul 15, 2023, at 8:03 AM, Matthew Knepley <knep...@gmail.com> wrote:

On Sat, Jul 15, 2023 at 1:44 AM Ng, Cho-Kuen <c...@slac.stanford.edu> wrote:
Matt,

After inserting 2 lines in the code:

  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatCreateAIJ(PETSC_COMM_WORLD,mlocal,mlocal,m,n,
  d_nz,PETSC_NULL,o_nz,PETSC_NULL,&A);CHKERRQ(ierr);

"There are no unused options." However, there is no improvement on the GPU 
performance.

1. MatCreateAIJ() sets the type, and in fact it overwrites the Mat you created 
in steps 1 and 2. This is detailed in the manual.

2. You should replace MatCreateAIJ() with MatSetSizes() placed before 
MatSetFromOptions().

  Thanks,

Matt

Thanks,
Cho


From: Matthew Knepley <knep...@gmail.com>
Sent: Friday, July 14, 2023 5:57 PM
To: Ng, Cho-Kuen <c...@slac.stanford.edu>
Cc: Barry Smith <bsm...@petsc.dev>; Mark Adams <mfad...@lbl.gov>; petsc-users@mcs.anl.gov
Subject: Re: [petsc-users] Using PETSc GPU backend

On Fri, Jul 14, 2023 at 7:57 PM Ng, Cho-Kuen <c...@slac.stanford.edu> wrote:
I managed to pass the following options to PETSc using a GPU node on Perlmutter.

-mat_type aijcusparse -vec_type cuda -log_view -options_left

Below is a summary of the test using 4 MPI tasks and 1 GPU per task.

o #PETSc Option Table entries:
   -log_view
   -mat_type aijcusparse
   -options_left
   -vec_type cuda
   #End of PETSc Option Table entries
   WARNING! There are options you set that were not used!
   WARNING! could be spelling mistake, etc!
   There is one unused database option. It is:
   Option left: name:-mat_type value: aijcusparse

The -mat_type option has not been used. In the application code, we use

ierr = MatCreateAIJ(PETSC_COMM_WORLD,mlocal,mlocal,m,n,
 d_nz,PETSC_NULL,o_nz,PETSC_NULL,&A);CHKERRQ(ierr);


If you create the Mat this way, then you need MatSetFromOptions() in order to 
set the type from the command line.

  Thanks,

 Matt

o The 

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Junchao Zhang
Before digging into the details, could you try to run
src/ksp/ksp/tests/ex60.c to make sure the environment is OK?

The comment at the end shows how to run it:
  test:
    requires: cuda
    suffix: 1_cuda
    nsize: 4
    args: -ksp_view -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse
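Spelled out as a command line (assuming a CUDA-enabled build; run from
src/ksp/ksp/tests after building ex60):

mpiexec -n 4 ./ex60 -ksp_view -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse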

--Junchao Zhang


On Fri, Aug 11, 2023 at 4:36 PM Vanella, Marcos (Fed) <
marcos.vane...@nist.gov> wrote:

> Hi Junchao, thank you for the info. I compiled the main branch of PETSc on
> another machine that has the openmpi/4.1.4/gcc-11.2.1-cuda-11.7 toolchain
> and don't see the Fortran compilation error. It might have been related to
> gcc-9.3.
> I tried the case again, with 2 CPUs and one GPU, and now get this error:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what():  parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
>   what():  parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> #0  0x2000397fcd8f in ???
> #1  0x2000397fb657 in ???
> #0  0x2000397fcd8f in ???
> #1  0x2000397fb657 in ???
> #2  0x200604d7 in ???
> #2  0x200604d7 in ???
> #3  0x200039cb9628 in ???
> #4  0x200039c93eb3 in ???
> #5  0x200039364a97 in ???
> #6  0x20003935f6d3 in ???
> #7  0x20003935f78f in ???
> #8  0x20003935fc6b in ???
> #3  0x200039cb9628 in ???
> #4  0x200039c93eb3 in ???
> #5  0x200039364a97 in ???
> #6  0x20003935f6d3 in ???
> #7  0x20003935f78f in ???
> #8  0x20003935fc6b in ???
> #9  0x11ec425b in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10  0x11ec425b in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> #9  0x11ec425b in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10  0x11ec425b in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11  0x11efa263 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11  0x11efa263 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #12  0x11efa263 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> #13  0x11efa263 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #12  0x11efa263 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> #13  0x11efa263 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> #14  0x11efa263 in
> _ZN6thrust6detail18contiguous_storageIiNS_16device_allocatorIiEEE19default_construct_nENS0_15normal_iteratorINS_10device_ptrIim
> at /usr/local/cuda-11.7/include/thrust/detail/contiguous_storage.inl:254
> #15  0x11efa263 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:220
> #14  0x11efa263 in
> _ZN6thrust6detail18contiguous_storageIiNS_16device_allocatorIiEEE19default_construct_nENS0_15normal_iteratorINS_10device_ptrIim
> at /usr/local/cuda-11.7/include/thrust/detail/contiguous_storage.inl:254
> #15  0x11efa263 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:220
> #16  0x11efa263 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at 

Re: [petsc-users] error related to 'valgrind' when using MatView

2023-08-11 Thread Barry Smith

  New error checking to prevent this confusion in the future: 
https://gitlab.com/petsc/petsc/-/merge_requests/6804


> On Aug 10, 2023, at 6:54 AM, Matthew Knepley  wrote:
> 
> On Thu, Aug 10, 2023 at 2:30 AM maitri ksh wrote:
>> I am unable to understand what possibly went wrong with my code. I could 
>> load a matrix (a large sparse matrix) into petsc, write it out and read it 
>> back into Matlab, but when I tried to use MatView to see the matrix info, it 
>> produced an error of some 'corrupt argument, #valgrind'. Can anyone please help?
> 
> You use
> 
>   viewer = PETSC_VIEWER_STDOUT_WORLD
> 
> but then you Destroy() that viewer. You should not, since you did not create 
> it.
> 
>   Thanks,
> 
>  Matt
>  
>> Maitri
> 
> 
> -- 
> What most experimenters take for granted before they begin their experiments 
> is infinitely more interesting than any results to which their experiments 
> lead.
> -- Norbert Wiener
> 
> https://www.cse.buffalo.edu/~knepley/ 
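A sketch of the two viewer lifetimes involved (the output filename is illustrative):

```c
/* Built-in viewer: owned by PETSc, do not destroy it */
ierr = MatView(A, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);

/* A viewer you create yourself: you own it, so you destroy it */
PetscViewer viewer;
ierr = PetscViewerASCIIOpen(PETSC_COMM_WORLD, "mat.out", &viewer);CHKERRQ(ierr);
ierr = MatView(A, viewer);CHKERRQ(ierr);
ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
```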



Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Satish Balay via petsc-users
On Fri, 11 Aug 2023, Jed Brown wrote:

> Jacob Faibussowitsch  writes:
> 
> > More generally, it would be interesting to know the breakdown of installed 
> > CUDA versions for users. Unlike compilers etc, I suspect that cluster 
> > admins (and those running on local machines) are much more likely to be 
> > updating their CUDA toolkits to the latest versions as they often contain 
> > critical performance improvements.
> 
> One difference is that some sites (not looking at you at all, ALCF) still run 
> pretty ancient drivers and/or have broken GPU-aware MPI with all but a 
> specific ancient version of CUDA (OLCF, LLNL). With a normal compiler, you 
> can choose to use the latest version, but with CUDA, people are firmly stuck 
> on old versions.
> 

Well, Nvidia keeps phasing out support for older GPUs in newer CUDA releases, 
so unless the GPUs are upgraded, sites can't really upgrade to the latest CUDA 
versions...

[This is in addition to the usual reasons admins don't do software upgrades... 
Never mind clusters: our CUDA CI machine has random stability issues, so we had 
to downgrade/freeze the CUDA/driver versions to keep the machine functional.]

Satish



Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Jed Brown
Jacob Faibussowitsch  writes:

> More generally, it would be interesting to know the breakdown of installed 
> CUDA versions for users. Unlike compilers etc, I suspect that cluster admins 
> (and those running on local machines) are much more likely to be updating 
> their CUDA toolkits to the latest versions as they often contain critical 
> performance improvements.

One difference is that some sites (not looking at you at all, ALCF) still run 
pretty ancient drivers and/or have broken GPU-aware MPI with all but a specific 
ancient version of CUDA (OLCF, LLNL). With a normal compiler, you can choose to 
use the latest version, but with CUDA, people are firmly stuck on old versions.


Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Jacob Faibussowitsch
> We should support it, but it still seems hypothetical and not urgent.

FWIW, cuBLAS only just added 64-bit int support with CUDA 12 (naturally, with a 
completely separate API). 

More generally, it would be interesting to know the breakdown of installed CUDA 
versions for users. Unlike compilers etc, I suspect that cluster admins (and 
those running on local machines) are much more likely to be updating their CUDA 
toolkits to the latest versions as they often contain critical performance 
improvements.

It would help us decide on the minimum version to support. We don't have any 
real idea of the current minimum version; last time it was estimated to be 
CUDA 7, IIRC.

Best regards,

Jacob Faibussowitsch
(Jacob Fai - booss - oh - vitch)

> On Aug 11, 2023, at 15:38, Jed Brown  wrote:
> 
> Rohan Yadav  writes:
> 
>> With modern GPU sizes, for example A100's with 80GB of memory, a vector of
>> length 2^31 is not that much memory -- one could conceivably run a CG solve
>> with local vectors > 2^31.
> 
> Yeah, each vector would be 8 GB (single precision) or 16 GB (double). You 
> can't store a matrix of this size, and probably not a "mesh", but it's 
> possible to create such a problem if everything is matrix-free (possibly with 
> matrix-free geometric multigrid). This is more likely to show up in a 
> benchmark than any real science or engineering problem. We should support it, 
> but it still seems hypothetical and not urgent.
> 
>> Thanks Junchao, I might look into that. However, I currently am not trying
>> to solve such a large problem -- these questions just came from wondering
>> why the cuSPARSE kernel PETSc was calling was running faster than mine.
> 
> Hah, bandwidth doesn't lie. ;-)



Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Jed Brown
Rohan Yadav  writes:

> With modern GPU sizes, for example A100's with 80GB of memory, a vector of
> length 2^31 is not that much memory -- one could conceivably run a CG solve
> with local vectors > 2^31.

Yeah, each vector would be 8 GB (single precision) or 16 GB (double). You can't 
store a matrix of this size, and probably not a "mesh", but it's possible to 
create such a problem if everything is matrix-free (possibly with matrix-free 
geometric multigrid). This is more likely to show up in a benchmark than any 
real science or engineering problem. We should support it, but it still seems 
hypothetical and not urgent.

> Thanks Junchao, I might look into that. However, I currently am not trying
> to solve such a large problem -- these questions just came from wondering
> why the cuSPARSE kernel PETSc was calling was running faster than mine.

Hah, bandwidth doesn't lie. ;-)


Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Junchao Zhang
Marcos,
  We do not have good petsc/gpu documentation, but see
https://petsc.org/main/faq/#doc-faq-gpuhowto, and also search "requires:
cuda" in petsc tests and you will find examples using GPU.
  For the Fortran compile errors, attach your configure.log and Satish
(Cc'ed) or others should know how to fix them.

  Thanks.
--Junchao Zhang


On Fri, Aug 11, 2023 at 2:22 PM Vanella, Marcos (Fed) <
marcos.vane...@nist.gov> wrote:

> Hi Junchao, thanks for the explanation. Is there some development
> documentation on the GPU work? I'm interested in learning about it.
> I checked out the main branch and configured petsc. When compiling with
> gcc/gfortran I come across this error:
>
> 
>   CUDAC
> arch-linux-c-opt/obj/src/mat/impls/aij/seq/seqcusparse/aijcusparse.o
>   CUDAC.dep
> arch-linux-c-opt/obj/src/mat/impls/aij/seq/seqcusparse/aijcusparse.o
>  FC arch-linux-c-opt/obj/src/ksp/f90-mod/petsckspdefmod.o
>  FC arch-linux-c-opt/obj/src/ksp/f90-mod/petscpcmod.o
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:37:61:
>
>37 |   subroutine PCASMCreateSubdomains2D(a,b,c,d,e,f,g,h,i,z)
>   | 1
> Error: Symbol ‘pcasmcreatesubdomains2d’ at (1) already has an explicit
> interface
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:38:13:
>
>38 |import tIS
>   | 1
> Error: IMPORT statement at (1) only permitted in an INTERFACE body
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:39:80:
>
>39 |PetscInt a ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:40:80:
>
>40 |PetscInt b ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:41:80:
>
>41 |PetscInt c ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:42:80:
>
>42 |PetscInt d ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:43:80:
>
>43 |PetscInt e ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:44:80:
>
>44 |PetscInt f ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:45:80:
>
>45 |PetscInt g ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:46:30:
>
>46 |IS h ! IS
>   |  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:47:30:
>
>47 |IS i ! IS
>   |  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:48:43:
>
>48 |PetscErrorCode z
>   |   1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:49:10:
>
>49 |end subroutine PCASMCreateSubdomains2D
>   |  1
> Error: Expecting END INTERFACE statement at (1)
> make[3]: *** [gmakefile:225:
> arch-linux-c-opt/obj/src/ksp/f90-mod/petscpcmod.o] Error 1
> make[3]: *** Waiting for unfinished jobs
>  CC
> arch-linux-c-opt/obj/src/tao/leastsquares/impls/pounders/pounders.o
>  CC arch-linux-c-opt/obj/src/ksp/pc/impls/bddc/bddcprivate.o
>   CUDAC
> arch-linux-c-opt/obj/src/vec/vec/impls/seq/cupm/cuda/vecseqcupm.o
>   CUDAC.dep
> arch-linux-c-opt/obj/src/vec/vec/impls/seq/cupm/cuda/vecseqcupm.o
> make[3]: Leaving directory '/home/mnv/Software/petsc'
> make[2]: *** [/home/mnv/Software/petsc/lib/petsc/conf/rules.doc:28: libs]
> Error 2
> make[2]: Leaving directory '/home/mnv/Software/petsc'
> **************************ERROR*************************
>   Error during compile, check arch-linux-c-opt/lib/petsc/conf/make.log
>   Send it and 

Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Rohan Yadav
>We do not currently have any code for using 64 bit integer sizes on
the GPUs.

Thank you, just wanted confirmation.

>Given the current memory available on GPUs is 64 bit integer support
needed? I think even a single vector of length 2^31 will use up most of the
GPU's memory? Are the practical, not synthetic, situations that require 64
bit integer support on GPUs immediately?  For example, is the vector length
of the entire parallel vector across all GPUs limited to 32 bits?

With modern GPU sizes, for example A100's with 80GB of memory, a vector of
length 2^31 is not that much memory -- one could conceivably run a CG solve
with local vectors > 2^31.

Thanks Junchao, I might look into that. However, I currently am not trying
to solve such a large problem -- these questions just came from wondering
why the cuSPARSE kernel PETSc was calling was running faster than mine.

Rohan


Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Junchao Zhang
Rohan,
  You could try the petsc/kokkos backend.  I have not tested it, but I
guess it should handle 64 bit CUDA index types.
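[As a hedged example of trying that backend: with a PETSc built using the
configure options --download-kokkos --download-kokkos-kernels, one would run
with -vec_type kokkos -mat_type aijkokkos in place of the cuda/aijcusparse
types.]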
  I guess the petsc/cuda 32-bit limit came from old CUDA versions where
only 32-bit indices were supported such that the original developers
hardwired the type to THRUSTINTARRAY32.  We try to support generations of
cuda toolkits and thus have the current code.

  Anyway, this should be fixed.
--Junchao Zhang


On Fri, Aug 11, 2023 at 1:07 PM Barry Smith  wrote:

>
>We do not currently have any code for using 64 bit integer sizes on the
> GPUs.
>
>Given the current memory available on GPUs is 64 bit integer support
> needed? I think even a single vector of length 2^31 will use up most of the
> GPU's memory? Are the practical, not synthetic, situations that require 64
> bit integer support on GPUs immediately?  For example, is the vector length
> of the entire parallel vector across all GPUs limited to 32 bits?
>
>We will certainly add such support, but it is a question of priorities;
> there are many things we need to do to improve PETSc GPU support, and they
> take time. Unless we have practical use cases, 64 bit integer support for
> integer sizes on the GPU is not at the top of the list. Of course, we would
> be very happy with a merge request that would provide this support at any
> time.
>
>   Barry
>
>
>
> On Aug 11, 2023, at 1:23 PM, Rohan Yadav  wrote:
>
> Hi,
>
> I was wondering what the official status of 64-bit integer support in the
> PETSc GPU backend is (specifically CUDA). This question comes from the
> result of benchmarking some PETSc code and looking at some sources. In
> particular, I found that PETSc's call to cuSPARSE SpMV seems to always be
> using the 32-bit integer call, even if I compile PETSc with
> `--with-64-bit-indices`. After digging around more, I see that PETSc always
> only creates 32-bit cuSPARSE matrices as well:
> https://gitlab.com/petsc/petsc/-/blob/v3.19.4/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu?ref_type=tags#L2501.
> I was looking around for a switch somewhere to 64 bit integers inside this
> code, but everything seems to be pretty hardcoded with `THRUSTINTARRAY32`.
>
> As expected, this all works when the range of coordinates in each sparse
> matrix partition is less than INT_MAX, but PETSc GPU code breaks in
> different ways (calling cuBLAS and cuSPARSE) when trying a (synthetic)
> problem that needs 64 bit integers:
>
> ```
> #include "petscmat.h"
> #include "petscvec.h"
> #include "petsc.h"
>
> int main(int argc, char** argv) {
>   PetscInt ierr;
>   PetscInitialize(&argc, &argv, (char *)0, "GPU bug");
>
>   PetscInt numRows = 1;
>   PetscInt numCols = PetscInt(INT_MAX) * 2;
>
>   Mat A;
>   PetscInt rowStart, rowEnd;
>   ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
>   MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, numRows, numCols);
>   MatSetType(A, MATMPIAIJ);
>   MatSetFromOptions(A);
>
>   MatSetValue(A, 0, 0, 1.0, INSERT_VALUES);
>   MatSetValue(A, 0, numCols - 1, 1.0, INSERT_VALUES);
>   MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>   MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
>
>   Vec b;
>   ierr = VecCreate(PETSC_COMM_WORLD, &b); CHKERRQ(ierr);
>   VecSetSizes(b, PETSC_DECIDE, numCols);
>   VecSetFromOptions(b);
>   VecSet(b, 0.0);
>   VecSetValue(b, 0, 42.0, INSERT_VALUES);
>   VecSetValue(b, numCols - 1, 58.0, INSERT_VALUES);
>   VecAssemblyBegin(b);
>   VecAssemblyEnd(b);
>
>   Vec x;
>   ierr = VecCreate(PETSC_COMM_WORLD, &x); CHKERRQ(ierr);
>   VecSetSizes(x, PETSC_DECIDE, numRows);
>   VecSetFromOptions(x);
>   VecSet(x, 0.0);
>
>   MatMult(A, b, x);
>   PetscScalar result;
>   VecSum(x, &result);
>   PetscPrintf(PETSC_COMM_WORLD, "Result of mult: %f\n", result);
>   PetscFinalize();
> }
> ```
>
> When this program is run on CPUs, it outputs 100.0, as expected.
>
> When run on a single GPU with `-vec_type cuda -mat_type aijcusparse
> -use_gpu_aware_mpi 0` it fails with
> ```
> [0]PETSC ERROR: - Error Message
> --
> [0]PETSC ERROR: Argument out of range
> [0]PETSC ERROR: 4294967294 is too big for cuBLAS, which may be restricted
> to 32-bit integers
> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> [0]PETSC ERROR: Petsc Release Version 3.19.4, unknown
> [0]PETSC ERROR: ./gpu-bug on a  named sean-dgx2 by rohany Fri Aug 11
> 09:34:10 2023
> [0]PETSC ERROR: Configure options --with-cuda=1
> --prefix=/local/home/rohany/petsc/petsc-install/
> --with-cuda-dir=/usr/local/cuda-11.7/ CXXFLAGS=-O3 COPTFLAGS=-O3
> CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 --download-fblaslapack=1 --with-debugging=0
> --with-64-bit-indices
> [0]PETSC ERROR: #1 checkCupmBlasIntCast() at
> /local/home/rohany/petsc/include/petsc/private/cupmblasinterface.hpp:435
> [0]PETSC ERROR: #2 VecAllocateCheck_() at
> /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:335
> [0]PETSC ERROR: #3 VecCUPMAllocateCheck_() at
> 

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Junchao Zhang
Hi, Marcos,
  I saw MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic() in the error stack.
We recently refactored the COO code and got rid of that function.  So could
you try petsc/main?
  We map MPI processes to GPUs in a round-robin fashion. We query the
number of visible CUDA devices (g), and assign the device (rank%g) to the
MPI process (rank). In that sense, the work distribution is totally
determined by your MPI work partition (i.e., yourself).
  On clusters, this MPI process to GPU binding is usually done by the job
scheduler like slurm.  You need to check your cluster's users' guide to see
how to bind MPI processes to GPUs. If the job scheduler has done that, the
number of visible CUDA devices to a process might just appear to be 1,
making petsc's own mapping void.
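The mapping described above amounts to something like this sketch (CUDA runtime
API plus MPI; error checking omitted):

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Round-robin assignment as described: rank r takes device r % g,
   where g is the number of CUDA devices visible to this process. */
int rank, ndev;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
cudaGetDeviceCount(&ndev);
cudaSetDevice(rank % ndev);
```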

   Thanks.
--Junchao Zhang


On Fri, Aug 11, 2023 at 12:43 PM Vanella, Marcos (Fed) <
marcos.vane...@nist.gov> wrote:

> Hi Junchao, thank you for replying. I compiled petsc in debug mode and
> this is what I get for the case:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an
> illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> #0  0x15264731ead0 in ???
> #1  0x15264731dc35 in ???
> #2  0x15264711551f in ???
> #3  0x152647169a7c in ???
> #4  0x152647115475 in ???
> #5  0x1526470fb7f2 in ???
> #6  0x152647678bbd in ???
> #7  0x15264768424b in ???
> #8  0x1526476842b6 in ???
> #9  0x152647684517 in ???
> #10  0x55bb46342ebb in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:224
> #11  0x55bb46342ebb in
> _ZN6thrust8cuda_cub12__merge_sort10merge_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_NS3_15normal_iteratorISB_EE9IJCompareEEvRNS0_16execution_policyIT1_EET2_SM_T3_T4_
> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1316
> #12  0x55bb46342ebb in
> _ZN6thrust8cuda_cub12__smart_sort10smart_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_16execution_policyINS0_3tagEEENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESD_NS_9null_typeESE_SE_SE_SE_SE_SE_SE_NS3_15normal_iteratorISD_EE9IJCompareEENS1_25enable_if_comparison_sortIT2_T4_E4typeERT1_SL_SL_T3_SM_
> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1544
> #13  0x55bb46342ebb in
> _ZN6thrust8cuda_cub11sort_by_keyINS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_NS_6detail15normal_iteratorIS6_EE9IJCompareEEvRNS0_16execution_policyIT_EET0_SI_T1_T2_
> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1669
> #14  0x55bb46317bc5 in
> _ZN6thrust11sort_by_keyINS_8cuda_cub3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_NS_6detail15normal_iteratorIS6_EE9IJCompareEEvRKNSA_21execution_policy_baseIT_EET0_SJ_T1_T2_
> at /usr/local/cuda/include/thrust/detail/sort.inl:115
> #15  0x55bb46317bc5 in
> _ZN6thrust11sort_by_keyINS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES4_NS_9null_typeES5_S5_S5_S5_S5_S5_S5_NS_6detail15normal_iteratorIS4_EE9IJCompareEEvT_SC_T0_T1_
> at /usr/local/cuda/include/thrust/detail/sort.inl:305
> #16  0x55bb46317bc5 in MatSetPreallocationCOO_SeqAIJCUSPARSE_Basic
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu:4452
> #17  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/
> mpiaijcusparse.cu:173
> #18  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/
> mpiaijcusparse.cu:222
> #19  0x55bb468e01cf in MatSetPreallocationCOO
> at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:606
> #20  0x55bb46b39c9b in MatProductSymbolic_MPIAIJBACKEND
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7547
> #21  0x55bb469015e5 in MatProductSymbolic
> at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:803
> #22  0x55bb4694ade2 in MatPtAP
> at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9897
> #23  0x55bb4696d3ec in MatCoarsenApply_MISK_private
> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
> #24  0x55bb4696eb67 in MatCoarsenApply_MISK
> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
> #25  0x55bb4695bd91 in MatCoarsenApply
> at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
> #26  0x55bb478294d8 in PCGAMGCoarsen_AGG
> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
> #27  0x55bb471d1cb4 in PCSetUp_GAMG
> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
> #28  0x55bb464022cf in PCSetUp
> at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:994
> #29  0x55bb4718b8a7 in KSPSetUp
> at 

Re: [petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Barry Smith

   We do not currently have any code for using 64 bit integer sizes on the 
GPUs. 

   Given the current memory available on GPUs is 64 bit integer support needed? 
I think even a single vector of length 2^31 will use up most of the GPU's 
memory? Are the practical, not synthetic, situations that require 64 bit 
integer support on GPUs immediately?  For example, is the vector length of the 
entire parallel vector across all GPUs limited to 32 bits? 

   We will certainly add such support, but it is a question of priorities; 
there are many things we need to do to improve PETSc GPU support, and they take 
time. Unless we have practical use cases, 64 bit integer support for integer 
sizes on the GPU is not at the top of the list. Of course, we would be very 
happy with a merge request that would provide this support at any time.

  Barry



> On Aug 11, 2023, at 1:23 PM, Rohan Yadav  wrote:
> 
> Hi,
> 
> I was wondering what the official status of 64-bit integer support in the 
> PETSc GPU backend is (specifically CUDA). This question comes from the result 
> of benchmarking some PETSc code and looking at some sources. In particular, I 
> found that PETSc's call to cuSPARSE SpMV seems to always be using the 32-bit 
> integer call, even if I compile PETSc with `--with-64-bit-indices`. After 
> digging around more, I see that PETSc always only creates 32-bit cuSPARSE 
> matrices as well: 
> https://gitlab.com/petsc/petsc/-/blob/v3.19.4/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu?ref_type=tags#L2501.
>  I was looking around for a switch somewhere to 64 bit integers inside this 
> code, but everything seems to be pretty hardcoded with `THRUSTINTARRAY32`.
> 
> As expected, this all works when the range of coordinates in each sparse 
> matrix partition is less than INT_MAX, but PETSc GPU code breaks in different 
> ways (calling cuBLAS and cuSPARSE) when trying a (synthetic) problem that 
> needs 64 bit integers:
> 
> ```
> #include "petscmat.h"
> #include "petscvec.h"
> #include "petsc.h"
> 
> int main(int argc, char** argv) {
>   PetscInt ierr;
>   PetscInitialize(&argc, &argv, (char *)0, "GPU bug");
> 
>   PetscInt numRows = 1;
>   PetscInt numCols = PetscInt(INT_MAX) * 2;
> 
>   Mat A;
>   PetscInt rowStart, rowEnd;
>   ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
>   MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, numRows, numCols);
>   MatSetType(A, MATMPIAIJ);
>   MatSetFromOptions(A);
> 
>   MatSetValue(A, 0, 0, 1.0, INSERT_VALUES);
>   MatSetValue(A, 0, numCols - 1, 1.0, INSERT_VALUES);
>   MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
>   MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
> 
>   Vec b;
>   ierr = VecCreate(PETSC_COMM_WORLD, &b); CHKERRQ(ierr);
>   VecSetSizes(b, PETSC_DECIDE, numCols);
>   VecSetFromOptions(b);
>   VecSet(b, 0.0);
>   VecSetValue(b, 0, 42.0, INSERT_VALUES);
>   VecSetValue(b, numCols - 1, 58.0, INSERT_VALUES);
>   VecAssemblyBegin(b);
>   VecAssemblyEnd(b);
> 
>   Vec x;
>   ierr = VecCreate(PETSC_COMM_WORLD, &x); CHKERRQ(ierr);
>   VecSetSizes(x, PETSC_DECIDE, numRows);
>   VecSetFromOptions(x);
>   VecSet(x, 0.0);
> 
>   MatMult(A, b, x);
>   PetscScalar result;
>   VecSum(x, &result);
>   PetscPrintf(PETSC_COMM_WORLD, "Result of mult: %f\n", result);
>   PetscFinalize();
> }
> ```
> 
> When this program is run on CPUs, it outputs 100.0, as expected.
> 
> When run on a single GPU with `-vec_type cuda -mat_type aijcusparse 
> -use_gpu_aware_mpi 0` it fails with
> ```
> [0]PETSC ERROR: - Error Message 
> --
> [0]PETSC ERROR: Argument out of range
> [0]PETSC ERROR: 4294967294 is too big for cuBLAS, which may be restricted to 
> 32-bit integers
> [0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
> [0]PETSC ERROR: Petsc Release Version 3.19.4, unknown
> [0]PETSC ERROR: ./gpu-bug on a  named sean-dgx2 by rohany Fri Aug 11 09:34:10 
> 2023
> [0]PETSC ERROR: Configure options --with-cuda=1 
> --prefix=/local/home/rohany/petsc/petsc-install/ 
> --with-cuda-dir=/usr/local/cuda-11.7/ CXXFLAGS=-O3 COPTFLAGS=-O3 
> CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 --download-fblaslapack=1 --with-debugging=0 
> --with-64-bit-indices
> [0]PETSC ERROR: #1 checkCupmBlasIntCast() at 
> /local/home/rohany/petsc/include/petsc/private/cupmblasinterface.hpp:435
> [0]PETSC ERROR: #2 VecAllocateCheck_() at 
> /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:335
> [0]PETSC ERROR: #3 VecCUPMAllocateCheck_() at 
> /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:360
> [0]PETSC ERROR: #4 DeviceAllocateCheck_() at 
> /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:389
> [0]PETSC ERROR: #5 GetArray() at 
> /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:545
> [0]PETSC ERROR: #6 VectorArray() at 
> /local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:273
> --
> MPI_ABORT was invoked on rank 0 in communicator 

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Vanella, Marcos (Fed) via petsc-users
Hi Junchao, thank you for replying. I compiled petsc in debug mode and this is 
what I get for the case:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an 
illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x15264731ead0 in ???
#1  0x15264731dc35 in ???
#2  0x15264711551f in ???
#3  0x152647169a7c in ???
#4  0x152647115475 in ???
#5  0x1526470fb7f2 in ???
#6  0x152647678bbd in ???
#7  0x15264768424b in ???
#8  0x1526476842b6 in ???
#9  0x152647684517 in ???
#10  0x55bb46342ebb in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
  at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:224
#11  0x55bb46342ebb in 
_ZN6thrust8cuda_cub12__merge_sort10merge_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_NS3_15normal_iteratorISB_EE9IJCompareEEvRNS0_16execution_policyIT1_EET2_SM_T3_T4_
  at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1316
#12  0x55bb46342ebb in 
_ZN6thrust8cuda_cub12__smart_sort10smart_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_16execution_policyINS0_3tagEEENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESD_NS_9null_typeESE_SE_SE_SE_SE_SE_SE_NS3_15normal_iteratorISD_EE9IJCompareEENS1_25enable_if_comparison_sortIT2_T4_E4typeERT1_SL_SL_T3_SM_
  at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1544
#13  0x55bb46342ebb in 
_ZN6thrust8cuda_cub11sort_by_keyINS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_NS_6detail15normal_iteratorIS6_EE9IJCompareEEvRNS0_16execution_policyIT_EET0_SI_T1_T2_
  at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1669
#14  0x55bb46317bc5 in 
_ZN6thrust11sort_by_keyINS_8cuda_cub3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_NS_6detail15normal_iteratorIS6_EE9IJCompareEEvRKNSA_21execution_policy_baseIT_EET0_SJ_T1_T2_
  at /usr/local/cuda/include/thrust/detail/sort.inl:115
#15  0x55bb46317bc5 in 
_ZN6thrust11sort_by_keyINS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES4_NS_9null_typeES5_S5_S5_S5_S5_S5_S5_NS_6detail15normal_iteratorIS4_EE9IJCompareEEvT_SC_T0_T1_
  at /usr/local/cuda/include/thrust/detail/sort.inl:305
#16  0x55bb46317bc5 in MatSetPreallocationCOO_SeqAIJCUSPARSE_Basic
  at 
/home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu:4452
#17  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic
  at 
/home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:173
#18  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE
  at 
/home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:222
#19  0x55bb468e01cf in MatSetPreallocationCOO
  at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:606
#20  0x55bb46b39c9b in MatProductSymbolic_MPIAIJBACKEND
  at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7547
#21  0x55bb469015e5 in MatProductSymbolic
  at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:803
#22  0x55bb4694ade2 in MatPtAP
  at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9897
#23  0x55bb4696d3ec in MatCoarsenApply_MISK_private
  at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
#24  0x55bb4696eb67 in MatCoarsenApply_MISK
  at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
#25  0x55bb4695bd91 in MatCoarsenApply
  at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
#26  0x55bb478294d8 in PCGAMGCoarsen_AGG
  at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
#27  0x55bb471d1cb4 in PCSetUp_GAMG
  at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
#28  0x55bb464022cf in PCSetUp
  at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:994
#29  0x55bb4718b8a7 in KSPSetUp
  at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:406
#30  0x55bb4718f22e in KSPSolve_Private
  at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:824
#31  0x55bb47192c0c in KSPSolve
  at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1070
#32  0x55bb463efd35 in kspsolve_
  at /home/mnv/Software/petsc/src/ksp/ksp/interface/ftn-auto/itfuncf.c:320
#33  0x55bb45e94b32 in ???
#34  0x55bb46048044 in ???
#35  0x55bb46052ea1 in ???
#36  0x55bb45ac5f8e in ???
#37  0x1526470fcd8f in ???
#38  0x1526470fce3f in ???
#39  0x55bb45aef55d in ???
#40  0x in ???
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--

[petsc-users] 32-bit vs 64-bit GPU support

2023-08-11 Thread Rohan Yadav
Hi,

I was wondering what the official status of 64-bit integer support in the
PETSc GPU backend is (specifically CUDA). This question comes from the
result of benchmarking some PETSc code and looking at some sources. In
particular, I found that PETSc's call to cuSPARSE SpMV seems to always be
using the 32-bit integer call, even if I compile PETSc with
`--with-64-bit-indices`. After digging around more, I see that PETSc always
only creates 32-bit cuSPARSE matrices as well:
https://gitlab.com/petsc/petsc/-/blob/v3.19.4/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu?ref_type=tags#L2501.
I was looking around for a switch somewhere to 64 bit integers inside this
code, but everything seems to be pretty hardcoded with `THRUSTINTARRAY32`.
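For reference, the cuSPARSE generic API itself can describe 64-bit indices; a
hedged sketch of creating such a descriptor (the device arrays d_rowptr,
d_colind, d_vals and the int64_t sizes nrows/ncols/nnz are assumed to exist):

```c
#include <cusparse.h>

/* Sketch: CSR descriptor with 64-bit row offsets and column indices,
   in contrast to the CUSPARSE_INDEX_32I descriptors PETSc creates today. */
cusparseSpMatDescr_t matA;
cusparseCreateCsr(&matA, nrows, ncols, nnz,
                  d_rowptr, d_colind, d_vals,
                  CUSPARSE_INDEX_64I,          /* csrRowOffsetsType */
                  CUSPARSE_INDEX_64I,          /* csrColIndType */
                  CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F);
```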

As expected, this all works when the range of coordinates in each sparse
matrix partition is less than INT_MAX, but PETSc GPU code breaks in
different ways (calling cuBLAS and cuSPARSE) when trying a (synthetic)
problem that needs 64 bit integers:

```
#include "petscmat.h"
#include "petscvec.h"
#include "petsc.h"

int main(int argc, char** argv) {
  PetscInt ierr;
  PetscInitialize(&argc, &argv, (char *)0, "GPU bug");

  PetscInt numRows = 1;
  PetscInt numCols = PetscInt(INT_MAX) * 2;

  Mat A;
  PetscInt rowStart, rowEnd;
  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, numRows, numCols);
  MatSetType(A, MATMPIAIJ);
  MatSetFromOptions(A);

  MatSetValue(A, 0, 0, 1.0, INSERT_VALUES);
  MatSetValue(A, 0, numCols - 1, 1.0, INSERT_VALUES);
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  Vec b;
  ierr = VecCreate(PETSC_COMM_WORLD, &b); CHKERRQ(ierr);
  VecSetSizes(b, PETSC_DECIDE, numCols);
  VecSetFromOptions(b);
  VecSet(b, 0.0);
  VecSetValue(b, 0, 42.0, INSERT_VALUES);
  VecSetValue(b, numCols - 1, 58.0, INSERT_VALUES);
  VecAssemblyBegin(b);
  VecAssemblyEnd(b);

  Vec x;
  ierr = VecCreate(PETSC_COMM_WORLD, &x); CHKERRQ(ierr);
  VecSetSizes(x, PETSC_DECIDE, numRows);
  VecSetFromOptions(x);
  VecSet(x, 0.0);

  MatMult(A, b, x);
  PetscScalar result;
  VecSum(x, &result);
  PetscPrintf(PETSC_COMM_WORLD, "Result of mult: %f\n", result);
  PetscFinalize();
}
```

When this program is run on CPUs, it outputs 100.0, as expected.

When run on a single GPU with `-vec_type cuda -mat_type aijcusparse
-use_gpu_aware_mpi 0` it fails with
```

[0]PETSC ERROR: - Error Message
--

[0]PETSC ERROR: Argument out of range

[0]PETSC ERROR: 4294967294 is too big for cuBLAS, which may be restricted
to 32-bit integers

[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.

[0]PETSC ERROR: Petsc Release Version 3.19.4, unknown

[0]PETSC ERROR: ./gpu-bug on a  named sean-dgx2 by rohany Fri Aug 11
09:34:10 2023

[0]PETSC ERROR: Configure options --with-cuda=1
--prefix=/local/home/rohany/petsc/petsc-install/
--with-cuda-dir=/usr/local/cuda-11.7/ CXXFLAGS=-O3 COPTFLAGS=-O3
CXXOPTFLAGS=-O3 FOPTFLAGS=-O3 --download-fblaslapack=1 --with-debugging=0
--with-64-bit-indices

[0]PETSC ERROR: #1 checkCupmBlasIntCast() at
/local/home/rohany/petsc/include/petsc/private/cupmblasinterface.hpp:435

[0]PETSC ERROR: #2 VecAllocateCheck_() at
/local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:335

[0]PETSC ERROR: #3 VecCUPMAllocateCheck_() at
/local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:360

[0]PETSC ERROR: #4 DeviceAllocateCheck_() at
/local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:389

[0]PETSC ERROR: #5 GetArray() at
/local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:545

[0]PETSC ERROR: #6 VectorArray() at
/local/home/rohany/petsc/include/petsc/private/veccupmimpl.h:273

--

MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_SELF

with errorcode 63.


NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

You may or may not see output from other processes, depending on

exactly when Open MPI kills them.

--

```


and when run with just `-mat_type aijcusparse -use_gpu_aware_mpi 0` it
fails with

```

 ** On entry to cusparseCreateCsr(): dimension mismatch for
CUSPARSE_INDEX_32I, cols (4294967294) + base (0) > INT32_MAX (2147483647)


[0]PETSC ERROR: - Error Message
--

[0]PETSC ERROR: GPU error

[0]PETSC ERROR: cuSPARSE errorcode 3 (CUSPARSE_STATUS_INVALID_VALUE) :
invalid value

[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.

[0]PETSC ERROR: Petsc Release Version 3.19.4, unknown

[0]PETSC ERROR: ./gpu-bug on a  named sean-dgx2 by rohany Fri Aug 11
09:43:07 2023

[0]PETSC ERROR: Configure options --with-cuda=1
--prefix=/local/home/rohany/petsc/petsc-install/
--with-cuda-dir=/usr/local/cuda-11.7/ CXXFLAGS=-O3 COPTFLAGS=-O3

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Junchao Zhang
Hi, Marcos,
  Could you build petsc in debug mode and then copy and paste the whole
error stack message?

   Thanks
--Junchao Zhang


On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users <
petsc-users@mcs.anl.gov> wrote:

> Hi, I'm trying to run a parallel matrix vector build and linear solution
> with PETSc on 2 MPI processes + one V100 GPU. I tested that the matrix
> build and solution is successful in CPUs only. I'm using cuda 11.5 and cuda
> enabled openmpi and gcc 9.3. When I run the job with GPU enabled I get the
> following error:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress:
> an illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an
> illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> I'm new to submitting jobs in slurm that also use GPU resources, so I
> might be doing something wrong in my submission script. This is it:
>
> #!/bin/bash
> #SBATCH -J test
> #SBATCH -e /home/Issues/PETSc/test.err
> #SBATCH -o /home/Issues/PETSc/test.log
> #SBATCH --partition=batch
> #SBATCH --ntasks=2
> #SBATCH --nodes=1
> #SBATCH --cpus-per-task=1
> #SBATCH --ntasks-per-node=2
> #SBATCH --time=01:00:00
> #SBATCH --gres=gpu:1
>
> export OMP_NUM_THREADS=1
> module load cuda/11.5
> module load openmpi/4.1.1
>
> cd /home/Issues/PETSc
> mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds -vec_type
> mpicuda -mat_type mpiaijcusparse -pc_type gamg
>
> If anyone has any suggestions on how o troubleshoot this please let me
> know.
> Thanks!
> Marcos
>
>
>
>