Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-24 Thread Barry Smith


> On Aug 24, 2023, at 2:00 PM, Vanella, Marcos (Fed) wrote:
> 
> Thank you Barry, I will dial back the MPI_F08 use in our source code and try 
> compiling it. I haven't found much information regarding using MPI and 
> MPI_F08 in different modules other than the following link from several years 
> ago:
> 
> https://users.open-mpi.narkive.com/eCCG36Ni/ompi-fortran-problem-when-mixing-use-mpi-and-use-mpi-f08-with-gfortran-5
> 
> Looks like this has been fixed for openmpi and newer gfortran versions 
> because I don't have issues with this MPI lib/compiler combination. Same with 
> openmpi/ifort.
> What I find quite interesting is this: I assumed the PRIVATE statement in a module 
> would act as a backstop against propagating access to entities that are not 
> explicitly listed in the module's PUBLIC statement, including entities that 
> belong to other modules upstream and are visible through USE. This does not seem 
> to be the case here.

   I agree; you got seemingly inconsistent results across your different tests, 
so it could be a bug in the Fortran compiler's handling of these modules.


> 
> Best,
> Marcos
> 
>  
> From: Barry Smith <bsm...@petsc.dev>
> Sent: Thursday, August 24, 2023 12:40 PM
> To: Vanella, Marcos (Fed)
> Cc: PETSc users list; Guan, Collin X. (Fed) <collin.g...@nist.gov>
> Subject: Re: [petsc-users] CUDA error trying to run a job with two mpi 
> processes and 1 GPU
>  
> 
> PETSc uses the non-MPI_F08 Fortran modules, so I am guessing that when you also 
> use the MPI_F08 modules the compiler sees two sets of interfaces for the same 
> functions, hence the error. I am not sure it is portable to use PETSc with 
> the F08 Fortran modules in the same program or routine.
> 
> 
> 
> 
> 
>> On Aug 24, 2023, at 12:22 PM, Vanella, Marcos (Fed) via petsc-users 
>> <petsc-users@mcs.anl.gov> wrote:
>> 
>> Thank you Matt and Junchao. I've been testing further with nvhpc on summit. 
>> You might have an idea on what is going on here. 
>> These are my modules:
>> 
>> Currently Loaded Modules:
>>   1) lsf-tools/2.0   3) darshan-runtime/3.4.0-lite   5) DefApps   7) 
>> spectrum-mpi/10.4.0.3-20210112   9) nsight-systems/2021.3.1.54
>>   2) hsi/5.0.2.p5   4) xalt/1.2.1   6) nvhpc/22.11   8) 
>> nsight-compute/2021.2.1 10) cuda/11.7.1
>> 
>> I configured and compiled petsc with these options:
>> 
>> ./configure COPTFLAGS="-O2" CXXOPTFLAGS="-O2" FOPTFLAGS="-O2" 
>> FCOPTFLAGS="-O2" CUDAOPTFLAGS="-O2" --with-debugging=0 
>> --download-suitesparse --download-hypre --download-fblaslapack --with-cuda
>> 
>> without issues. The MPI checks did not go through as this was done in the 
>> login node.
>> 
>> Then, I started getting (similarly to what I saw with pgi and gcc in summit) 
>> ambiguous interface errors related to mpi routines. I was able to make a 
>> simple piece of code that reproduces this. It has to do with having a USE 
>> PETSC statement in a module (TEST_MOD) and a USE MPI_F08 on the main program 
>> (MAIN) using that module, even though the PRIVATE statement has been used in 
>> said (TEST_MOD) module.
>> 
>> MODULE TEST_MOD
>> ! In this module we use PETSC.
>> USE PETSC
>> !USE MPI
>> IMPLICIT NONE
>> PRIVATE
>> PUBLIC :: TEST1
>> 
>> CONTAINS
>> SUBROUTINE TEST1(A)
>> IMPLICIT NONE
>> REAL, INTENT(INOUT) :: A
>> INTEGER :: IERR
>> A=0.
>> ENDSUBROUTINE TEST1
>> 
>> ENDMODULE TEST_MOD
>> 
>> 
>> PROGRAM MAIN
>> 
>> ! Assume in main we use some MPI_F08 features.
>> USE MPI_F08
>> USE TEST_MOD, ONLY : TEST1
>> IMPLICIT NONE
>> INTEGER :: MY_RANK,IERR=0
>> INTEGER :: PNAMELEN=0
>> INTEGER :: PROVIDED
>> INTEGER, PARAMETER :: REQUIRED=MPI_THREAD_FUNNELED
>> REAL :: A=0.
>> CALL MPI_INIT_THREAD(REQUIRED,PROVIDED,IERR)
>> CALL MPI_COMM_RANK(MPI_COMM_WORLD, MY_RANK, IERR)
>> CALL TEST1(A)
>> CALL MPI_FINALIZE(IERR)
>> 
>> ENDPROGRAM MAIN
>> 
>> Leaving the USE PETSC statement in TEST_MOD this is what I get when trying 
>> to compile this code:
>> 
>> vanellam@login5 test_spectrum_issue $ mpifort -c 
>> -I"/autofs/nccs-svm1_home1/vanellam/Software/petsc/include/" 
>> -I"/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-c-opt-nvhpc/include"
>>   mpitest.f90
>> NVFORTRAN-S-0155-Ambiguous interfaces for generic procedure mpi_init_thread 
>> (mpitest.f90: 34)
>> NVFORTRAN-S-0155-Ambiguous interfaces for generic procedure mpi_finalize 
>> (mpitest.f90: 37)
>>   0 inform,   0 warnings,   2 severes, 0 fatal for main
>> 
>> Now, if I change USE PETSC by USE MPI in the module TEST_MOD compilation 
>> proceeds correctly. If I leave the USE PETSC statement in the module and 
>> change to USE MPI the statement in main compilation also goes through. So it 
>> seems to be something related to using the PETSC and MPI_F08 modules. My 
>> take is that it is related to spectrum-mpi, as I haven't had issues 
>> compiling the FDS+PETSc with openmpi in other systems.
>> 
>> Well please let me know if you have 

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-24 Thread Barry Smith

   PETSc uses the non-MPI_F08 Fortran modules, so I am guessing that when you also 
use the MPI_F08 modules the compiler sees two sets of interfaces for the same 
functions, hence the error. I am not sure it is portable to use PETSc with the 
F08 Fortran modules in the same program or routine.
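
   One way to avoid exposing both sets of MPI generics in the same program unit, 
sketched below, is to confine USE MPI_F08 to a small wrapper module with default 
PRIVATE accessibility, and have code that also uses PETSc call MPI only through 
the wrapper's public routines. This is only an illustrative sketch (the module 
and routine names here are made up, and it has not been verified against 
Spectrum MPI or nvfortran):

MODULE MPI_F08_WRAP
! Hypothetical wrapper: the only compilation unit that uses MPI_F08.
USE MPI_F08
IMPLICIT NONE
PRIVATE   ! do not re-export the MPI_F08 generics or constants
PUBLIC :: WRAP_INIT_THREAD_FUNNELED, WRAP_COMM_RANK, WRAP_FINALIZE

CONTAINS
SUBROUTINE WRAP_INIT_THREAD_FUNNELED(PROVIDED,IERR)
INTEGER, INTENT(OUT) :: PROVIDED,IERR
CALL MPI_INIT_THREAD(MPI_THREAD_FUNNELED,PROVIDED,IERR)
ENDSUBROUTINE WRAP_INIT_THREAD_FUNNELED

SUBROUTINE WRAP_COMM_RANK(MY_RANK,IERR)
INTEGER, INTENT(OUT) :: MY_RANK,IERR
CALL MPI_COMM_RANK(MPI_COMM_WORLD,MY_RANK,IERR)
ENDSUBROUTINE WRAP_COMM_RANK

SUBROUTINE WRAP_FINALIZE(IERR)
INTEGER, INTENT(OUT) :: IERR
CALL MPI_FINALIZE(IERR)
ENDSUBROUTINE WRAP_FINALIZE

ENDMODULE MPI_F08_WRAP


PROGRAM MAIN
! MAIN no longer use-associates MPI_F08 next to PETSc-based modules (TEST_MOD
! from the reproducer below), so only one set of MPI interfaces is visible here.
USE MPI_F08_WRAP, ONLY : WRAP_INIT_THREAD_FUNNELED, WRAP_COMM_RANK, WRAP_FINALIZE
USE TEST_MOD, ONLY : TEST1
IMPLICIT NONE
INTEGER :: MY_RANK,PROVIDED,IERR
REAL :: A=0.
CALL WRAP_INIT_THREAD_FUNNELED(PROVIDED,IERR)
CALL WRAP_COMM_RANK(MY_RANK,IERR)
CALL TEST1(A)
CALL WRAP_FINALIZE(IERR)
ENDPROGRAM MAIN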





> On Aug 24, 2023, at 12:22 PM, Vanella, Marcos (Fed) via petsc-users wrote:
> 
> Thank you Matt and Junchao. I've been testing further with nvhpc on summit. 
> You might have an idea on what is going on here. 
> These are my modules:
> 
> Currently Loaded Modules:
>   1) lsf-tools/2.0   3) darshan-runtime/3.4.0-lite   5) DefApps   7) 
> spectrum-mpi/10.4.0.3-20210112   9) nsight-systems/2021.3.1.54
>   2) hsi/5.0.2.p5   4) xalt/1.2.1   6) nvhpc/22.11   8) 
> nsight-compute/2021.2.1 10) cuda/11.7.1
> 
> I configured and compiled petsc with these options:
> 
> ./configure COPTFLAGS="-O2" CXXOPTFLAGS="-O2" FOPTFLAGS="-O2" 
> FCOPTFLAGS="-O2" CUDAOPTFLAGS="-O2" --with-debugging=0 --download-suitesparse 
> --download-hypre --download-fblaslapack --with-cuda
> 
> without issues. The MPI checks did not go through as this was done in the 
> login node.
> 
> Then, I started getting (similarly to what I saw with pgi and gcc in summit) 
> ambiguous interface errors related to mpi routines. I was able to make a 
> simple piece of code that reproduces this. It has to do with having a USE 
> PETSC statement in a module (TEST_MOD) and a USE MPI_F08 on the main program 
> (MAIN) using that module, even though the PRIVATE statement has been used in 
> said (TEST_MOD) module.
> 
> MODULE TEST_MOD
> ! In this module we use PETSC.
> USE PETSC
> !USE MPI
> IMPLICIT NONE
> PRIVATE
> PUBLIC :: TEST1
> 
> CONTAINS
> SUBROUTINE TEST1(A)
> IMPLICIT NONE
> REAL, INTENT(INOUT) :: A
> INTEGER :: IERR
> A=0.
> ENDSUBROUTINE TEST1
> 
> ENDMODULE TEST_MOD
> 
> 
> PROGRAM MAIN
> 
> ! Assume in main we use some MPI_F08 features.
> USE MPI_F08
> USE TEST_MOD, ONLY : TEST1
> IMPLICIT NONE
> INTEGER :: MY_RANK,IERR=0
> INTEGER :: PNAMELEN=0
> INTEGER :: PROVIDED
> INTEGER, PARAMETER :: REQUIRED=MPI_THREAD_FUNNELED
> REAL :: A=0.
> CALL MPI_INIT_THREAD(REQUIRED,PROVIDED,IERR)
> CALL MPI_COMM_RANK(MPI_COMM_WORLD, MY_RANK, IERR)
> CALL TEST1(A)
> CALL MPI_FINALIZE(IERR)
> 
> ENDPROGRAM MAIN
> 
> Leaving the USE PETSC statement in TEST_MOD this is what I get when trying to 
> compile this code:
> 
> vanellam@login5 test_spectrum_issue $ mpifort -c 
> -I"/autofs/nccs-svm1_home1/vanellam/Software/petsc/include/" 
> -I"/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-c-opt-nvhpc/include"
>   mpitest.f90
> NVFORTRAN-S-0155-Ambiguous interfaces for generic procedure mpi_init_thread 
> (mpitest.f90: 34)
> NVFORTRAN-S-0155-Ambiguous interfaces for generic procedure mpi_finalize 
> (mpitest.f90: 37)
>   0 inform,   0 warnings,   2 severes, 0 fatal for main
> 
> Now, if I change USE PETSC by USE MPI in the module TEST_MOD compilation 
> proceeds correctly. If I leave the USE PETSC statement in the module and 
> change to USE MPI the statement in main compilation also goes through. So it 
> seems to be something related to using the PETSC and MPI_F08 modules. My take 
> is that it is related to spectrum-mpi, as I haven't had issues compiling the 
> FDS+PETSc with openmpi in other systems.
> 
> Well please let me know if you have any ideas on what might be going on. I'll 
> move to polaris and try with mpich too.
> 
> Thanks!
> Marcos
> 
> 
> From: Junchao Zhang <junchao.zh...@gmail.com>
> Sent: Tuesday, August 22, 2023 5:25 PM
> To: Matthew Knepley <knep...@gmail.com>
> Cc: Vanella, Marcos (Fed); PETSc users list; Guan, Collin X. (Fed) 
> <collin.g...@nist.gov>
> Subject: Re: [petsc-users] CUDA error trying to run a job with two mpi 
> processes and 1 GPU
>  
> Macros,
>   yes, refer to the example script Matt mentioned for Summit.  Feel free to 
> turn on/off options in the file.  In my experience, gcc is easier to use.
>   Also, I found 
> https://docs.alcf.anl.gov/polaris/running-jobs/#binding-mpi-ranks-to-gpus, 
> which might be similar to your machine (4 GPUs per node).  The key point is: 
> The Cray MPI on Polaris does not currently support binding MPI ranks to GPUs. 
> For applications that need this support, this instead can be handled by use 
> of a small helper script that will appropriately set CUDA_VISIBLE_DEVICES for 
> each MPI rank.
>   So you can try the helper script set_affinity_gpu_polaris.sh to manually 
> set  CUDA_VISIBLE_DEVICES.  In other words, make the script on your PATH and 
> then run your job with
>   srun -N 2 -n 16 set_affinity_gpu_polaris.sh 
> /home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux 
> test.fds -pc_type gamg -mat_type aijcusparse -vec_type cuda
> 
>   Then, check again with nvidia-smi to see if GPU memory is evenly allocated.
> --Junchao Zhang
> 
> 
> On Tue, Aug 

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-24 Thread Vanella, Marcos (Fed) via petsc-users
Thank you Matt and Junchao. I've been testing further with nvhpc on summit. You 
might have an idea on what is going on here.
These are my modules:

Currently Loaded Modules:
  1) lsf-tools/2.0   3) darshan-runtime/3.4.0-lite   5) DefApps   7) 
spectrum-mpi/10.4.0.3-20210112   9) nsight-systems/2021.3.1.54
  2) hsi/5.0.2.p5   4) xalt/1.2.1   6) nvhpc/22.11   8) 
nsight-compute/2021.2.1 10) cuda/11.7.1

I configured and compiled petsc with these options:

./configure COPTFLAGS="-O2" CXXOPTFLAGS="-O2" FOPTFLAGS="-O2" FCOPTFLAGS="-O2" 
CUDAOPTFLAGS="-O2" --with-debugging=0 --download-suitesparse --download-hypre 
--download-fblaslapack --with-cuda

without issues. The MPI checks did not go through as this was done in the login 
node.

Then I started getting ambiguous interface errors related to MPI routines (similar 
to what I saw with pgi and gcc on Summit). I was able to write a simple piece of 
code that reproduces this. It comes down to having a USE PETSC statement in a 
module (TEST_MOD) and a USE MPI_F08 statement in the main program (MAIN) that uses 
that module, even though the PRIVATE statement is present in said (TEST_MOD) module.

MODULE TEST_MOD
! In this module we use PETSC.
USE PETSC
!USE MPI
IMPLICIT NONE
PRIVATE
PUBLIC :: TEST1

CONTAINS
SUBROUTINE TEST1(A)
IMPLICIT NONE
REAL, INTENT(INOUT) :: A
INTEGER :: IERR
A=0.
ENDSUBROUTINE TEST1

ENDMODULE TEST_MOD


PROGRAM MAIN

! Assume in main we use some MPI_F08 features.
USE MPI_F08
USE TEST_MOD, ONLY : TEST1
IMPLICIT NONE
INTEGER :: MY_RANK,IERR=0
INTEGER :: PNAMELEN=0
INTEGER :: PROVIDED
INTEGER, PARAMETER :: REQUIRED=MPI_THREAD_FUNNELED
REAL :: A=0.
CALL MPI_INIT_THREAD(REQUIRED,PROVIDED,IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, MY_RANK, IERR)
CALL TEST1(A)
CALL MPI_FINALIZE(IERR)

ENDPROGRAM MAIN

Leaving the USE PETSC statement in TEST_MOD this is what I get when trying to 
compile this code:

vanellam@login5 test_spectrum_issue $ mpifort -c 
-I"/autofs/nccs-svm1_home1/vanellam/Software/petsc/include/" 
-I"/autofs/nccs-svm1_home1/vanellam/Software/petsc/arch-linux-c-opt-nvhpc/include"
  mpitest.f90
NVFORTRAN-S-0155-Ambiguous interfaces for generic procedure mpi_init_thread 
(mpitest.f90: 34)
NVFORTRAN-S-0155-Ambiguous interfaces for generic procedure mpi_finalize 
(mpitest.f90: 37)
  0 inform,   0 warnings,   2 severes, 0 fatal for main

Now, if I replace USE PETSC with USE MPI in the module TEST_MOD, compilation 
proceeds correctly. If I leave the USE PETSC statement in the module and change 
the statement in MAIN to USE MPI, compilation also goes through. So it seems to 
be something related to combining the PETSC and MPI_F08 modules. My take is that 
it is related to spectrum-mpi, as I haven't had issues compiling FDS+PETSc 
with openmpi on other systems.
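
For reference, this is a minimal sketch of the MAIN variant described in the 
previous paragraph that does compile here (only the USE MPI_F08 statement is 
replaced by USE MPI, so MAIN takes its MPI interfaces from the same non-F08 
module that PETSc itself uses; everything else is the reproducer above, unchanged):

PROGRAM MAIN
! Variant of the reproducer: use the older mpi module in MAIN.
USE MPI
USE TEST_MOD, ONLY : TEST1
IMPLICIT NONE
INTEGER :: MY_RANK,IERR=0
INTEGER :: PROVIDED
INTEGER, PARAMETER :: REQUIRED=MPI_THREAD_FUNNELED
REAL :: A=0.
CALL MPI_INIT_THREAD(REQUIRED,PROVIDED,IERR)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, MY_RANK, IERR)
CALL TEST1(A)
CALL MPI_FINALIZE(IERR)
ENDPROGRAM MAIN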

Well please let me know if you have any ideas on what might be going on. I'll 
move to polaris and try with mpich too.

Thanks!
Marcos



From: Junchao Zhang
Sent: Tuesday, August 22, 2023 5:25 PM
To: Matthew Knepley
Cc: Vanella, Marcos (Fed); PETSc users list; Guan, Collin X. (Fed)
Subject: Re: [petsc-users] CUDA error trying to run a job with two mpi 
processes and 1 GPU

Macros,
  yes, refer to the example script Matt mentioned for Summit.  Feel free to 
turn on/off options in the file.  In my experience, gcc is easier to use.
  Also, I found 
https://docs.alcf.anl.gov/polaris/running-jobs/#binding-mpi-ranks-to-gpus, 
which might be similar to your machine (4 GPUs per node).  The key point is: 
The Cray MPI on Polaris does not currently support binding MPI ranks to GPUs. 
For applications that need this support, this instead can be handled by use of 
a small helper script that will appropriately set CUDA_VISIBLE_DEVICES for each 
MPI rank.
  So you can try the helper script set_affinity_gpu_polaris.sh to manually set  
CUDA_VISIBLE_DEVICES.  In other words, make the script on your PATH and then 
run your job with
  srun -N 2 -n 16 set_affinity_gpu_polaris.sh 
/home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds 
-pc_type gamg -mat_type aijcusparse -vec_type cuda

  Then, check again with nvidia-smi to see if GPU memory is evenly allocated.
--Junchao Zhang


On Tue, Aug 22, 2023 at 3:03 PM Matthew Knepley 
<knep...@gmail.com> wrote:
On Tue, Aug 22, 2023 at 2:54 PM Vanella, Marcos (Fed) via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
Hi Junchao, both the slurm scontrol show job_id -dd and looking at 
CUDA_VISIBLE_DEVICES does not provide information about which MPI process is 
associated to which GPU in the node in our system. I can see this with 
nvidia-smi, but if you have any other suggestion using slurm I would like to 
hear it.

I've been trying to compile the code+Petsc in summit, but have been having all 
sorts of issues related to spectrum-mpi, and the different compilers they 
provide (I tried gcc, nvhpc, pgi, xl. Some of them don't handle Fortran 2018, 
others give issues of 

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-22 Thread Junchao Zhang
Macros,
  yes, refer to the example script Matt mentioned for Summit.  Feel free to
turn on/off options in the file.  In my experience, gcc is easier to use.
  Also, I found
https://docs.alcf.anl.gov/polaris/running-jobs/#binding-mpi-ranks-to-gpus,
which might be similar to your machine (4 GPUs per node).  The key point
is: The Cray MPI on Polaris does not currently support binding MPI ranks to
GPUs. For applications that need this support, this instead can be handled
by use of a small helper script that will appropriately set
CUDA_VISIBLE_DEVICES
for each MPI rank.
  So you can try the helper script set_affinity_gpu_polaris.sh to manually
set  CUDA_VISIBLE_DEVICES.  In other words, make the script on your PATH
and then run your job with
  srun -N 2 -n 16 set_affinity_gpu_polaris.sh
/home/mnv/Firemodels_fork/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux
test.fds -pc_type gamg -mat_type aijcusparse -vec_type cuda

  Then, check again with nvidia-smi to see if GPU memory is evenly
allocated.
--Junchao Zhang


On Tue, Aug 22, 2023 at 3:03 PM Matthew Knepley  wrote:

> On Tue, Aug 22, 2023 at 2:54 PM Vanella, Marcos (Fed) via petsc-users <
> petsc-users@mcs.anl.gov> wrote:
>
>> Hi Junchao, both the slurm scontrol show job_id -dd and looking at
>> CUDA_VISIBLE_DEVICES does not provide information about which MPI
>> process is associated to which GPU in the node in our system. I can see
>> this with nvidia-smi, but if you have any other suggestion using slurm I
>> would like to hear it.
>>
>> I've been trying to compile the code+Petsc in summit, but have been
>> having all sorts of issues related to spectrum-mpi, and the different
>> compilers they provide (I tried gcc, nvhpc, pgi, xl. Some of them don't
>> handle Fortran 2018, others give issues of repeated MPI definitions, etc.).
>>
>
> The PETSc configure examples are in the repository:
>
>
> https://gitlab.com/petsc/petsc/-/blob/main/config/examples/arch-olcf-summit-opt.py?ref_type=heads
>
> Thanks,
>
>   Matt
>
>
>> I also wanted to ask you, do you know if it is possible to compile PETSc
>> with the xl/16.1.1-10 suite?
>>
>> Thanks!
>>
>> I configured the library --with-cuda and when compiling I get a
>> compilation error with CUDAC:
>>
>> CUDAC arch-linux-opt-xl/obj/src/sys/classes/random/impls/curand/curand2.o
>> In file included from
>> /autofs/nccs-svm1_home1/vanellam/Software/petsc/src/sys/classes/random/impls/curand/
>> curand2.cu:1:
>> In file included from
>> /autofs/nccs-svm1_home1/vanellam/Software/petsc/include/petsc/private/randomimpl.h:5:
>> In file included from
>> /autofs/nccs-svm1_home1/vanellam/Software/petsc/include/petsc/private/petscimpl.h:7:
>> In file included from
>> /autofs/nccs-svm1_home1/vanellam/Software/petsc/include/petscsys.h:44:
>> In file included from
>> /autofs/nccs-svm1_home1/vanellam/Software/petsc/include/petscsystypes.h:532:
>> In file included from /sw/summit/cuda/11.7.1/include/thrust/complex.h:24:
>> In file included from
>> /sw/summit/cuda/11.7.1/include/thrust/detail/config.h:23:
>> In file included from
>> /sw/summit/cuda/11.7.1/include/thrust/detail/config/config.h:27:
>> /sw/summit/cuda/11.7.1/include/thrust/detail/config/cpp_dialect.h:112:6:
>> warning: Thrust requires at least Clang 7.0. Define
>> THRUST_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message.
>> [-W#pragma-messages]
>>  THRUST_COMPILER_DEPRECATION(Clang 7.0);
>>  ^
>> /sw/summit/cuda/11.7.1/include/thrust/detail/config/cpp_dialect.h:101:3:
>> note: expanded from macro 'THRUST_COMPILER_DEPRECATION'
>>   THRUST_COMP_DEPR_IMPL(Thrust requires at least REQ. Define
>> THRUST_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message.)
>>   ^
>> /sw/summit/cuda/11.7.1/include/thrust/detail/config/cpp_dialect.h:95:38:
>> note: expanded from macro 'THRUST_COMP_DEPR_IMPL'
>> #  define THRUST_COMP_DEPR_IMPL(msg) THRUST_COMP_DEPR_IMPL0(GCC warning
>> #msg)
>>  ^
>> /sw/summit/cuda/11.7.1/include/thrust/detail/config/cpp_dialect.h:96:40:
>> note: expanded from macro 'THRUST_COMP_DEPR_IMPL0'
>> #  define THRUST_COMP_DEPR_IMPL0(expr) _Pragma(#expr)
>>^
>> :141:6: note: expanded from here
>>  GCC warning "Thrust requires at least Clang 7.0. Define
>> THRUST_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message."
>>  ^
>> In file included from
>> /autofs/nccs-svm1_home1/vanellam/Software/petsc/src/sys/classes/random/impls/curand/
>> curand2.cu:2:
>> In file included from
>> /sw/summit/cuda/11.7.1/include/thrust/transform.h:721:
>> In file included from
>> /sw/summit/cuda/11.7.1/include/thrust/detail/transform.inl:27:
>> In file included from
>> /sw/summit/cuda/11.7.1/include/thrust/system/detail/generic/transform.h:104:
>> In file included from
>> /sw/summit/cuda/11.7.1/include/thrust/system/detail/generic/transform.inl:19:
>> In file included from
>> /sw/summit/cuda/11.7.1/include/thrust/for_each.h:277:
>> In file included from
>> 

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-22 Thread Matthew Knepley
On Tue, Aug 22, 2023 at 2:54 PM Vanella, Marcos (Fed) via petsc-users <
petsc-users@mcs.anl.gov> wrote:

> Hi Junchao, both the slurm scontrol show job_id -dd and looking at
> CUDA_VISIBLE_DEVICES does not provide information about which MPI process
> is associated to which GPU in the node in our system. I can see this with
> nvidia-smi, but if you have any other suggestion using slurm I would like
> to hear it.
>
> I've been trying to compile the code+Petsc in summit, but have been having
> all sorts of issues related to spectrum-mpi, and the different compilers
> they provide (I tried gcc, nvhpc, pgi, xl. Some of them don't handle
> Fortran 2018, others give issues of repeated MPI definitions, etc.).
>

The PETSc configure examples are in the repository:


https://gitlab.com/petsc/petsc/-/blob/main/config/examples/arch-olcf-summit-opt.py?ref_type=heads

Thanks,

  Matt


> I also wanted to ask you, do you know if it is possible to compile PETSc
> with the xl/16.1.1-10 suite?
>
> Thanks!
>
> I configured the library --with-cuda and when compiling I get a
> compilation error with CUDAC:
>
> CUDAC arch-linux-opt-xl/obj/src/sys/classes/random/impls/curand/curand2.o
> In file included from
> /autofs/nccs-svm1_home1/vanellam/Software/petsc/src/sys/classes/random/impls/curand/
> curand2.cu:1:
> In file included from
> /autofs/nccs-svm1_home1/vanellam/Software/petsc/include/petsc/private/randomimpl.h:5:
> In file included from
> /autofs/nccs-svm1_home1/vanellam/Software/petsc/include/petsc/private/petscimpl.h:7:
> In file included from
> /autofs/nccs-svm1_home1/vanellam/Software/petsc/include/petscsys.h:44:
> In file included from
> /autofs/nccs-svm1_home1/vanellam/Software/petsc/include/petscsystypes.h:532:
> In file included from /sw/summit/cuda/11.7.1/include/thrust/complex.h:24:
> In file included from
> /sw/summit/cuda/11.7.1/include/thrust/detail/config.h:23:
> In file included from
> /sw/summit/cuda/11.7.1/include/thrust/detail/config/config.h:27:
> /sw/summit/cuda/11.7.1/include/thrust/detail/config/cpp_dialect.h:112:6:
> warning: Thrust requires at least Clang 7.0. Define
> THRUST_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message.
> [-W#pragma-messages]
>  THRUST_COMPILER_DEPRECATION(Clang 7.0);
>  ^
> /sw/summit/cuda/11.7.1/include/thrust/detail/config/cpp_dialect.h:101:3:
> note: expanded from macro 'THRUST_COMPILER_DEPRECATION'
>   THRUST_COMP_DEPR_IMPL(Thrust requires at least REQ. Define
> THRUST_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message.)
>   ^
> /sw/summit/cuda/11.7.1/include/thrust/detail/config/cpp_dialect.h:95:38:
> note: expanded from macro 'THRUST_COMP_DEPR_IMPL'
> #  define THRUST_COMP_DEPR_IMPL(msg) THRUST_COMP_DEPR_IMPL0(GCC warning
> #msg)
>  ^
> /sw/summit/cuda/11.7.1/include/thrust/detail/config/cpp_dialect.h:96:40:
> note: expanded from macro 'THRUST_COMP_DEPR_IMPL0'
> #  define THRUST_COMP_DEPR_IMPL0(expr) _Pragma(#expr)
>^
> :141:6: note: expanded from here
>  GCC warning "Thrust requires at least Clang 7.0. Define
> THRUST_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message."
>  ^
> In file included from
> /autofs/nccs-svm1_home1/vanellam/Software/petsc/src/sys/classes/random/impls/curand/
> curand2.cu:2:
> In file included from
> /sw/summit/cuda/11.7.1/include/thrust/transform.h:721:
> In file included from
> /sw/summit/cuda/11.7.1/include/thrust/detail/transform.inl:27:
> In file included from
> /sw/summit/cuda/11.7.1/include/thrust/system/detail/generic/transform.h:104:
> In file included from
> /sw/summit/cuda/11.7.1/include/thrust/system/detail/generic/transform.inl:19:
> In file included from /sw/summit/cuda/11.7.1/include/thrust/for_each.h:277:
> In file included from
> /sw/summit/cuda/11.7.1/include/thrust/detail/for_each.inl:27:
> In file included from
> /sw/summit/cuda/11.7.1/include/thrust/system/detail/adl/for_each.h:42:
> In file included from
> /sw/summit/cuda/11.7.1/include/thrust/system/cuda/detail/for_each.h:35:
> In file included from
> /sw/summit/cuda/11.7.1/include/thrust/system/cuda/detail/util.h:36:
> In file included from
> /sw/summit/cuda/11.7.1/include/cub/detail/device_synchronize.cuh:19:
> In file included from /sw/summit/cuda/11.7.1/include/cub/util_arch.cuh:36:
> /sw/summit/cuda/11.7.1/include/cub/util_cpp_dialect.cuh:123:6: warning:
> CUB requires at least Clang 7.0. Define CUB_IGNORE_DEPRECATED_CPP_DIALECT
> to suppress this message. [-W#pragma-messages]
>  CUB_COMPILER_DEPRECATION(Clang 7.0);
>  ^
> /sw/summit/cuda/11.7.1/include/cub/util_cpp_dialect.cuh:112:3: note:
> expanded from macro 'CUB_COMPILER_DEPRECATION'
>   CUB_COMP_DEPR_IMPL(CUB requires at least REQ. Define
> CUB_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message.)
>   ^
> /sw/summit/cuda/11.7.1/include/cub/util_cpp_dialect.cuh:106:35: note:
> expanded from macro 'CUB_COMP_DEPR_IMPL'
> #  define CUB_COMP_DEPR_IMPL(msg) CUB_COMP_DEPR_IMPL0(GCC warning #msg)
>  

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-22 Thread Vanella, Marcos (Fed) via petsc-users
Hi Junchao, neither slurm's scontrol show job_id -dd nor looking at 
CUDA_VISIBLE_DEVICES provides information about which MPI process is 
associated with which GPU on the node in our system. I can see this with 
nvidia-smi, but if you have any other suggestion using slurm I would like to 
hear it.

I've been trying to compile the code+Petsc in summit, but have been having all 
sorts of issues related to spectrum-mpi, and the different compilers they 
provide (I tried gcc, nvhpc, pgi, xl. Some of them don't handle Fortran 2018, 
others give issues of repeated MPI definitions, etc.).

I also wanted to ask you, do you know if it is possible to compile PETSc with 
the xl/16.1.1-10 suite?

Thanks!

I configured the library --with-cuda and when compiling I get a compilation 
error with CUDAC:

CUDAC arch-linux-opt-xl/obj/src/sys/classes/random/impls/curand/curand2.o
In file included from 
/autofs/nccs-svm1_home1/vanellam/Software/petsc/src/sys/classes/random/impls/curand/curand2.cu:1:
In file included from 
/autofs/nccs-svm1_home1/vanellam/Software/petsc/include/petsc/private/randomimpl.h:5:
In file included from 
/autofs/nccs-svm1_home1/vanellam/Software/petsc/include/petsc/private/petscimpl.h:7:
In file included from 
/autofs/nccs-svm1_home1/vanellam/Software/petsc/include/petscsys.h:44:
In file included from 
/autofs/nccs-svm1_home1/vanellam/Software/petsc/include/petscsystypes.h:532:
In file included from /sw/summit/cuda/11.7.1/include/thrust/complex.h:24:
In file included from /sw/summit/cuda/11.7.1/include/thrust/detail/config.h:23:
In file included from 
/sw/summit/cuda/11.7.1/include/thrust/detail/config/config.h:27:
/sw/summit/cuda/11.7.1/include/thrust/detail/config/cpp_dialect.h:112:6: 
warning: Thrust requires at least Clang 7.0. Define 
THRUST_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message. 
[-W#pragma-messages]
 THRUST_COMPILER_DEPRECATION(Clang 7.0);
 ^
/sw/summit/cuda/11.7.1/include/thrust/detail/config/cpp_dialect.h:101:3: note: 
expanded from macro 'THRUST_COMPILER_DEPRECATION'
  THRUST_COMP_DEPR_IMPL(Thrust requires at least REQ. Define 
THRUST_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message.)
  ^
/sw/summit/cuda/11.7.1/include/thrust/detail/config/cpp_dialect.h:95:38: note: 
expanded from macro 'THRUST_COMP_DEPR_IMPL'
#  define THRUST_COMP_DEPR_IMPL(msg) THRUST_COMP_DEPR_IMPL0(GCC warning #msg)
 ^
/sw/summit/cuda/11.7.1/include/thrust/detail/config/cpp_dialect.h:96:40: note: 
expanded from macro 'THRUST_COMP_DEPR_IMPL0'
#  define THRUST_COMP_DEPR_IMPL0(expr) _Pragma(#expr)
   ^
:141:6: note: expanded from here
 GCC warning "Thrust requires at least Clang 7.0. Define 
THRUST_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message."
 ^
In file included from 
/autofs/nccs-svm1_home1/vanellam/Software/petsc/src/sys/classes/random/impls/curand/curand2.cu:2:
In file included from /sw/summit/cuda/11.7.1/include/thrust/transform.h:721:
In file included from 
/sw/summit/cuda/11.7.1/include/thrust/detail/transform.inl:27:
In file included from 
/sw/summit/cuda/11.7.1/include/thrust/system/detail/generic/transform.h:104:
In file included from 
/sw/summit/cuda/11.7.1/include/thrust/system/detail/generic/transform.inl:19:
In file included from /sw/summit/cuda/11.7.1/include/thrust/for_each.h:277:
In file included from 
/sw/summit/cuda/11.7.1/include/thrust/detail/for_each.inl:27:
In file included from 
/sw/summit/cuda/11.7.1/include/thrust/system/detail/adl/for_each.h:42:
In file included from 
/sw/summit/cuda/11.7.1/include/thrust/system/cuda/detail/for_each.h:35:
In file included from 
/sw/summit/cuda/11.7.1/include/thrust/system/cuda/detail/util.h:36:
In file included from 
/sw/summit/cuda/11.7.1/include/cub/detail/device_synchronize.cuh:19:
In file included from /sw/summit/cuda/11.7.1/include/cub/util_arch.cuh:36:
/sw/summit/cuda/11.7.1/include/cub/util_cpp_dialect.cuh:123:6: warning: CUB 
requires at least Clang 7.0. Define CUB_IGNORE_DEPRECATED_CPP_DIALECT to 
suppress this message. [-W#pragma-messages]
 CUB_COMPILER_DEPRECATION(Clang 7.0);
 ^
/sw/summit/cuda/11.7.1/include/cub/util_cpp_dialect.cuh:112:3: note: expanded 
from macro 'CUB_COMPILER_DEPRECATION'
  CUB_COMP_DEPR_IMPL(CUB requires at least REQ. Define 
CUB_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message.)
  ^
/sw/summit/cuda/11.7.1/include/cub/util_cpp_dialect.cuh:106:35: note: expanded 
from macro 'CUB_COMP_DEPR_IMPL'
#  define CUB_COMP_DEPR_IMPL(msg) CUB_COMP_DEPR_IMPL0(GCC warning #msg)
  ^
/sw/summit/cuda/11.7.1/include/cub/util_cpp_dialect.cuh:107:37: note: expanded 
from macro 'CUB_COMP_DEPR_IMPL0'
#  define CUB_COMP_DEPR_IMPL0(expr) _Pragma(#expr)
^
:198:6: note: expanded from here
 GCC warning "CUB requires at least Clang 7.0. Define 
CUB_IGNORE_DEPRECATED_CPP_DIALECT to suppress this message."
 ^

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-21 Thread Junchao Zhang
That is a good question.  Looking at
https://slurm.schedmd.com/gres.html#GPU_Management,  I was wondering if you
can share the output of your job so we can search CUDA_VISIBLE_DEVICES and
see how GPUs were allocated.

--Junchao Zhang


On Mon, Aug 21, 2023 at 2:38 PM Vanella, Marcos (Fed) <
marcos.vane...@nist.gov> wrote:

> Ok thanks Junchao, so is GPU 0 actually allocating memory for the 8 MPI
> processes meshes but only working on 2 of them?
> It says in the script it has allocated 2.4GB
> Best,
> Marcos
> --
> From: Junchao Zhang 
> Sent: Monday, August 21, 2023 3:29 PM
> To: Vanella, Marcos (Fed) 
> Cc: PETSc users list; Guan, Collin X. (Fed) <collin.g...@nist.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi
> processes and 1 GPU
>
> Hi, Macros,
>   If you look at the PIDs of the nvidia-smi output, you will only find 8
> unique PIDs, which is expected since you allocated 8 MPI ranks per node.
>   The duplicate PIDs are usually for threads spawned by the MPI runtime
> (for example, progress threads in MPI implementation).   So your job script
> and output are all good.
>
>   Thanks.
>
> On Mon, Aug 21, 2023 at 2:00 PM Vanella, Marcos (Fed) <
> marcos.vane...@nist.gov> wrote:
>
> Hi Junchao, something I'm noting related to running with cuda enabled
> linear solvers (CG+HYPRE, CG+GAMG) is that for multi cpu-multi gpu
> calculations, the GPU 0 in the node is taking what seems to be all
> sub-matrices corresponding to all the MPI processes in the node. This is
> the result of the nvidia-smi command on a node with 8 MPI processes (each
> advancing the same number of unknowns in the calculation) and 4 GPU V100s:
>
> Mon Aug 21 14:36:07 2023
>
> +---+
> | NVIDIA-SMI 535.54.03  Driver Version: 535.54.03CUDA
> Version: 12.2 |
>
> |-+--+--+
> | GPU  Name Persistence-M | Bus-IdDisp.A |
> Volatile Uncorr. ECC |
> | Fan  Temp   Perf  Pwr:Usage/Cap | Memory-Usage |
> GPU-Util  Compute M. |
> | |  |
>   MIG M. |
>
> |=+==+==|
> |   0  Tesla V100-SXM2-16GB   On  | 0004:04:00.0 Off |
>0 |
> | N/A   34CP0  63W / 300W |   2488MiB / 16384MiB |  0%
>  Default |
> | |  |
>  N/A |
>
> +-+--+--+
> |   1  Tesla V100-SXM2-16GB   On  | 0004:05:00.0 Off |
>0 |
> | N/A   38CP0  56W / 300W |638MiB / 16384MiB |  0%
>  Default |
> | |  |
>  N/A |
>
> +-+--+--+
> |   2  Tesla V100-SXM2-16GB   On  | 0035:03:00.0 Off |
>0 |
> | N/A   35CP0  52W / 300W |638MiB / 16384MiB |  0%
>  Default |
> | |  |
>  N/A |
>
> +-+--+--+
> |   3  Tesla V100-SXM2-16GB   On  | 0035:04:00.0 Off |
>0 |
> | N/A   38CP0  53W / 300W |638MiB / 16384MiB |  0%
>  Default |
> | |  |
>  N/A |
>
> +-+--+--+
>
>
>
> +---+
> | Processes:
>  |
> |  GPU   GI   CIPID   Type   Process name
>GPU Memory |
> |ID   ID
>   Usage  |
>
> |===|
> |0   N/A  N/A214626  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  318MiB |
> |0   N/A  N/A214627  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  308MiB |
> |0   N/A  N/A214628  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  308MiB |
> |0   N/A  N/A214629  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  308MiB |
> |0   N/A  N/A214630  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  318MiB |
> |0   N/A  N/A214631  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  308MiB |
> |0   N/A  N/A214632  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  308MiB |
> |0   N/A  N/A214633  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  308MiB |
> |1   N/A  N/A214627  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  318MiB |
> |1   

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-21 Thread Vanella, Marcos (Fed) via petsc-users
Ok thanks Junchao, so is GPU 0 actually allocating memory for the 8 MPI 
processes meshes but only working on 2 of them?
It says in the script it has allocated 2.4GB
Best,
Marcos

From: Junchao Zhang 
Sent: Monday, August 21, 2023 3:29 PM
To: Vanella, Marcos (Fed) 
Cc: PETSc users list ; Guan, Collin X. (Fed) 

Subject: Re: [petsc-users] CUDA error trying to run a job with two mpi 
processes and 1 GPU

Hi, Macros,
  If you look at the PIDs of the nvidia-smi output, you will only find 8 unique 
PIDs, which is expected since you allocated 8 MPI ranks per node.
  The duplicate PIDs are usually for threads spawned by the MPI runtime (for 
example, progress threads in MPI implementation).   So your job script and 
output are all good.

  Thanks.

On Mon, Aug 21, 2023 at 2:00 PM Vanella, Marcos (Fed) 
<marcos.vane...@nist.gov> wrote:
Hi Junchao, something I'm noting related to running with cuda enabled linear 
solvers (CG+HYPRE, CG+GAMG) is that for multi cpu-multi gpu calculations, the 
GPU 0 in the node is taking what seems to be all sub-matrices corresponding to 
all the MPI processes in the node. This is the result of the nvidia-smi command 
on a node with 8 MPI processes (each advancing the same number of unknowns in 
the calculation) and 4 GPU V100s:

Mon Aug 21 14:36:07 2023
+---+
| NVIDIA-SMI 535.54.03  Driver Version: 535.54.03CUDA Version: 
12.2 |
|-+--+--+
| GPU  Name Persistence-M | Bus-IdDisp.A | Volatile 
Uncorr. ECC |
| Fan  Temp   Perf  Pwr:Usage/Cap | Memory-Usage | GPU-Util  
Compute M. |
| |  |  
 MIG M. |
|=+==+==|
|   0  Tesla V100-SXM2-16GB   On  | 0004:04:00.0 Off |  
  0 |
| N/A   34CP0  63W / 300W |   2488MiB / 16384MiB |  0%  
Default |
| |  |  
N/A |
+-+--+--+
|   1  Tesla V100-SXM2-16GB   On  | 0004:05:00.0 Off |  
  0 |
| N/A   38CP0  56W / 300W |638MiB / 16384MiB |  0%  
Default |
| |  |  
N/A |
+-+--+--+
|   2  Tesla V100-SXM2-16GB   On  | 0035:03:00.0 Off |  
  0 |
| N/A   35CP0  52W / 300W |638MiB / 16384MiB |  0%  
Default |
| |  |  
N/A |
+-+--+--+
|   3  Tesla V100-SXM2-16GB   On  | 0035:04:00.0 Off |  
  0 |
| N/A   38CP0  53W / 300W |638MiB / 16384MiB |  0%  
Default |
| |  |  
N/A |
+-+--+--+

+---+
| Processes:
|
|  GPU   GI   CIPID   Type   Process name
GPU Memory |
|ID   ID 
Usage  |
|===|
|0   N/A  N/A214626  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
|0   N/A  N/A214627  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 308MiB |
|0   N/A  N/A214628  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 308MiB |
|0   N/A  N/A214629  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 308MiB |
|0   N/A  N/A214630  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
|0   N/A  N/A214631  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 308MiB |
|0   N/A  N/A214632  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 308MiB |
|0   N/A  N/A214633  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 308MiB |
|1   N/A  N/A214627  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
|1   N/A  N/A214631  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
|2   N/A  N/A214628  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
|2   N/A  N/A214632  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
|3   N/A  N/A214629

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-21 Thread Junchao Zhang
Hi, Macros,
  If you look at the PIDs of the nvidia-smi output, you will only find 8
unique PIDs, which is expected since you allocated 8 MPI ranks per node.
  The duplicate PIDs are usually for threads spawned by the MPI runtime
(for example, progress threads in MPI implementation).   So your job script
and output are all good.

  Thanks.

On Mon, Aug 21, 2023 at 2:00 PM Vanella, Marcos (Fed) <
marcos.vane...@nist.gov> wrote:

> Hi Junchao, something I'm noting related to running with cuda enabled
> linear solvers (CG+HYPRE, CG+GAMG) is that for multi cpu-multi gpu
> calculations, the GPU 0 in the node is taking what seems to be all
> sub-matrices corresponding to all the MPI processes in the node. This is
> the result of the nvidia-smi command on a node with 8 MPI processes (each
> advancing the same number of unknowns in the calculation) and 4 GPU V100s:
>
> Mon Aug 21 14:36:07 2023
>
> +---+
> | NVIDIA-SMI 535.54.03  Driver Version: 535.54.03CUDA
> Version: 12.2 |
>
> |-+--+--+
> | GPU  Name Persistence-M | Bus-IdDisp.A |
> Volatile Uncorr. ECC |
> | Fan  Temp   Perf  Pwr:Usage/Cap | Memory-Usage |
> GPU-Util  Compute M. |
> | |  |
>   MIG M. |
>
> |=+==+==|
> |   0  Tesla V100-SXM2-16GB   On  | 0004:04:00.0 Off |
>0 |
> | N/A   34CP0  63W / 300W |   2488MiB / 16384MiB |  0%
>  Default |
> | |  |
>  N/A |
>
> +-+--+--+
> |   1  Tesla V100-SXM2-16GB   On  | 0004:05:00.0 Off |
>0 |
> | N/A   38CP0  56W / 300W |638MiB / 16384MiB |  0%
>  Default |
> | |  |
>  N/A |
>
> +-+--+--+
> |   2  Tesla V100-SXM2-16GB   On  | 0035:03:00.0 Off |
>0 |
> | N/A   35CP0  52W / 300W |638MiB / 16384MiB |  0%
>  Default |
> | |  |
>  N/A |
>
> +-+--+--+
> |   3  Tesla V100-SXM2-16GB   On  | 0035:04:00.0 Off |
>0 |
> | N/A   38CP0  53W / 300W |638MiB / 16384MiB |  0%
>  Default |
> | |  |
>  N/A |
>
> +-+--+--+
>
>
>
> +---+
> | Processes:
>  |
> |  GPU   GI   CIPID   Type   Process name
>GPU Memory |
> |ID   ID
>   Usage  |
>
> |===|
> |0   N/A  N/A214626  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  318MiB |
> |0   N/A  N/A214627  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  308MiB |
> |0   N/A  N/A214628  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  308MiB |
> |0   N/A  N/A214629  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  308MiB |
> |0   N/A  N/A214630  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  318MiB |
> |0   N/A  N/A214631  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  308MiB |
> |0   N/A  N/A214632  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  308MiB |
> |0   N/A  N/A214633  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  308MiB |
> |1   N/A  N/A214627  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  318MiB |
> |1   N/A  N/A214631  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  318MiB |
> |2   N/A  N/A214628  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  318MiB |
> |2   N/A  N/A214632  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  318MiB |
> |3   N/A  N/A214629  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  318MiB |
> |3   N/A  N/A214633  C
> ...d/ompi_gnu_linux/fds_ompi_gnu_linux  318MiB |
>
> +---+
>
>
> You can see that GPU 0 is connected to all 8 MPI Processes, each taking
> about 300MB on it, whereas GPUs 1,2 and 3 are working with 2 MPI Processes.
> I'm wondering if this is expected or there are some changes I need to do on
> my submission script/runtime parameters.
> This is the script 

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-21 Thread Vanella, Marcos (Fed) via petsc-users
Hi Junchao, something I'm noting related to running with cuda enabled linear 
solvers (CG+HYPRE, CG+GAMG) is that for multi cpu-multi gpu calculations, the 
GPU 0 in the node is taking what seems to be all sub-matrices corresponding to 
all the MPI processes in the node. This is the result of the nvidia-smi command 
on a node with 8 MPI processes (each advancing the same number of unknowns in 
the calculation) and 4 GPU V100s:

Mon Aug 21 14:36:07 2023
+---+
| NVIDIA-SMI 535.54.03  Driver Version: 535.54.03CUDA Version: 
12.2 |
|-+--+--+
| GPU  Name Persistence-M | Bus-IdDisp.A | Volatile 
Uncorr. ECC |
| Fan  Temp   Perf  Pwr:Usage/Cap | Memory-Usage | GPU-Util  
Compute M. |
| |  |  
 MIG M. |
|=+==+==|
|   0  Tesla V100-SXM2-16GB   On  | 0004:04:00.0 Off |  
  0 |
| N/A   34CP0  63W / 300W |   2488MiB / 16384MiB |  0%  
Default |
| |  |  
N/A |
+-+--+--+
|   1  Tesla V100-SXM2-16GB   On  | 0004:05:00.0 Off |  
  0 |
| N/A   38CP0  56W / 300W |638MiB / 16384MiB |  0%  
Default |
| |  |  
N/A |
+-+--+--+
|   2  Tesla V100-SXM2-16GB   On  | 0035:03:00.0 Off |  
  0 |
| N/A   35CP0  52W / 300W |638MiB / 16384MiB |  0%  
Default |
| |  |  
N/A |
+-+--+--+
|   3  Tesla V100-SXM2-16GB   On  | 0035:04:00.0 Off |  
  0 |
| N/A   38CP0  53W / 300W |638MiB / 16384MiB |  0%  
Default |
| |  |  
N/A |
+-+--+--+

+---+
| Processes:
|
|  GPU   GI   CIPID   Type   Process name
GPU Memory |
|ID   ID 
Usage  |
|===|
|0   N/A  N/A214626  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
|0   N/A  N/A214627  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 308MiB |
|0   N/A  N/A214628  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 308MiB |
|0   N/A  N/A214629  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 308MiB |
|0   N/A  N/A214630  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
|0   N/A  N/A214631  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 308MiB |
|0   N/A  N/A214632  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 308MiB |
|0   N/A  N/A214633  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 308MiB |
|1   N/A  N/A214627  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
|1   N/A  N/A214631  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
|2   N/A  N/A214628  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
|2   N/A  N/A214632  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
|3   N/A  N/A214629  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
|3   N/A  N/A214633  C   ...d/ompi_gnu_linux/fds_ompi_gnu_linux 
 318MiB |
+---+


You can see that GPU 0 is connected to all 8 MPI processes, each taking about 
300MB on it, whereas GPUs 1, 2 and 3 are working with 2 MPI processes each. I'm 
wondering whether this is expected or there are some changes I need to make to my 
submission script/runtime parameters.
This is the script in this case (2 nodes, 8 MPI processes/node, 4 GPU/node):

#!/bin/bash
# ../../Utilities/Scripts/qfds.sh -p 2  -T db -d test.fds
#SBATCH -J test
#SBATCH -e /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.err
#SBATCH -o /home/mnv/Firemodels_fork/fds/Issues/PETSc/test.log
#SBATCH --partition=gpu
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-14 Thread Junchao Zhang
I don't see a problem in the matrix assembly.
If you point me to your repo and show me how to build it, I can try to
reproduce.

--Junchao Zhang


On Mon, Aug 14, 2023 at 2:53 PM Vanella, Marcos (Fed) <
marcos.vane...@nist.gov> wrote:

> Hi Junchao, I've tried for my case using the -ksp_type gmres and -pc_type
> asm with -mat_type aijcusparse -sub_pc_factor_mat_solver_type cusparse as
> (I understand) is done in the ex60. The error is always the same, so it
> seems it is not related to ksp,pc. Indeed it seems to happen when trying to
> offload the Matrix to the GPU:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what():  parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
>   what():  parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> #0  0x2000397fcd8f in ???
> ...
> #8  0x20003935fc6b in ???
> #9  0x11ec769b in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10  0x11ec769b in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11  0x11efd6a3 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> #9  0x11ec769b in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10  0x11ec769b in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11  0x11efd6a3 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #12  0x11efd6a3 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #12  0x11efd6a3 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> #13  0x11efd6a3 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> #14  0x11efd6a3 in
> _ZN6thrust6detail18contiguous_storageIiNS_16device_allocatorIiEEE19default_construct_nENS0_15normal_iteratorINS_10device_ptrIim
> at /usr/local/cuda-11.7/include/thrust/detail/contiguous_storage.inl:254
> #15  0x11efd6a3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> #13  0x11efd6a3 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> #14  0x11efd6a3 in
> _ZN6thrust6detail18contiguous_storageIiNS_16device_allocatorIiEEE19default_construct_nENS0_15normal_iteratorINS_10device_ptrIim
> at /usr/local/cuda-11.7/include/thrust/detail/contiguous_storage.inl:254
> #15  0x11efd6a3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:220
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:220
> #16  0x11efd6a3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:213
> #17  0x11efd6a3 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEEC2Em
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:65
> #18  0x11edb287 in
> _ZN6thrust13device_vectorIiNS_16device_allocatorIiEEEC4Em
> at /usr/local/cuda-11.7/include/thrust/device_vector.h:88
> #19  0x11edb287 in MatSeqAIJCUSPARSECopyToGPU
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu:2488
> #20  0x11edfd1b in MatSeqAIJCUSPARSEGetIJ
> ...
> ...
>
> This is the piece of fortran code I have doing 

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-14 Thread Junchao Zhang
Yeah, it looks like ex60 was run correctly.
Double check your code again and if you still run into errors, we can try
to reproduce on our end.

Thanks.
--Junchao Zhang


On Mon, Aug 14, 2023 at 1:05 PM Vanella, Marcos (Fed) <
marcos.vane...@nist.gov> wrote:

> Hi Junchao, I compiled and run ex60 through slurm in our Enki system. The
> batch script for slurm submission, ex60.log and gpu stats files are
> attached.
> Nothing stands out as wrong to me but please have a look.
> I'll revisit running the original 2 MPI process + 1 GPU Poisson problem.
> Thanks!
> Marcos
> --
> From: Junchao Zhang 
> Sent: Friday, August 11, 2023 5:52 PM
> To: Vanella, Marcos (Fed) 
> Cc: PETSc users list; Satish Balay <ba...@mcs.anl.gov>
> *Subject:* Re: [petsc-users] CUDA error trying to run a job with two mpi
> processes and 1 GPU
>
> Before digging into the details, could you try to run
> src/ksp/ksp/tests/ex60.c to make sure the environment is ok.
>
> The comment at the end shows how to run it
>test:
>   requires: cuda
>   suffix: 1_cuda
>   nsize: 4
>   args: -ksp_view -mat_type aijcusparse -sub_pc_factor_mat_solver_type
> cusparse
>
> --Junchao Zhang
>
>
> On Fri, Aug 11, 2023 at 4:36 PM Vanella, Marcos (Fed) <
> marcos.vane...@nist.gov> wrote:
>
> Hi Junchao, thank you for the info. I compiled the main branch of PETSc in
> another machine that has the  openmpi/4.1.4/gcc-11.2.1-cuda-11.7 toolchain
> and don't see the fortran compilation error. It might have been related to
> gcc-9.3.
> I tried the case again, 2 CPUs and one GPU and get this error now:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what():  parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
>   what():  parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> #0  0x2000397fcd8f in ???
> #1  0x2000397fb657 in ???
> #0  0x2000397fcd8f in ???
> #1  0x2000397fb657 in ???
> #2  0x200604d7 in ???
> #2  0x200604d7 in ???
> #3  0x200039cb9628 in ???
> #4  0x200039c93eb3 in ???
> #5  0x200039364a97 in ???
> #6  0x20003935f6d3 in ???
> #7  0x20003935f78f in ???
> #8  0x20003935fc6b in ???
> #3  0x200039cb9628 in ???
> #4  0x200039c93eb3 in ???
> #5  0x200039364a97 in ???
> #6  0x20003935f6d3 in ???
> #7  0x20003935f78f in ???
> #8  0x20003935fc6b in ???
> #9  0x11ec425b in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10  0x11ec425b in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> #9  0x11ec425b in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10  0x11ec425b in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11  0x11efa263 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11  0x11efa263 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #12  0x11efa263 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> #13  0x11efa263 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #12  0x11efa263 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> #13  0x11efa263 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> #14  

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Junchao Zhang
Before digging into the details, could you try to run
src/ksp/ksp/tests/ex60.c to make sure the environment is ok.

The comment at the end shows how to run it
   test:
  requires: cuda
  suffix: 1_cuda
  nsize: 4
  args: -ksp_view -mat_type aijcusparse -sub_pc_factor_mat_solver_type
cusparse

--Junchao Zhang


On Fri, Aug 11, 2023 at 4:36 PM Vanella, Marcos (Fed) <
marcos.vane...@nist.gov> wrote:

> Hi Junchao, thank you for the info. I compiled the main branch of PETSc in
> another machine that has the  openmpi/4.1.4/gcc-11.2.1-cuda-11.7 toolchain
> and don't see the fortran compilation error. It might have been related to
> gcc-9.3.
> I tried the case again, 2 CPUs and one GPU and get this error now:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what():  parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
>   what():  parallel_for failed: cudaErrorInvalidConfiguration: invalid
> configuration argument
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> #0  0x2000397fcd8f in ???
> #1  0x2000397fb657 in ???
> #0  0x2000397fcd8f in ???
> #1  0x2000397fb657 in ???
> #2  0x200604d7 in ???
> #2  0x200604d7 in ???
> #3  0x200039cb9628 in ???
> #4  0x200039c93eb3 in ???
> #5  0x200039364a97 in ???
> #6  0x20003935f6d3 in ???
> #7  0x20003935f78f in ???
> #8  0x20003935fc6b in ???
> #3  0x200039cb9628 in ???
> #4  0x200039c93eb3 in ???
> #5  0x200039364a97 in ???
> #6  0x20003935f6d3 in ???
> #7  0x20003935f78f in ???
> #8  0x20003935fc6b in ???
> #9  0x11ec425b in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10  0x11ec425b in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> #9  0x11ec425b in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda-11.7/include/thrust/system/cuda/detail/util.h:225
> #10  0x11ec425b in
> _ZN6thrust8cuda_cub20uninitialized_fill_nINS0_3tagENS_10device_ptrIiEEmiEET0_RNS0_16execution_policyIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11  0x11efa263 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> at
> /usr/local/cuda-11.7/include/thrust/system/cuda/detail/uninitialized_fill.h:88
> #11  0x11efa263 in
> _ZN6thrust20uninitialized_fill_nINS_8cuda_cub3tagENS_10device_ptrIiEEmiEET0_RKNS_6detail21execution_policy_baseIT_EES5_T1_RKT2_
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #12  0x11efa263 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> #13  0x11efa263 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> at /usr/local/cuda-11.7/include/thrust/detail/uninitialized_fill.inl:55
> #12  0x11efa263 in
> _ZN6thrust6detail23allocator_traits_detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEENS0_10disable_ifIXsrNS1_37needs_default_construct_via_allocatorIT_NS0_15pointer_elementIT0_E4typeEEE5valueEvE4typeERS9_SB_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:93
> #13  0x11efa263 in
> _ZN6thrust6detail23default_construct_rangeINS_16device_allocatorIiEENS_10device_ptrIiEEmEEvRT_T0_T1_
> at
> /usr/local/cuda-11.7/include/thrust/detail/allocator/default_construct_range.inl:104
> #14  0x11efa263 in
> _ZN6thrust6detail18contiguous_storageIiNS_16device_allocatorIiEEE19default_construct_nENS0_15normal_iteratorINS_10device_ptrIim
> at /usr/local/cuda-11.7/include/thrust/detail/contiguous_storage.inl:254
> #15  0x11efa263 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:220
> #14  0x11efa263 in
> _ZN6thrust6detail18contiguous_storageIiNS_16device_allocatorIiEEE19default_construct_nENS0_15normal_iteratorINS_10device_ptrIim
> at /usr/local/cuda-11.7/include/thrust/detail/contiguous_storage.inl:254
> #15  0x11efa263 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at /usr/local/cuda-11.7/include/thrust/detail/vector_base.inl:220
> #16  0x11efa263 in
> _ZN6thrust6detail11vector_baseIiNS_16device_allocatorIiEEE12default_initEm
> at 

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Junchao Zhang
Marcos,
  We do not have good petsc/gpu documentation yet, but see
https://petsc.org/main/faq/#doc-faq-gpuhowto; you can also search for
"requires: cuda" in the petsc tests to find examples that use the GPU.
  For the Fortran compile errors, attach your configure.log and Satish
(Cc'ed) or others should know how to fix them.

  Thanks.
--Junchao Zhang


On Fri, Aug 11, 2023 at 2:22 PM Vanella, Marcos (Fed) <
marcos.vane...@nist.gov> wrote:

> Hi Junchao, thanks for the explanation. Is there some development
> documentation on the GPU work? I'm interested in learning about it.
> I checked out the main branch and configured petsc. When compiling with
> gcc/gfortran I come across this error:
>
> 
>   CUDAC
> arch-linux-c-opt/obj/src/mat/impls/aij/seq/seqcusparse/aijcusparse.o
>   CUDAC.dep
> arch-linux-c-opt/obj/src/mat/impls/aij/seq/seqcusparse/aijcusparse.o
>  FC arch-linux-c-opt/obj/src/ksp/f90-mod/petsckspdefmod.o
>  FC arch-linux-c-opt/obj/src/ksp/f90-mod/petscpcmod.o
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:37:61:
>
>37 |   subroutine PCASMCreateSubdomains2D(a,b,c,d,e,f,g,h,i,z)
>   | 1
> *Error: Symbol ‘pcasmcreatesubdomains2d’ at (1) already has an explicit
> interface*
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:38:13:
>
>38 |import tIS
>   | 1
> Error: IMPORT statement at (1) only permitted in an INTERFACE body
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:39:80:
>
>39 |PetscInt a ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:40:80:
>
>40 |PetscInt b ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:41:80:
>
>41 |PetscInt c ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:42:80:
>
>42 |PetscInt d ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:43:80:
>
>43 |PetscInt e ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:44:80:
>
>44 |PetscInt f ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:45:80:
>
>45 |PetscInt g ! PetscInt
>   |
>  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:46:30:
>
>46 |IS h ! IS
>   |  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:47:30:
>
>47 |IS i ! IS
>   |  1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:48:43:
>
>48 |PetscErrorCode z
>   |   1
> Error: Unexpected data declaration statement in INTERFACE block at (1)
>
> /home/mnv/Software/petsc/include/../src/ksp/f90-mod/ftn-auto-interfaces/petscpc.h90:49:10:
>
>49 |end subroutine PCASMCreateSubdomains2D
>   |  1
> Error: Expecting END INTERFACE statement at (1)
> make[3]: *** [gmakefile:225:
> arch-linux-c-opt/obj/src/ksp/f90-mod/petscpcmod.o] Error 1
> make[3]: *** Waiting for unfinished jobs
>  CC
> arch-linux-c-opt/obj/src/tao/leastsquares/impls/pounders/pounders.o
>  CC arch-linux-c-opt/obj/src/ksp/pc/impls/bddc/bddcprivate.o
>   CUDAC
> arch-linux-c-opt/obj/src/vec/vec/impls/seq/cupm/cuda/vecseqcupm.o
>   CUDAC.dep
> arch-linux-c-opt/obj/src/vec/vec/impls/seq/cupm/cuda/vecseqcupm.o
> make[3]: Leaving directory '/home/mnv/Software/petsc'
> make[2]: *** [/home/mnv/Software/petsc/lib/petsc/conf/rules.doc:28: libs]
> Error 2
> make[2]: Leaving directory '/home/mnv/Software/petsc'
> **ERROR*
>   Error during compile, check arch-linux-c-opt/lib/petsc/conf/make.log
>   Send it and 

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Junchao Zhang
Hi, Marcos,
  I saw MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic() in the error stack.
We recently refactored the COO code and got rid of that function.  So could
you try petsc/main?
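
A minimal sketch of switching to main (adjust the configure options to match
your current build):

  git clone -b main https://gitlab.com/petsc/petsc.git
  cd petsc
  ./configure --with-cuda --with-debugging=0 <your other options>
  make all
  make check
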
  We map MPI processes to GPUs in a round-robin fashion: we query the
number of visible CUDA devices (g) and assign device rank%g to the MPI
process with that rank. In that sense, the work distribution is entirely
determined by your MPI work partition (i.e., by you).
  On clusters, this MPI-process-to-GPU binding is usually done by the job
scheduler, e.g. slurm. You need to check your cluster's user guide to see
how to bind MPI processes to GPUs. If the job scheduler has already done
that, the number of CUDA devices visible to a process may just be 1, making
petsc's own mapping a no-op.
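As a hedged sketch (the exact flags depend on your slurm version and site
configuration), you can inspect the binding each rank ends up with:

  srun -n 2 --gres=gpu:1 bash -c \
    'echo "rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'

If every rank reports the same single device, the scheduler is already doing
the mapping for you.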

   Thanks.
--Junchao Zhang


On Fri, Aug 11, 2023 at 12:43 PM Vanella, Marcos (Fed) <
marcos.vane...@nist.gov> wrote:

> Hi Junchao, thank you for replying. I compiled petsc in debug mode and
> this is what I get for the case:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an
> illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> #0  0x15264731ead0 in ???
> #1  0x15264731dc35 in ???
> #2  0x15264711551f in ???
> #3  0x152647169a7c in ???
> #4  0x152647115475 in ???
> #5  0x1526470fb7f2 in ???
> #6  0x152647678bbd in ???
> #7  0x15264768424b in ???
> #8  0x1526476842b6 in ???
> #9  0x152647684517 in ???
> #10  0x55bb46342ebb in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
> at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:224
> #11  0x55bb46342ebb in
> _ZN6thrust8cuda_cub12__merge_sort10merge_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_NS3_15normal_iteratorISB_EE9IJCompareEEvRNS0_16execution_policyIT1_EET2_SM_T3_T4_
> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1316
> #12  0x55bb46342ebb in
> _ZN6thrust8cuda_cub12__smart_sort10smart_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_16execution_policyINS0_3tagEEENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESD_NS_9null_typeESE_SE_SE_SE_SE_SE_SE_NS3_15normal_iteratorISD_EE9IJCompareEENS1_25enable_if_comparison_sortIT2_T4_E4typeERT1_SL_SL_T3_SM_
> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1544
> #13  0x55bb46342ebb in
> _ZN6thrust8cuda_cub11sort_by_keyINS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_NS_6detail15normal_iteratorIS6_EE9IJCompareEEvRNS0_16execution_policyIT_EET0_SI_T1_T2_
> at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1669
> #14  0x55bb46317bc5 in
> _ZN6thrust11sort_by_keyINS_8cuda_cub3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_NS_6detail15normal_iteratorIS6_EE9IJCompareEEvRKNSA_21execution_policy_baseIT_EET0_SJ_T1_T2_
> at /usr/local/cuda/include/thrust/detail/sort.inl:115
> #15  0x55bb46317bc5 in
> _ZN6thrust11sort_by_keyINS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES4_NS_9null_typeES5_S5_S5_S5_S5_S5_S5_NS_6detail15normal_iteratorIS4_EE9IJCompareEEvT_SC_T0_T1_
> at /usr/local/cuda/include/thrust/detail/sort.inl:305
> #16  0x55bb46317bc5 in MatSetPreallocationCOO_SeqAIJCUSPARSE_Basic
> at /home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/
> aijcusparse.cu:4452
> #17  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/
> mpiaijcusparse.cu:173
> #18  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/
> mpiaijcusparse.cu:222
> #19  0x55bb468e01cf in MatSetPreallocationCOO
> at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:606
> #20  0x55bb46b39c9b in MatProductSymbolic_MPIAIJBACKEND
> at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7547
> #21  0x55bb469015e5 in MatProductSymbolic
> at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:803
> #22  0x55bb4694ade2 in MatPtAP
> at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9897
> #23  0x55bb4696d3ec in MatCoarsenApply_MISK_private
> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
> #24  0x55bb4696eb67 in MatCoarsenApply_MISK
> at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
> #25  0x55bb4695bd91 in MatCoarsenApply
> at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
> #26  0x55bb478294d8 in PCGAMGCoarsen_AGG
> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
> #27  0x55bb471d1cb4 in PCSetUp_GAMG
> at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
> #28  0x55bb464022cf in PCSetUp
> at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:994
> #29  0x55bb4718b8a7 in KSPSetUp
> at 

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Vanella, Marcos (Fed) via petsc-users
Hi Junchao, thank you for replying. I compiled petsc in debug mode and this is 
what I get for the case:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an 
illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x15264731ead0 in ???
#1  0x15264731dc35 in ???
#2  0x15264711551f in ???
#3  0x152647169a7c in ???
#4  0x152647115475 in ???
#5  0x1526470fb7f2 in ???
#6  0x152647678bbd in ???
#7  0x15264768424b in ???
#8  0x1526476842b6 in ???
#9  0x152647684517 in ???
#10  0x55bb46342ebb in _ZN6thrust8cuda_cub14throw_on_errorE9cudaErrorPKc
  at /usr/local/cuda/include/thrust/system/cuda/detail/util.h:224
#11  0x55bb46342ebb in 
_ZN6thrust8cuda_cub12__merge_sort10merge_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESB_NS_9null_typeESC_SC_SC_SC_SC_SC_SC_NS3_15normal_iteratorISB_EE9IJCompareEEvRNS0_16execution_policyIT1_EET2_SM_T3_T4_
  at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1316
#12  0x55bb46342ebb in 
_ZN6thrust8cuda_cub12__smart_sort10smart_sortINS_6detail17integral_constantIbLb1EEENS4_IbLb0EEENS0_16execution_policyINS0_3tagEEENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEESD_NS_9null_typeESE_SE_SE_SE_SE_SE_SE_NS3_15normal_iteratorISD_EE9IJCompareEENS1_25enable_if_comparison_sortIT2_T4_E4typeERT1_SL_SL_T3_SM_
  at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1544
#13  0x55bb46342ebb in 
_ZN6thrust8cuda_cub11sort_by_keyINS0_3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_NS_6detail15normal_iteratorIS6_EE9IJCompareEEvRNS0_16execution_policyIT_EET0_SI_T1_T2_
  at /usr/local/cuda/include/thrust/system/cuda/detail/sort.h:1669
#14  0x55bb46317bc5 in 
_ZN6thrust11sort_by_keyINS_8cuda_cub3tagENS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES6_NS_9null_typeES7_S7_S7_S7_S7_S7_S7_NS_6detail15normal_iteratorIS6_EE9IJCompareEEvRKNSA_21execution_policy_baseIT_EET0_SJ_T1_T2_
  at /usr/local/cuda/include/thrust/detail/sort.inl:115
#15  0x55bb46317bc5 in 
_ZN6thrust11sort_by_keyINS_12zip_iteratorINS_5tupleINS_10device_ptrIiEES4_NS_9null_typeES5_S5_S5_S5_S5_S5_S5_NS_6detail15normal_iteratorIS4_EE9IJCompareEEvT_SC_T0_T1_
  at /usr/local/cuda/include/thrust/detail/sort.inl:305
#16  0x55bb46317bc5 in MatSetPreallocationCOO_SeqAIJCUSPARSE_Basic
  at 
/home/mnv/Software/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu:4452
#17  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic
  at 
/home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:173
#18  0x55bb46c5b27c in MatSetPreallocationCOO_MPIAIJCUSPARSE
  at 
/home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpicusparse/mpiaijcusparse.cu:222
#19  0x55bb468e01cf in MatSetPreallocationCOO
  at /home/mnv/Software/petsc/src/mat/utils/gcreate.c:606
#20  0x55bb46b39c9b in MatProductSymbolic_MPIAIJBACKEND
  at /home/mnv/Software/petsc/src/mat/impls/aij/mpi/mpiaij.c:7547
#21  0x55bb469015e5 in MatProductSymbolic
  at /home/mnv/Software/petsc/src/mat/interface/matproduct.c:803
#22  0x55bb4694ade2 in MatPtAP
  at /home/mnv/Software/petsc/src/mat/interface/matrix.c:9897
#23  0x55bb4696d3ec in MatCoarsenApply_MISK_private
  at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:283
#24  0x55bb4696eb67 in MatCoarsenApply_MISK
  at /home/mnv/Software/petsc/src/mat/coarsen/impls/misk/misk.c:368
#25  0x55bb4695bd91 in MatCoarsenApply
  at /home/mnv/Software/petsc/src/mat/coarsen/coarsen.c:97
#26  0x55bb478294d8 in PCGAMGCoarsen_AGG
  at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/agg.c:524
#27  0x55bb471d1cb4 in PCSetUp_GAMG
  at /home/mnv/Software/petsc/src/ksp/pc/impls/gamg/gamg.c:631
#28  0x55bb464022cf in PCSetUp
  at /home/mnv/Software/petsc/src/ksp/pc/interface/precon.c:994
#29  0x55bb4718b8a7 in KSPSetUp
  at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:406
#30  0x55bb4718f22e in KSPSolve_Private
  at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:824
#31  0x55bb47192c0c in KSPSolve
  at /home/mnv/Software/petsc/src/ksp/ksp/interface/itfunc.c:1070
#32  0x55bb463efd35 in kspsolve_
  at /home/mnv/Software/petsc/src/ksp/ksp/interface/ftn-auto/itfuncf.c:320
#33  0x55bb45e94b32 in ???
#34  0x55bb46048044 in ???
#35  0x55bb46052ea1 in ???
#36  0x55bb45ac5f8e in ???
#37  0x1526470fcd8f in ???
#38  0x1526470fce3f in ???
#39  0x55bb45aef55d in ???
#40  0x in ???
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Junchao Zhang
Hi, Marcos,
  Could you build petsc in debug mode and then copy and paste the whole
error stack message?
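
A sketch of a debug reconfigure (keep your other options; --with-debugging=1
is also the petsc default, so simply dropping --with-debugging=0 works too):

  ./configure --with-debugging=1 --with-cuda PETSC_ARCH=arch-linux-c-debug \
    <your other options>
  make all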

   Thanks
--Junchao Zhang


On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users <
petsc-users@mcs.anl.gov> wrote:

> Hi, I'm trying to run a parallel matrix/vector build and linear solve
> with PETSc on 2 MPI processes + one V100 GPU. I tested that the matrix
> build and solve are successful when running on CPUs only. I'm using CUDA
> 11.5, CUDA-enabled OpenMPI, and gcc 9.3. When I run the job with the GPU
> enabled I get the following error:
>
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an
> illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> Backtrace for this error:
> terminate called after throwing an instance of
> 'thrust::system::system_error'
>   what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an
> illegal memory access was encountered
>
> Program received signal SIGABRT: Process abort signal.
>
> I'm new to submitting jobs in slurm that also use GPU resources, so I
> might be doing something wrong in my submission script. This is it:
>
> #!/bin/bash
> #SBATCH -J test
> #SBATCH -e /home/Issues/PETSc/test.err
> #SBATCH -o /home/Issues/PETSc/test.log
> #SBATCH --partition=batch
> #SBATCH --ntasks=2
> #SBATCH --nodes=1
> #SBATCH --cpus-per-task=1
> #SBATCH --ntasks-per-node=2
> #SBATCH --time=01:00:00
> #SBATCH --gres=gpu:1
>
> export OMP_NUM_THREADS=1
> module load cuda/11.5
> module load openmpi/4.1.1
>
> cd /home/Issues/PETSc
> mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds
> -vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg
>
> If anyone has any suggestions on how to troubleshoot this please let me
> know.
> Thanks!
> Marcos
>
>
>
>


[petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-10 Thread Vanella, Marcos (Fed) via petsc-users
Hi, I'm trying to run a parallel matrix/vector build and linear solve with 
PETSc on 2 MPI processes + one V100 GPU. I tested that the matrix build and 
solve are successful when running on CPUs only. I'm using CUDA 11.5, 
CUDA-enabled OpenMPI, and gcc 9.3. When I run the job with the GPU enabled I 
get the following error:

terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an 
illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  merge_sort: failed to synchronize: cudaErrorIllegalAddress: an 
illegal memory access was encountered

Program received signal SIGABRT: Process abort signal.

I'm new to submitting jobs in slurm that also use GPU resources, so I might be 
doing something wrong in my submission script. This is it:

#!/bin/bash
#SBATCH -J test
#SBATCH -e /home/Issues/PETSc/test.err
#SBATCH -o /home/Issues/PETSc/test.log
#SBATCH --partition=batch
#SBATCH --ntasks=2
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=2
#SBATCH --time=01:00:00
#SBATCH --gres=gpu:1

export OMP_NUM_THREADS=1
module load cuda/11.5
module load openmpi/4.1.1

cd /home/Issues/PETSc
mpirun -n 2 /home/fds/Build/ompi_gnu_linux/fds_ompi_gnu_linux test.fds 
-vec_type mpicuda -mat_type mpiaijcusparse -pc_type gamg
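
One sanity check before the solve (a sketch; OMPI_COMM_WORLD_RANK is the rank
variable set by Open MPI's mpirun) is to confirm what each rank actually sees:

mpirun -n 2 bash -c \
  'echo "rank $OMPI_COMM_WORLD_RANK: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; nvidia-smi -L'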

If anyone has any suggestions on how to troubleshoot this please let me know.
Thanks!
Marcos