Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-24 Thread Barry Smith
> On Aug 24, 2023, at 2:00 PM, Vanella, Marcos (Fed) wrote: > Thank you Barry, I will dial back the MPI_F08 use in our source code and try > compiling it. I haven't found much information regarding using MPI and > MPI_F08 in different modules other than the following link from several

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-24 Thread Barry Smith
PETSc uses the non-MPI_F08 Fortran modules, so I am guessing that when you also use the MPI_F08 modules the compiler sees two sets of interfaces for the same functions, hence the error. I am not sure if it is portable to use PETSc with the F08 Fortran modules in the same program or routine.
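For anyone hitting the same clash, a minimal sketch of the workaround being discussed, assuming a routine that currently has use mpi_f08 next to the PETSc modules (the subroutine and variable names here are illustrative, not from the thread):

    subroutine demo_rank()
    #include <petsc/finclude/petscsys.h>
      use petscsys   ! PETSc's Fortran bindings are built on the non-F08 'mpi' module
      use mpi        ! match PETSc; replacing 'use mpi_f08' avoids the duplicate interfaces
      implicit none
      PetscMPIInt :: rank, ierr
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)  ! non-F08 call: plain integer args
    end subroutine demo_rank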

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-24 Thread Vanella, Marcos (Fed) via petsc-users
Thank you Matt and Junchao. I've been testing further with nvhpc on Summit. You might have an idea of what is going on here. These are my modules: Currently Loaded Modules: 1) lsf-tools/2.0 3) darshan-runtime/3.4.0-lite 5) DefApps 7) spectrum-mpi/10.4.0.3-20210112 9)

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-22 Thread Junchao Zhang
Marcos, yes, refer to the example script Matt mentioned for Summit. Feel free to turn options on and off in the file. In my experience, gcc is easier to use. Also, I found https://docs.alcf.anl.gov/polaris/running-jobs/#binding-mpi-ranks-to-gpus, which might be similar to your machine (4 GPUs
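As a rough sketch of the binding trick that ALCF page describes, adapted to slurm (a hypothetical wrapper; it assumes SLURM_LOCALID is exported and 4 GPUs per node):

    #!/bin/bash
    # set_gpu.sh (hypothetical): pin each node-local MPI rank to one GPU
    gpus_per_node=4
    export CUDA_VISIBLE_DEVICES=$(( SLURM_LOCALID % gpus_per_node ))
    exec "$@"

It would be launched as, e.g., srun -n 8 ./set_gpu.sh ./my_app, so each rank sees exactly one device.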

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-22 Thread Matthew Knepley
On Tue, Aug 22, 2023 at 2:54 PM Vanella, Marcos (Fed) via petsc-users < petsc-users@mcs.anl.gov> wrote: > Hi Junchao, both the slurm scontrol show job_id -dd and looking at > CUDA_VISIBLE_DEVICES do not provide information about which MPI process > is associated with which GPU in the node in our

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-22 Thread Vanella, Marcos (Fed) via petsc-users
Hi Junchao, both the slurm scontrol show job_id -dd and looking at CUDA_VISIBLE_DEVICES do not provide information about which MPI process is associated with which GPU in the node on our system. I can see this with nvidia-smi, but if you have any other suggestion using slurm I would like to

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-21 Thread Junchao Zhang
That is a good question. Looking at https://slurm.schedmd.com/gres.html#GPU_Management, I was wondering if you can share the output of your job so we can search CUDA_VISIBLE_DEVICES and see how GPUs were allocated. --Junchao Zhang On Mon, Aug 21, 2023 at 2:38 PM Vanella, Marcos (Fed) <
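One hedged way to get that into the job output, assuming srun and the standard SLURM_PROCID and CUDA_VISIBLE_DEVICES variables:

    srun -n 8 bash -c 'echo "rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'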

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-21 Thread Vanella, Marcos (Fed) via petsc-users
Ok, thanks Junchao. So is GPU 0 actually allocating memory for the meshes of all 8 MPI processes but only working on 2 of them? The script output says it has allocated 2.4 GB. Best, Marcos From: Junchao Zhang Sent: Monday, August 21, 2023 3:29 PM To: Vanella, Marcos (Fed)

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-21 Thread Junchao Zhang
Hi, Marcos, If you look at the PIDs in the nvidia-smi output, you will only find 8 unique PIDs, which is expected since you allocated 8 MPI ranks per node. The duplicate PIDs usually belong to threads spawned by the MPI runtime (for example, progress threads in the MPI implementation). So your job
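A quick way to check that unique-PID count, using nvidia-smi's standard compute-apps query (the expected count of 8 comes from the 8 ranks per node above):

    # list compute processes per GPU with their memory use
    nvidia-smi --query-compute-apps=pid,used_memory --format=csv
    # count distinct PIDs; this should match the 8 MPI ranks per node
    nvidia-smi --query-compute-apps=pid --format=csv,noheader | sort -u | wc -l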

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-21 Thread Vanella, Marcos (Fed) via petsc-users
Hi Junchao, something I'm noticing when running with the CUDA-enabled linear solvers (CG+HYPRE, CG+GAMG) is that for multi-CPU, multi-GPU calculations, GPU 0 in the node is taking what seems to be all the sub-matrices corresponding to all the MPI processes in the node. This is the result of the

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-14 Thread Junchao Zhang
I don't see a problem in the matrix assembly. If you point me to your repo and show me how to build it, I can try to reproduce. --Junchao Zhang On Mon, Aug 14, 2023 at 2:53 PM Vanella, Marcos (Fed) < marcos.vane...@nist.gov> wrote: > Hi Junchao, I've tried for my case using the -ksp_type gmres

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-14 Thread Junchao Zhang
Yeah, it looks like ex60 ran correctly. Double-check your code again, and if you still run into errors, we can try to reproduce them on our end. Thanks. --Junchao Zhang On Mon, Aug 14, 2023 at 1:05 PM Vanella, Marcos (Fed) < marcos.vane...@nist.gov> wrote: > Hi Junchao, I compiled and ran ex60

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Junchao Zhang
Before digging into the details, could you try to run src/ksp/ksp/tests/ex60.c to make sure the environment is OK? The comment at the end shows how to run it:

    test:
      requires: cuda
      suffix: 1_cuda
      nsize: 4
      args: -ksp_view -mat_type aijcusparse
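Read as a command line, that test header corresponds to roughly the following (a sketch; the build step and launcher depend on the local PETSc install):

    cd $PETSC_DIR/src/ksp/ksp/tests
    make ex60
    mpiexec -n 4 ./ex60 -ksp_view -mat_type aijcusparse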

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Junchao Zhang
Marcos, We do not have good petsc/gpu documentation, but see https://petsc.org/main/faq/#doc-faq-gpuhowto, and also search "requires: cuda" in the petsc tests and you will find examples using GPUs. For the Fortran compile errors, attach your configure.log and Satish (Cc'ed) or others should know

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Junchao Zhang
Hi, Marcos, I saw MatSetPreallocationCOO_MPIAIJCUSPARSE_Basic() in the error stack. We recently refactored the COO code and got rid of that function, so could you try petsc/main? We map MPI processes to GPUs in a round-robin fashion: we query the number of visible CUDA devices (g), and assign
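The truncated description reads like the usual rank-mod-device rule; a sketch of that arithmetic in CUDA Fortran (cudafor from nvhpc; illustrative only, not PETSc's actual source):

    use cudafor
    integer :: lrank, ngpus, ierr            ! lrank: node-local MPI rank (assumed set)
    ierr = cudaGetDeviceCount(ngpus)         ! g = number of visible CUDA devices
    ierr = cudaSetDevice(mod(lrank, ngpus))  ! round-robin: rank r -> device mod(r, g)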

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Vanella, Marcos (Fed) via petsc-users
Hi Junchao, thank you for replying. I compiled petsc in debug mode and this is what I get for the case:

    terminate called after throwing an instance of 'thrust::system::system_error'
      what(): merge_sort: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Re: [petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-11 Thread Junchao Zhang
Hi, Marcos, Could you build petsc in debug mode and then copy and paste the whole error stack message? Thanks --Junchao Zhang On Thu, Aug 10, 2023 at 5:51 PM Vanella, Marcos (Fed) via petsc-users < petsc-users@mcs.anl.gov> wrote: > Hi, I'm trying to run a parallel matrix vector build and
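For reference, switching to a debug build is a reconfigure with the standard option (the CUDA flag and the elided options are assumptions standing in for the original configure line):

    ./configure --with-debugging=1 --with-cuda=1 [...original options...]
    make all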

[petsc-users] CUDA error trying to run a job with two mpi processes and 1 GPU

2023-08-10 Thread Vanella, Marcos (Fed) via petsc-users
Hi, I'm trying to run a parallel matrix-vector build and linear solve with PETSc on 2 MPI processes plus one V100 GPU. I have tested that the matrix build and solve succeed on CPUs only. I'm using CUDA 11.5, CUDA-enabled OpenMPI, and gcc 9.3. When I run the job with the GPU enabled I get the
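A hedged sketch of the kind of launch being described, using standard PETSc GPU options (the binary name and solver choices are assumptions, not taken from the report):

    mpirun -n 2 ./my_app -vec_type cuda -mat_type aijcusparse -ksp_type cg -pc_type gamg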