Re: [petsc-users] Status of PETScSF failures with GPU-aware MPI on Perlmutter

2023-11-02 Thread Jed Brown
What modules do you have loaded? I don't know if it currently works with
cuda-11.7. I assume you're following these instructions carefully:

https://docs.nersc.gov/development/programming-models/mpi/cray-mpich/#cuda-aware-mpi

In our experience, GPU-aware MPI continues to be brittle on these machines.
Perhaps you can ask NERSC exactly which CUDA versions are tested with
GPU-aware MPI.
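
For reference, an environment setup following the NERSC page above might look like the sketch below. The module names, versions, and the accelerator target are assumptions and should be verified with `module avail` on Perlmutter.

```shell
# Sketch of a Perlmutter environment for GPU-aware Cray MPICH, per the
# NERSC docs linked above. Module names/versions are assumptions --
# verify with `module avail` on the system.
module load PrgEnv-gnu
module load cudatoolkit
module load craype-accel-nvidia80

# Required at run time for GPU-aware MPI with Cray MPICH:
export MPICH_GPU_SUPPORT_ENABLED=1

# Confirm which cray-mpich and cudatoolkit versions are actually active:
module list
```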

Sajid Ali writes:

> Hi PETSc-developers,
>
> I had posted about crashes within PETScSF when using GPU-aware MPI on
> Perlmutter a while ago (
> https://lists.mcs.anl.gov/mailman/htdig/petsc-users/2022-February/045585.html).
> Now that the software stacks have stabilized, I was wondering whether this
> has been fixed, as I am still observing similar crashes.
>
> I am attaching the trace of the latest crash (with PETSc-3.20.0) for
> reference.
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io


Re: [petsc-users] Status of PETScSF failures with GPU-aware MPI on Perlmutter

2023-11-02 Thread Junchao Zhang
Hi, Sajid,
  Do you have a test example to reproduce the error?
--Junchao Zhang
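
A reproducer along the lines Junchao is asking for might be one of the existing PetscSF tests run with GPU-aware MPI enabled. The test path below is an assumption to be checked against the PETSc source tree; `-use_gpu_aware_mpi` is a PETSc runtime option, but confirm its behavior for your version.

```shell
# Hypothetical reproducer using an existing PetscSF test; the exact test
# file is an assumption -- adjust to your PETSc source tree and version.
cd "$PETSC_DIR/src/vec/is/sf/tests"
make ex1
# Run across two ranks with GPU-aware MPI enabled:
srun -n 2 --gpus-per-task=1 ./ex1 -use_gpu_aware_mpi 1
```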


On Thu, Nov 2, 2023 at 3:37 PM Sajid Ali wrote:

> Hi PETSc-developers,
>
> I had posted about crashes within PETScSF when using GPU-aware MPI on
> Perlmutter a while ago (
> https://lists.mcs.anl.gov/mailman/htdig/petsc-users/2022-February/045585.html).
> Now that the software stacks have stabilized, I was wondering whether this
> has been fixed, as I am still observing similar crashes.
>
> I am attaching the trace of the latest crash (with PETSc-3.20.0) for
> reference.
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io
>


[petsc-users] Status of PETScSF failures with GPU-aware MPI on Perlmutter

2023-11-02 Thread Sajid Ali
Hi PETSc-developers,

I had posted about crashes within PETScSF when using GPU-aware MPI on
Perlmutter a while ago (
https://lists.mcs.anl.gov/mailman/htdig/petsc-users/2022-February/045585.html).
Now that the software stacks have stabilized, I was wondering whether this
has been fixed, as I am still observing similar crashes.

I am attaching the trace of the latest crash (with PETSc-3.20.0) for
reference.

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io


2_gpu_crash
Description: Binary data