Re: [petsc-users] Status of PETScSF failures with GPU-aware MPI on Perlmutter
What modules do you have loaded? I don't know if it currently works with cuda-11.7. I assume you're following these instructions carefully: https://docs.nersc.gov/development/programming-models/mpi/cray-mpich/#cuda-aware-mpi

In our experience, GPU-aware MPI continues to be brittle on these machines. Maybe you can inquire with NERSC exactly which CUDA versions are tested with GPU-aware MPI.

Sajid Ali writes:

> Hi PETSc-developers,
>
> I had posted about crashes within PETScSF when using GPU-aware MPI on Perlmutter a while ago (https://lists.mcs.anl.gov/mailman/htdig/petsc-users/2022-February/045585.html). Now that the software stacks have stabilized, I was wondering if there is a fix, as I am still observing similar crashes.
>
> I am attaching the trace of the latest crash (with PETSc-3.20.0) for reference.
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io
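[For anyone finding this thread later: the NERSC instructions linked above boil down to roughly the following environment setup. This is a sketch only; exact module names and versions depend on the loaded programming environment, and ./app is a placeholder for your PETSc executable.]

  # Typical Perlmutter setup for GPU-aware Cray MPICH (per the NERSC docs above)
  module load cudatoolkit             # CUDA toolkit matching the PrgEnv compilers
  module load craype-accel-nvidia80   # target/link for the A100 GPUs
  export MPICH_GPU_SUPPORT_ENABLED=1  # required at run time for GPU-aware MPI

  # On the PETSc side, tell PetscSF to hand device buffers directly to MPI
  srun -n 4 ./app -use_gpu_aware_mpi 1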
Re: [petsc-users] Status of PETScSF failures with GPU-aware MPI on Perlmutter
Hi, Sajid,

Do you have a test example to reproduce the error?

--Junchao Zhang

On Thu, Nov 2, 2023 at 3:37 PM Sajid Ali wrote:

> Hi PETSc-developers,
>
> I had posted about crashes within PETScSF when using GPU-aware MPI on Perlmutter a while ago (https://lists.mcs.anl.gov/mailman/htdig/petsc-users/2022-February/045585.html). Now that the software stacks have stabilized, I was wondering if there is a fix, as I am still observing similar crashes.
>
> I am attaching the trace of the latest crash (with PETSc-3.20.0) for reference.
>
> Thank You,
> Sajid Ali (he/him) | Research Associate
> Data Science, Simulation, and Learning Division
> Fermi National Accelerator Laboratory
> s-sajid-ali.github.io
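[A minimal reproducer along the following lines could be adapted if none is at hand. This is an untested sketch: the ring-shaped SF graph, the MPIU_INT payload, and the unchecked cudaMalloc calls are illustrative assumptions, not details from the thread. It assumes PETSc configured with CUDA, at least two MPI ranks, and the run-time option -use_gpu_aware_mpi 1 so that PetscSF passes the device pointers straight to MPI.]

/* Sketch of a minimal PetscSF test with device memory (untested).
 * Each rank owns one root; its single leaf references the root on
 * rank (rank+1)%size, so the broadcast forces inter-rank traffic. */
#include <petscsf.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
  PetscSF     sf;
  PetscSFNode remote;
  PetscMPIInt rank, size;
  PetscInt   *rootdata, *leafdata; /* device pointers */
  PetscInt    val;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));
  PetscCallMPI(MPI_Comm_size(PETSC_COMM_WORLD, &size));

  /* One root and one leaf per rank; the leaf points at the next rank's root */
  remote.rank  = (rank + 1) % size;
  remote.index = 0;
  PetscCall(PetscSFCreate(PETSC_COMM_WORLD, &sf));
  PetscCall(PetscSFSetGraph(sf, 1, 1, NULL, PETSC_COPY_VALUES, &remote, PETSC_COPY_VALUES));
  PetscCall(PetscSFSetUp(sf));

  val = rank;
  cudaMalloc((void **)&rootdata, sizeof(PetscInt));
  cudaMalloc((void **)&leafdata, sizeof(PetscInt));
  cudaMemcpy(rootdata, &val, sizeof(PetscInt), cudaMemcpyHostToDevice);

  /* PetscSF detects that the buffers live in CUDA memory internally */
  PetscCall(PetscSFBcastBegin(sf, MPIU_INT, rootdata, leafdata, MPI_REPLACE));
  PetscCall(PetscSFBcastEnd(sf, MPIU_INT, rootdata, leafdata, MPI_REPLACE));

  cudaFree(rootdata);
  cudaFree(leafdata);
  PetscCall(PetscSFDestroy(&sf));
  PetscCall(PetscFinalize());
  return 0;
}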
[petsc-users] Status of PETScSF failures with GPU-aware MPI on Perlmutter
Hi PETSc-developers,

I had posted about crashes within PETScSF when using GPU-aware MPI on Perlmutter a while ago (https://lists.mcs.anl.gov/mailman/htdig/petsc-users/2022-February/045585.html). Now that the software stacks have stabilized, I was wondering if there is a fix, as I am still observing similar crashes.

I am attaching the trace of the latest crash (with PETSc-3.20.0) for reference.

Thank You,
Sajid Ali (he/him) | Research Associate
Data Science, Simulation, and Learning Division
Fermi National Accelerator Laboratory
s-sajid-ali.github.io

[Attachment: 2_gpu_crash (binary data)]