Valgrind was not useful, just an MPI abort message with output from 128 processes.
Can we merge my MR, and then I can test your branch?
On Wed, Jan 26, 2022 at 2:51 PM Barry Smith wrote:
I have added a mini-MR to print out the key so we can see if it is 0 or some
crazy number. https://gitlab.com/petsc/petsc/-/merge_requests/4766
Note that the table data structure is not sent through MPI, so if MPI is the
culprit it is not just that MPI is putting incorrect (or no)
On Wed, Jan 26, 2022 at 2:32 PM Justin Chang wrote:
> rocgdb requires "-ggdb" in addition to "-g"
Ah, OK.
> What happens if you lower AMD_LOG_LEVEL to something like 1 or 2? I was
> hoping AMD_LOG_LEVEL could at least give you something like a "stacktrace"
> showing what the last successful HIP/HSA call was. I believe it should also
> show line numbers in the code.
> Are the crashes reproducible in the same place with identical runs?
I have not seen my reproducer work; it fails in MatAssemblyEnd with not
finding a table entry. I can't tell if it is the same error every time.
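A hedged sketch of how the rocgdb suggestion might look in practice (the flag placement and the binary name are illustrative, not from the thread):

```shell
# Build with full debug info for rocgdb; -ggdb emits richer DWARF than
# plain -g. (Where these flags go depends on your build system.)
export CFLAGS="-g -ggdb"
export HIPFLAGS="-g -ggdb"
# Run a single rank under rocgdb to catch the crash site:
rocgdb --args ./reproducer -dm_refine 5
```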
On Wed, Jan 26, 2022 at 1:54 PM Justin Chang wrote:
> Couple suggestions:
>
> 1. Set the environment variable "export AMD_LOG_LEVEL=3" <- this will tell
> you everything that's happening at the HIP level (memcpy's, mallocs, kernel
> execution time, etc)
>
Hmm, my reproducer uses 2 nodes and
On Wed, Jan 26, 2022 at 2:25 PM Mark Adams wrote:
I have used valgrind here. I did not run it on this MPI error. I will.
Couple suggestions:
1. Set the environment variable "export AMD_LOG_LEVEL=3" <- this will tell
you everything that's happening at the HIP level (memcpy's, mallocs, kernel
execution time, etc)
2. Try rocgdb, AFAIK this is the closest "HIP variant of valgrind" that we
officially support. There are
On Wed, Jan 26, 2022 at 10:56 AM Barry Smith wrote:
Any way to run with valgrind (or a HIP variant of valgrind)? It looks like a
memory corruption issue, and tracking down exactly when the corruption begins
is 3/4 of the way to finding the exact cause.
Are the crashes reproducible in the same place with identical runs?
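Assuming an ordinary mpirun launch, the two suggestions might be combined like this (binary name, rank count, and log file names are illustrative):

```shell
# HIP-level trace: AMD_LOG_LEVEL routes runtime logging to stderr, so
# redirect it per run. Level 3 logs API calls; 1-2 are less verbose.
export AMD_LOG_LEVEL=3
mpirun -n 2 ./reproducer 2> hip_trace.log

# Host-side memory checking with valgrind (device memory is not covered);
# --log-file=%p gives one log per MPI rank, keyed by pid.
mpirun -n 2 valgrind --track-origins=yes --log-file=vg.%p.log ./reproducer
```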
I think it is an MPI bug; it works with GPU-aware MPI turned off.
I am sure Summit will be fine.
We have had users fix this error by switching their MPI.
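For reference, that toggle can be flipped at runtime; a sketch, where the application name and rank count are illustrative (the option itself is PETSc's):

```shell
# Make PETSc stage GPU buffers through the host instead of handing device
# pointers to MPI, isolating the MPI library's GPU-aware path:
mpirun -n 16 ./reproducer -use_gpu_aware_mpi 0
```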
On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang wrote:
I don't know if this is due to bugs in the petsc/kokkos backend. See if you
can run on 6 nodes (48 MPI ranks). If it fails, then run the same problem on
Summit with 8 nodes to see if it still fails. If yes, it is likely a bug of
our own.
--Junchao Zhang
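The cross-machine check above might look like the following, assuming Slurm on the failing machine and jsrun on Summit (executable name and resource flags are illustrative):

```shell
# Same problem, same rank count, two different MPI stacks:
srun -N 6 -n 48 ./reproducer              # failing machine (Slurm)
jsrun -n 48 -a 1 -g 1 ./reproducer        # Summit (LSF/jsrun)
```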
On Wed, Jan 26, 2022 at 8:44 AM Mark Adams wrote:
I am not able to reproduce this with a small problem; with 2 nodes or less
refinement it works. This is from the 8-node test, the -dm_refine 5 version.
I see that it comes from PtAP.
This is on the fine grid. (I was thinking it could be on a reduced grid
with idle processors, but no.)
The GPU-aware MPI is dying going from 1 to 8 nodes, 8 processes per node.
I will make a minimal reproducer, starting with 2 nodes, one process on each
node.
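A sketch of that bisection by scale (launcher syntax and binary name are illustrative):

```shell
# Start from the smallest multi-node case and grow until it breaks:
srun -N 2 --ntasks-per-node=1 ./reproducer   # smallest multi-node run
srun -N 8 --ntasks-per-node=8 ./reproducer   # the failing configuration
```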
On Tue, Jan 25, 2022 at 10:19 PM Barry Smith wrote:
>
> So the MPI is killing you in going from 8 to 64. (The GPU flop rate
> scales almost