Noam,

The OB1 PML provides a mechanism to dump all pending communications on a
particular communicator. To do this I usually call mca_pml_ob1_dump(comm,
1), with comm being the MPI_Comm and 1 being the verbose mode. I don't
know how you can find the pointer to the communicator from your code, but
if you compile OMPI in debug mode you will see it as an argument to the
mca_pml_ob1_send and mca_pml_ob1_recv functions.
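
As a rough illustration, the sketch below shows how the call could be made
from application code. The prototype is an assumption on my side, since the
symbol is internal to the OB1 PML and not declared in mpi.h, and it relies
on Open MPI's MPI_Comm being a pointer to the internal communicator
structure:

    #include <mpi.h>

    /* assumed prototype of the internal OB1 debug helper (not in mpi.h) */
    extern int mca_pml_ob1_dump(MPI_Comm comm, int verbose);

    void dump_pending(MPI_Comm comm)
    {
        /* verbose = 1 prints every pending send/recv queued on comm */
        mca_pml_ob1_dump(comm, 1);
    }

The same call can also be issued from a gdb session attached to a hung
rank, e.g. "call mca_pml_ob1_dump(comm, 1)" once the comm pointer is known.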

This information will give us a better idea of what happened to the
message, where it has been sent (or not), and what source and tag were
used for the matching.

  George.



On Thu, Apr 5, 2018 at 12:01 PM, Edgar Gabriel <egabr...@central.uh.edu>
wrote:

> Is the file I/O that you mentioned done using MPI I/O? If yes, what file
> system are you writing to?
>
> Edgar
>
>
>
> On 4/5/2018 10:15 AM, Noam Bernstein wrote:
>
>> On Apr 5, 2018, at 11:03 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>
>>> Hi,
>>>
>>> On 05.04.2018 at 16:16, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:
>>>>
>>>> Hi all - I have a code that uses MPI (vasp), and it’s hanging in a
>>>> strange way.  Basically, there’s a Cartesian communicator, 4x16 (64
>>>> processes total), and despite the fact that the communication pattern is
>>>> rather regular, one particular send/recv pair hangs consistently.
>>>> Basically, across each row of 4, task 0 receives from 1,2,3, and tasks
>>>> 1,2,3 send to 0.  On most of the 16 such sets all those send/recv pairs
>>>> complete.  However, on 2 of them, it hangs (both the send and recv).  I
>>>> have stack traces (with gdb -p on the running processes) from what I
>>>> believe are corresponding send/recv pairs.
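
For reference, the pattern described above would look roughly like the
sketch below, assuming plain blocking MPI_Send/MPI_Recv within each row of
4 of the 4x16 Cartesian grid; the tags, counts, datatypes, and choice of
blocking calls are assumptions, not taken from the actual code:

    #include <mpi.h>

    /* Sketch only: 64 ranks on a 4x16 Cartesian grid.  Within each row of
     * 4, ranks 1,2,3 send to rank 0, which receives from each of them.
     * On the row root, buf must hold 4*count doubles. */
    void row_gather(MPI_Comm cart, double *buf, int count)
    {
        MPI_Comm row;
        int row_rank;
        int remain[2] = { 1, 0 };         /* keep the dimension of size 4 */

        MPI_Cart_sub(cart, remain, &row); /* 16 row communicators of size 4 */
        MPI_Comm_rank(row, &row_rank);

        if (row_rank == 0) {
            for (int src = 1; src < 4; src++)        /* receive from 1,2,3 */
                MPI_Recv(buf + src * count, count, MPI_DOUBLE, src, 0, row,
                         MPI_STATUS_IGNORE);
        } else {
            MPI_Send(buf, count, MPI_DOUBLE, 0, 0, row); /* send to row 0 */
        }
        MPI_Comm_free(&row);
    }
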
>>>>
>>>> <snip>
>>>>
>>>> This is with Open MPI 3.0.1 (same for 3.0.0, haven’t checked older
>>>> versions) and Intel compilers (17.2.174). It seems to be independent of
>>>> which nodes are used, always happens on this pair of calls, and only
>>>> after the code has been running for a while; the same code works fine
>>>> for the other 14 sets of 4, suggesting that it’s an MPI issue rather
>>>> than an obvious bug in this code or a hardware problem.  Does anyone
>>>> have any ideas, either about possible causes or how to debug things
>>>> further?
>>>>
>>> Do you use ScaLAPACK, and which type of BLAS/LAPACK? I used Intel MKL
>>> with the Intel compilers for VASP and found that a self-compiled
>>> ScaLAPACK on top of that works fine in combination with Open MPI. Using
>>> Intel ScaLAPACK with Intel MPI also works fine. What I never got working
>>> was the combination of Intel ScaLAPACK and Open MPI: at one point one
>>> process got a message from the wrong rank, IIRC. I tried both the
>>> Intel-supplied Open MPI version of ScaLAPACK and compiling the necessary
>>> interface for Open MPI myself in $MKLROOT/interfaces/mklmpi, with
>>> identical results.
>>>
>> MKL BLAS/LAPACK, with my own self-compiled ScaLAPACK, but in this run I
>> set LSCALAPACK=.FALSE. I suppose I could try compiling without it just to
>> test.  In any case, this happens when it’s writing out the wavefunctions,
>> which I would assume is unrelated to ScaLAPACK operations (unless they’re
>> corrupting some low-level MPI thing, I guess).
>>
>>
>>                       Noam
>>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
