Is the file I/O that you mentioned done using MPI I/O? If yes, what file system are you writing to?

Edgar


On 4/5/2018 10:15 AM, Noam Bernstein wrote:
On Apr 5, 2018, at 11:03 AM, Reuti <re...@staff.uni-marburg.de> wrote:

Hi,

On 05.04.2018 at 16:16, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:

Hi all - I have a code that uses MPI (VASP), and it's hanging in a strange way. 
There's a 4x16 Cartesian communicator (64 processes total), and despite the 
communication pattern being quite regular, one particular send/recv pair hangs 
consistently. Across each row of 4, task 0 receives from tasks 1, 2, and 3, and 
tasks 1, 2, and 3 send to task 0. On most of the 16 such rows all of those 
send/recv pairs complete, but on 2 of them both the send and the recv hang. I 
have stack traces (taken with gdb -p on the running processes) from what I 
believe are corresponding send/recv pairs.

<snip>
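
For reference, here is the pattern reduced to a minimal standalone sketch (my 
reconstruction from the description above, not VASP's actual code; the buffer 
size and tag are made up):

/* Run with 64 processes: a 4x16 Cartesian communicator, where within
   each row of 4 ranks, ranks 1-3 send to rank 0. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[2] = {4, 16}, periods[2] = {0, 0};
    MPI_Comm cart, row;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

    /* Split out the 16 row communicators of size 4 each. */
    int remain[2] = {1, 0};   /* keep only the first dimension */
    MPI_Cart_sub(cart, remain, &row);

    int rrank;
    MPI_Comm_rank(row, &rrank);

    double buf[1024] = {0};
    if (rrank == 0) {
        /* row rank 0 receives from ranks 1, 2, 3 in turn */
        for (int src = 1; src < 4; src++)
            MPI_Recv(buf, 1024, MPI_DOUBLE, src, 0, row, MPI_STATUS_IGNORE);
    } else {
        /* row ranks 1-3 send to rank 0 */
        MPI_Send(buf, 1024, MPI_DOUBLE, 0, 0, row);
    }

    MPI_Comm_free(&row);
    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}

If a stripped-down version along these lines also hangs after running for a 
while, that would point at the MPI layer rather than the application.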

This is with Open MPI 3.0.1 (the same happens with 3.0.0; I haven't checked 
older versions) and the Intel compilers (17.2.174). It seems to be independent 
of which nodes are used, it always happens on this pair of calls, and only 
after the code has been running for a while; the same code works fine for the 
other 14 rows of 4, which suggests an MPI issue rather than an obvious bug in 
this code or a hardware problem. Does anyone have any ideas, either about 
possible causes or about how to debug this further?

Do you use ScaLAPACK, and which BLAS/LAPACK? I used Intel MKL with the Intel 
compilers for VASP and found that adding a self-compiled ScaLAPACK works fine 
in combination with Open MPI. Using Intel's ScaLAPACK with Intel MPI also works 
fine. What I never got working was the combination of Intel's ScaLAPACK with 
Open MPI; at one point a process got a message from the wrong rank, IIRC. I 
tried both the Intel-supplied Open MPI version of ScaLAPACK and compiling the 
necessary interface myself for Open MPI in $MKLROOT/interfaces/mklmpi, with 
identical results.

MKL BLAS/LAPACK, with my own self-compiled ScaLAPACK, but in this run I set 
LSCALAPACK=.FALSE. I suppose I could try compiling without it just to test. In 
any case, the hang happens while the code is writing out the wavefunctions, 
which I would assume is unrelated to ScaLAPACK operations (unless they're 
corrupting some low-level MPI state, I guess).

Noam
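
For context on Edgar's question at the top: "using MPI I/O" would mean the 
wavefunction write goes through MPI's file interface rather than ordinary 
Fortran I/O from a single rank. A minimal sketch of such a collective write 
(file name and data layout here are hypothetical, not what VASP actually does):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double block[1024] = {0};   /* this rank's slice of the data */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "wavefunctions.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset; the _at_all variant is
       collective, so every rank in the communicator must reach it
       or the call can hang. */
    MPI_Offset offset = (MPI_Offset)rank * sizeof(block);
    MPI_File_write_at_all(fh, offset, block, 1024, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

The file system matters here because collective MPI I/O behaves quite 
differently on, say, NFS than on a parallel file system such as Lustre or 
GPFS, which is presumably why Edgar asks.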
