Here's some more info on the problem I've been struggling with; my
apologies for the lengthy posts, but I'm a little desperate here :-) I was able to reduce the size of the experiment that reproduces the problem, both in terms of input data size and the number of slots in the cluster. The cluster now consists of 6 slots (5 clients), with two of the clients running on the same node as the server and three others on another node. This allowed me to follow Brian's advice and run the server and all the clients under gdb and make sure none of the processes terminates (normally or abnormally) when the server reports the "readv failed" errors; this is indeed the case. I then followed Jeff's advice and added a debug loop just prior to the server calling MPI_Waitany(), identifying the entries in the requests array which are not MPI_REQUEST_NULL, and then tracing back these requests. What I found was the following: At some point during the run, the server calls MPI_Waitany() on an array of requests consisting of 96 elements, and gets stuck in it forever; the only thing that happens at some point thereafter is that the server reports a couple of "readv failed" errors: [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110 [host1][0,1,0][btl_tcp_frag.c:202:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed with errno=110 According to my debug prints, just before that last call to MPI_Waitany() the array requests[] contains 38 entries which are not MPI_REQUEST_NULL. Half of these entries correspond to calls to Isend(), half to Irecv(). Specifically, for example, entries 4,14,24,34,44,54,64,74,84,94 are used for Isend()'s from server to client #3 (of 5), and entries 5,15,...,95 are used for Irecv() for the same client. I traced back what's going on, for instance, with requests[4]. As I mentioned, it corresponds to a call to MPI_Isend() initiated by the server to client #3 (of 5). By the time the server gets stuck in Waitany(), this client has already correctly processed the first Isend() from master in requests[4], returned its response in requests[5], and the server received this response properly. After receiving this response, the server Isend()'s the next task to this client in requests[4], and this is correctly reflected in "requests[4] != MPI_REQUESTS_NULL" just before the last call to Waitany(), but for some reason this send doesn't seem to go any further. Looking at all other requests[] corresponding to Isend()'s initiated by the server to the same client (14,24,...,94), they're all also not MPI_REQUEST_NULL, and are not going any further either. One thing that might be important is that the messages the server is sending to the clients in my experiment are quite large, ranging from hundreds of Kbytes to several Mbytes, the largest being around 9 Mbytes. The largest messages take place at the beginning of the run and are processed correctly though. Also, I ran the same experiment on another cluster that uses slightly different hardware and network infrastructure, and could not reproduce the problem. Hope at least some of the above makes some sense. Any additional advice would be greatly appreciated! Many thanks, Daniel Daniel Rozenbaum wrote: I'm now running the same experiment under valgrind. It's probably going to run for a few days, but interestingly what I'm seeing now is that while running under valgrind's memcheck, the app has been reporting much more of these "recv failed" errors, and not only on the server node: |