I'm running OpenMPI 2.1.0 on RHEL 7 using TCP communication.  For the
specific run that's crashing on me, I'm running with 17 ranks (on 17
different physical machines).  I've got a stage in my application where
ranks need to transfer chunks of data where the size of each chunk is
trivial (on the order of 100 MB) compared to the overall imagery.  However,
the chunks are spread out across many buffers (i.e., the memory is not
contiguous in a single allocation), which makes the indexing complicated...
the simplest way to express the data movement in code is by a large number
of MPI_Isend() and MPI_Irecv() calls followed, of course, by an eventual
MPI_Waitall().  This works fine for many cases, but I've run into a case
now where the chunks are imbalanced such that a few ranks have a total of
~450 MPI_Request objects (I do a single MPI_Waitall() with all requests at
once) and the remaining ranks have < 10 MPI_Requests.  In this scenario, I
get a seg fault inside PMPI_Waitall().

Is there an implementation limit on how many outstanding asynchronous
requests are allowed?  If so, can it be queried, either via a #define value
or a runtime call?  I probably won't go this route, but when initially compiling
OpenMPI, is there a configure option to increase it?

I've done a fair amount of debugging and am pretty confident this is where
the error is occurring as opposed to indexing out of bounds somewhere, but
if there is no such limit in OpenMPI, that would be useful to know too.
