Google turns this up: https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls
On 28 September 2017 at 01:26, Ludovic Raess <ludovic.ra...@unil.ch> wrote:

> Hi,
>
> we have an issue on our 32-node Linux cluster regarding the usage of Open
> MPI in an Infiniband dual-rail configuration (2 IB ConnectX FDR single-port
> HCAs, CentOS 6.6, OFED 3.1, Open MPI 2.0.0, gcc 5.4, CUDA 7).
>
> On long runs (over ~10 days) involving more than 1 node (usually 64 MPI
> processes distributed over 16 nodes [node01-node16]), we observe the
> simulation freezing due to an internal error: "error polling LP CQ with
> status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1
> vendor error 136 qp_idx 0" (see attached file for full output).
>
> The job hangs: no computation or communication occurs anymore, but the job
> neither exits nor releases the nodes. The job can be killed normally, but
> the affected nodes then do not fully recover. A relaunched simulation
> usually sustains a couple of iterations (a few minutes of runtime), and
> then hangs again for similar reasons. The only workaround so far is to
> reboot the involved nodes.
>
> Since we didn't find any hints on the web regarding this strange
> behaviour, I am wondering if this is a known issue. We don't know what
> causes this to happen or why, so any hints on where to start
> investigating, or possible reasons for this to happen, are welcome.
>
> Ludovic
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
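Not a fix, but since the dual-rail setup is a suspect, one way to narrow it down is to pin Open MPI to a single rail via the openib BTL's interface include list and see whether the REMOTE ACCESS ERROR follows a particular HCA. A diagnostic sketch only: the device names `mlx4_0`/`mlx4_1` and `./your_app` are placeholders, so check the real names with `ibstat` or `ibv_devinfo` on the compute nodes first.

```shell
# Run on the first HCA only (device name is an assumption -- verify with ibstat):
mpirun -np 64 --mca btl openib,self,vader \
       --mca btl_openib_if_include mlx4_0 ./your_app

# Repeat on the second HCA to compare:
mpirun -np 64 --mca btl openib,self,vader \
       --mca btl_openib_if_include mlx4_1 ./your_app
```

If the hang only reproduces on one rail, that points at a specific HCA, cable, or switch port rather than Open MPI itself.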