Google turns this up:
https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls
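
For reference, "status number 10" in the error quoted below is IBV_WC_REM_ACCESS_ERR in the libibverbs work-completion status enum, i.e. the remote HCA rejected an RDMA access (typically a bad or stale rkey, or a memory region that was deregistered). Here is a rough sketch of how such a status surfaces when draining a verbs completion queue; this is illustrative only, not the actual openib BTL code, and drain_cq plus the already-created cq are assumptions (link with -libverbs):

/* Hedged sketch, not Open MPI internals: drain a verbs CQ and report
 * failed work completions. IBV_WC_REM_ACCESS_ERR has numeric value 10,
 * matching the "status number 10" in the error message below. */
#include <stdio.h>
#include <infiniband/verbs.h>

static void drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;

    while ((n = ibv_poll_cq(cq, 1, &wc)) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            /* Remote access error: the peer HCA refused an RDMA
             * read/write (bad rkey, deregistered memory region, ...). */
            fprintf(stderr,
                    "CQ error: wr_id %llx status %d (%s) opcode %d vendor_err %u\n",
                    (unsigned long long) wc.wr_id, (int) wc.status,
                    ibv_wc_status_str(wc.status), (int) wc.opcode,
                    wc.vendor_err);
        }
    }
    if (n < 0)
        fprintf(stderr, "ibv_poll_cq failed\n");
}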


On 28 September 2017 at 01:26, Ludovic Raess <ludovic.ra...@unil.ch> wrote:

> Hi,
>
>
> we have an issue on our 32-node Linux cluster regarding the usage of Open
> MPI in an InfiniBand dual-rail configuration (2 IB ConnectX FDR single-port
> HCAs, CentOS 6.6, OFED 3.1, Open MPI 2.0.0, gcc 5.4, CUDA 7).
>
>
> On long runs (over ~10 days) involving more than one node (usually 64 MPI
> processes distributed over 16 nodes [node01-node16]), the simulation freezes
> with an internal error: "error polling LP CQ with status REMOTE ACCESS ERROR
> status number 10 for wr_id e88c00 opcode 1 vendor error 136 qp_idx 0" (see
> attached file for full output).
>
>
> The job hangs: no computation or communication occurs anymore, but the job
> neither exits nor frees the nodes. The job can be killed normally, but the
> affected nodes do not fully recover afterwards. A relaunch of the simulation
> usually survives a couple of iterations (a few minutes of runtime) before
> hanging again for similar reasons. The only workaround so far is to reboot
> the involved nodes.
>
>
> Since we didn't find any hints on the web regarding this strange behaviour,
> I am wondering if this is a known issue. We don't know what causes this or
> why it happens, so any hints on where to start investigating, or possible
> reasons for this behaviour, are welcome.
>
>
> Ludovic
>
