Re: [OMPI users] Open MPI internal error

2017-09-29 Thread Richard Graham
On behalf of Richard Graham <richa...@mellanox.com> Sent: Thursday, 28 September 2017 18:09 To: Open MPI Users Subject: Re: [OMPI users] Open MPI internal error I just talked with George, who brought me up to speed on this particular prob…

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread Ludovic Raess
Sent: Thursday, 28 September 2017 18:09 To: Open MPI Users Subject: Re: [OMPI users] Open MPI internal error I just talked with George, who brought me up to speed on this particular problem. I would suggest a couple of things: - Look at the HW error counters, and see if y…
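The "HW error counters" suggested above can be read directly on each node. A minimal sketch, assuming a standard OFED install; the device name `mlx4_0` and port `1` are placeholders for whatever the actual HCAs are called:

```shell
# Hedged sketch: print a few per-port InfiniBand error counters from sysfs.
# Device name (mlx4_0) and port number (1) are assumptions -- list
# /sys/class/infiniband/ to find the real HCA names on the node.
for c in symbol_error port_rcv_errors port_xmit_discards; do
  f="/sys/class/infiniband/mlx4_0/ports/1/counters/$c"
  [ -r "$f" ] && echo "$c: $(cat "$f")"
done

# Alternatively, query the local port with perfquery (infiniband-diags):
perfquery -x
```

Nonzero symbol or receive errors that keep growing usually point at a bad cable, connector, or port rather than an Open MPI bug.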

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread Richard Graham
John, On the ULFM mailing list you pointed out, we converged toward a hardware issue. Resources associated with the dead process were not correctly freed, and follow-up processes on the same setup would inherit issues related to these lingering messages. However…

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread George Bosilca
John, On the ULFM mailing list you pointed out, we converged toward a hardware issue. Resources associated with the dead process were not correctly freed, and follow-up processes on the same setup would inherit issues related to these lingering messages. However, keep in mind that the setup was…

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread John Hearns via users
P.S. Before you reboot a compute node, have you run 'ibdiagnet'? On 28 September 2017 at 11:17, John Hearns wrote: > > Google turns this up: > https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls > > > On 28 September 2017 at 01:26, Ludovic Raess…
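An `ibdiagnet` run as suggested above sweeps the whole fabric from a single node. A minimal sketch, assuming the infiniband-diags/ibutils packages are installed; the `-lw`/`-ls` values shown assume 4x FDR links and should be adjusted to the actual fabric:

```shell
# Hedged sketch: fabric-wide InfiniBand diagnostics from one node.
# -lw 4x / -ls 14 assume 4x FDR links (14 Gb/s per lane); adjust as needed.
ibdiagnet -lw 4x -ls 14

# Quick per-link error summary across the fabric (infiniband-diags):
ibqueryerrors
```

Running this before and after a failing job makes it easier to tell whether link errors accumulated during the run.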

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread John Hearns via users
Google turns this up: https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls On 28 September 2017 at 01:26, Ludovic Raess wrote: > Hi, > > > we have an issue on our 32-node Linux cluster regarding the use of Open > MPI in an InfiniBand dual-rail configuration (2 IB…

[OMPI users] Open MPI internal error

2017-09-27 Thread Ludovic Raess
Hi, we have an issue on our 32-node Linux cluster regarding the use of Open MPI in an InfiniBand dual-rail configuration (2 IB ConnectX FDR single-port HCAs, CentOS 6.6, OFED 3.1, Open MPI 2.0.0, GCC 5.4, CUDA 7). On long runs (over ~10 days) involving more than one node (usually 64 MPI…
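For a dual-rail setup like the one described, the launch line typically tells the openib BTL which HCAs to use. A minimal sketch, not the poster's actual command; the device names `mlx4_0,mlx4_1` and the application name `./app` are placeholders:

```shell
# Hedged sketch of launching over both rails with Open MPI 2.0.x.
# btl_openib_if_include selects the HCAs; mlx4_0,mlx4_1 are assumed names.
mpirun -np 64 \
  --mca btl openib,self,vader \
  --mca btl_openib_if_include mlx4_0,mlx4_1 \
  ./app
```

Restricting the run to a single rail (listing only one device) is a cheap way to test whether one of the two HCAs or cables is the culprit.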