Re: [OMPI users] Fwd: OpenMPI does not obey hostfile

2017-09-28 Thread Anthony Thyssen
Thank you, Gilles, for the pointer. However, that package "openmpi-gnu-ohpc-1.10.6-23.1.x86_64.rpm" has other dependencies from OpenHPC; basically, it is strongly tied to the whole OpenHPC concept. I did, however, follow your suggestion and rebuilt the OpenMPI RPM package from Red Hat, adding the
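
For readers following the thread, a minimal sketch of the hostfile behaviour under discussion, assuming Open MPI's mpirun; the node names, slot counts, and application name below are placeholders:

    # hosts.txt -- hypothetical node names and slot counts
    node01 slots=4
    node02 slots=4

    # launch 8 ranks, placing them according to the hostfile
    mpirun --hostfile hosts.txt -np 8 ./my_mpi_app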

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread Ludovic Raess
Dear John, George, Rich, thank you for the suggestions and potential paths towards understanding the reason for the observed freeze. Although a HW issue might be possible, it sounds unlikely since the error appears only after long runs and not randomly. Also, it is kind of fixed after a

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread Richard Graham
I just talked with George, who brought me up to speed on this particular problem. I would suggest a couple of things: - Look at the HW error counters, and see if you have many retransmits. This would indicate a potential issue with the particular HW in use, such as a cable that is
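
A minimal sketch of the kind of counter check Richard suggests, assuming the infiniband-diags utilities are installed; the LID and port number below are placeholders:

    # sweep the fabric for ports with non-zero error counters
    ibqueryerrors

    # query the counters of one port directly (LID 4, port 1 are placeholders)
    perfquery 4 1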

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread George Bosilca
John, On the ULFM mailing list you pointed out, we converged toward a hardware issue. Resources associated with the dead process were not correctly freed, and follow-up processes on the same setup would inherit issues related to these lingering messages. However, keep in mind that the setup was

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread John Hearns via users
PS. Before you reboot a compute node, have you run 'ibdiagnet'? On 28 September 2017 at 11:17, John Hearns wrote: > Google turns this up: > https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls > On 28 September 2017 at 01:26, Ludovic Raess
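
For reference, a minimal sketch of the fabric check John mentions, assuming ibdiagnet from the OFED/Mellanox tools is available (options for clearing and re-reading counters vary between ibdiagnet versions):

    # run a basic fabric diagnostic before rebooting the node
    ibdiagnet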

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread John Hearns via users
Google turns this up: https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls On 28 September 2017 at 01:26, Ludovic Raess wrote: > Hi, > > we have an issue on our 32-node Linux cluster regarding the usage of Open > MPI in an InfiniBand dual-rail configuration (2 IB
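
A minimal sketch of how a dual-rail job was commonly launched with the openib BTL in Open MPI of that era; the HCA names, hostfile, rank count, and application name below are placeholders:

    # include both HCAs explicitly for the openib BTL (mlx4_0/mlx4_1 are placeholders)
    mpirun -np 64 --hostfile hosts.txt \
           --mca btl openib,self,vader \
           --mca btl_openib_if_include mlx4_0,mlx4_1 \
           ./my_mpi_app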