Thank you Gilles for the pointer.
However, that package "openmpi-gnu-ohpc-1.10.6-23.1.x86_64.rpm" has other
dependencies from OpenHPC. Basically, it is strongly tied to the whole
I did, however, follow your suggestion and rebuilt the OpenMPI RPM package
from Red Hat, adding the
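For reference, rebuilding a distribution's Open MPI source RPM generally looks like the sketch below. The exact package version and any extra configure options depend on the spec file in question, so treat the names here as illustrative, not the precise ones used in this thread:

```shell
# Fetch the distribution source RPM (package version shown by your repo
# may differ from the one discussed here).
yumdownloader --source openmpi     # or: dnf download --source openmpi

# Rebuild it; binary RPMs land under ~/rpmbuild/RPMS/<arch>/ by default.
# Extra configure flags must be added in the .spec file (or via macros
# the spec actually honors) before rebuilding.
rpmbuild --rebuild openmpi-*.src.rpm
```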
Dear John, George, Rich,
thank you for the suggestions and potential paths towards understanding the
reason for the observed freeze. Although a HW issue is possible, it seems
unlikely, since the error appears only after long runs rather than at random.
Also, it is kind of fixed after a
I just talked with George, who brought me up to speed on this particular
I would suggest a couple of things:
- Look at the HW error counters, and see if you have many retransmits.
This would indicate a potential issue with the particular HW in use, such as a
cable that is
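As a concrete starting point for checking those counters, the stock infiniband-diags tools can dump per-port error counters fabric-wide or for a single port. The LID and port number below are placeholders for your own fabric:

```shell
# Walk the fabric and report links with non-zero error counters
# (requires the infiniband-diags package and an active subnet manager).
ibqueryerrors

# Or query one port's counters directly, e.g. LID 4, port 1
# (replace with a LID/port from your own fabric):
perfquery 4 1
# Steadily climbing SymbolErrorCounter, LinkErrorRecoveryCounter or
# PortRcvErrors values on one link point at a bad cable or port.
```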
On the ULFM mailing list you pointed out, we converged toward a hardware
issue. Resources associated with the dead process were not correctly freed,
and follow-up processes on the same setup would inherit issues related to
these lingering messages. However, keep in mind that the setup was
PS: before you do the reboot of a compute node, have you run 'ibdiagnet'?
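For completeness, a basic ibdiagnet sweep (run as root on a node with an active IB port) might look like the following; the exact flags and report locations vary between the older ibutils version and ibdiagnet2, so check your local man page:

```shell
# Clear all port counters, let the workload run, then re-sweep to see
# which counters are actually incrementing under load.
ibdiagnet -pc        # reset port counters first
sleep 600            # run the application for a while in between
ibdiagnet            # re-sweep; reports are written to the tool's default
                     # log directory (e.g. /var/tmp/ibdiagnet2/ for v2)
```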
On 28 September 2017 at 11:17, John Hearns wrote:
> Google turns this up:
> On 28 September 2017 at 01:26, Ludovic Raess
Google turns this up:
On 28 September 2017 at 01:26, Ludovic Raess wrote:
> we have an issue on our 32-node Linux cluster regarding the usage of Open
> MPI in an InfiniBand dual-rail configuration (2 IB