Re: [OMPI users] Fwd: OpenMPI does not obey hostfile

2017-09-28 Thread Anthony Thyssen
Thank you Gilles for the pointer.

However that package "openmpi-gnu-ohpc-1.10.6-23.1.x86_64.rpm" has other
dependencies on OpenHPC packages.  Basically it is strongly tied to the whole
OpenHPC ecosystem.


I did however follow your suggestion and rebuilt the OpenMPI RPM package
from Red Hat, adding the "tm" module needed for integration with Torque.  But
that only produced another (similar but not quite the same) problem.
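
For reference, the from-source equivalent of that rebuild is roughly the
following (a sketch only: the install prefix and the Torque location are
assumptions, and the actual RPM rebuild would pass the same flag through the
spec file):

  # build Open MPI with the Torque/PBS "tm" launcher support
  ./configure --prefix=/usr/local/openmpi --with-tm=/usr
  make -j4 && make install
  # confirm the tm components were built
  ompi_info | grep tm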

OpenMPI now does correctly pick up the node allocation from Torque (according
to --display-allocation and --display-map), but for some reason it is
completely ignoring it, and just running everything (over-subscribing) on
the first node given.  The previous problem did not over-subscribe the
nodes; it just did not spread out the processes as requested.
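
For completeness, the check that shows this looks roughly like the following
(run inside the Torque job; the program name is just an example):

  mpirun --display-allocation --display-map hostname
  # "ALLOCATED NODES" lists every node Torque granted via tm,
  # yet the process map that follows places all ranks on the first node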

I am starting a new thread about this problem to try and get some help.


  Anthony Thyssen ( System Programmer )
 --
  Warning: May contain traces of nuts.
 --


On Wed, Sep 27, 2017 at 2:55 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Anthony,
>
> a few things ...
> - Open MPI v1.10 is no longer supported
> - you should at least use v2.0, preferably v2.1 or even the newly released
> 3.0
> - if you need to run under torque/pbs, then Open MPI should be built
> with tm support
> - openhpc.org provides Open MPI 1.10.7 with tm support
>
> Cheers,
>
> Gilles
>
> On Wed, Sep 27, 2017 at 12:57 PM, Anthony Thyssen
>  wrote:
> > This is not explained in the manual, when giving a hostfile (though I was
> > suspecting that was the case).
> >
> > However, running one process on each node listed WAS the default behaviour
> > in the past.  In fact that is the default behaviour of the old Version 1.5.4
> > OpenMPI I have on an old cluster which I am replacing.
> >
> > I suggest that this be explicitly explained in at least the manpages, and
> > preferably the OpenMPI FAQ too.
> >
> >
> > It explains why the manpages and FAQ seem to avoid specifying a host twice
> > in a --hostfile, and yet specifically do specify a host twice in the next
> > section on the --host option.  But no explanation is given!
> >
> > It explains why, if I give a --pernode option, it runs only one process on
> > each host BUT ignores the fact that a host was listed twice.  And if a -np
> > option is also given with --pernode, it errors with "more processes than the
> > ppr".
> >
> >
> > What that does NOT explain is why it completely ignores the "ALLOCATED
> > NODES" that was reported in the debug output, as shown above.
> >
> > The only reason I posted for help was because the debug output seems to
> > indicate that it should be performing as I expected.
> >
> > ---
> >
> > Is there an option to force OpenMPI to use the OLD behaviour, just as many
> > web pages indicate it should be doing?
> > I have found no such option in the man pages.
> >
> > Without such an option, passing the $PBS_NODEFILE (from Torque) to the
> > "mpirun" command becomes much more difficult, which is why I developed the
> > "awk" script above to convert it to a comma-separated --host argument, which
> > does work.
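> >
> > The conversion is essentially the following (a sketch only, since the awk
> > script itself is not reproduced in this excerpt):
> >
> >   # join Torque's one-hostname-per-slot $PBS_NODEFILE into a comma-separated list
> >   HOSTS=$(paste -sd, $PBS_NODEFILE)
> >   mpirun --host $HOSTS hostname
> >
> > Repeated hostnames are kept, one per Torque slot, matching the way the FAQ
> > shows --host being used with a host listed more than once.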
> >
> > It seems a LOT of webpages on the net all assume the old behaviour of
> > --hostfile, which is why this new behaviour is confusing me, especially with
> > no explicit mention of it in the manual or the OpenMPI FAQ pages.
> >
> > ---
> >
> > I have seen many PBS guides specify a --np option for the MPI command,
> > though I could not see the point of it.
> >
> > A quick test seemed to indicate that it works, so I thought perhaps that was
> > the way to specify the old behaviour.
> >
> > # mpirun --hostfile hostfile.txt hostname
> > node21.emperor
> > node22.emperor
> > node21.emperor
> > node22.emperor
> > node23.emperor
> > node23.emperor
> >
> > # mpirun --hostfile hostfile.txt --np $(wc -l < hostfile.txt) hostname
> > node21.emperor
> > node22.emperor
> > node22.emperor
> > node21.emperor
> >
> > I think however that was purely a fluke, as when I expand it to a PBS batch
> > script command, to run on a larger number of nodes...
> >
> > mpirun --hostfile $PBS_NODEFILE -np $PBS_NP hostname
> >
> > the result is that OpenMPI still runs as many of the processes as it can (up
> > to the NP limit) on the first few nodes given, and not as Torque PBS
> > specified.
> >
> > ---
> >
> > ASIDE: The auto-discovery does not appear to work very well.  Tests with a
> > mix of dual- and quad-core machines often result in only
> > 2 processes on some of the quad-core machines.
> >
> > I saw mention of a --hetero-nodes option which works to make auto-discovery
> > work as expected.  BUT it is NOT mentioned in the manual, and to me "hetero"
> > implies a heterogeneous set of computers (all the same) rather than a mix of
> > computer types.  As 

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread Ludovic Raess
Dear John, George, Rich,


thank you for the suggestions and potential paths towards understanding the
reason for the observed freeze. Although a HW issue might be possible, it
sounds unlikely, since the error appears only after long runs and not randomly.
Also, it is effectively fixed by a reboot, until another long run starts again.


Cables and connections seem OK; we have already reset all connections.


Currently, we are investigating two paths towards a fix. We implemented a
slightly modified version of the MPI point-to-point comm routine, to see
whether a hidden programming issue is still the cause. Additionally, I am
running the problematic setup using MVAPICH to see if it is related to Open MPI
in particular, which would exclude a HW or implementation issue.


In both cases, I will run 'ibdiagnet' if the freeze occurs again, as suggested.
Lastly, we could try to set the retransmit count to 0 as suggested by Rich.
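
For reference, the counter checks would be along these lines, assuming the
standard infiniband-diags tools are installed on the nodes (LID and port are
placeholders):

  ibdiagnet                   # full fabric diagnostic, as suggested
  ibqueryerrors               # summary of ports with non-zero error counters
  perfquery -x <lid> <port>   # extended counters for a single HCA port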


Thanks for the suggestions, and I'll write back if I have new hints (it will
take some days for the runs to potentially freeze).


Ludovic


From: users  on behalf of Richard Graham
Sent: Thursday, 28 September 2017 18:09
To: Open MPI Users
Subject: Re: [OMPI users] Open MPI internal error

I just talked with George, who brought me up to speed on this particular  
problem.

I would suggest a couple of things:

-  Look at the HW error counters, and see if you have many retransmits. 
 This would indicate a potential issue with the particular HW in use, such as a 
cable that is not seated well, or some type of similar problem.

-  If you have the ability, reset your cables from the HCA to the 
switch, and see if this addresses the problem.
Also, if you have the ability (e.g., can modify the Open MPI source code), set 
the retransmit count to 0, and see if you see the same issue.  This would just 
speed up reaching the problem, if this is indeed the issue.

Rich


From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of George 
Bosilca
Sent: Thursday, September 28, 2017 11:04 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Open MPI internal error

John,

On the ULFM mailing list you pointed out, we converged toward a hardware issue. 
Resources associated with the dead process were not correctly freed, and 
follow-up processes on the same setup would inherit issues related to these 
lingering messages. However, keep in mind that the setup was different as we 
were talking about losing a process.

The proposed solution of forcing the timeout to a large value did not fix the
problem; it just delayed it enough for the application to run to completion.

  George.


On Thu, Sep 28, 2017 at 5:17 AM, John Hearns via users 
> wrote:
ps. Before you do the reboot of a compute node, have you run 'ibdiagnet' ?

On 28 September 2017 at 11:17, John Hearns 
> wrote:

Google turns this up:
https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls


On 28 September 2017 at 01:26, Ludovic Raess 
> wrote:

Hi,



we have an issue on our 32-node Linux cluster regarding the usage of Open MPI
in an InfiniBand dual-rail configuration (2 IB ConnectX FDR single-port HCAs,
CentOS 6.6, OFED 3.1, Open MPI 2.0.0, gcc 5.4, CUDA 7).



On long runs (over ~10 days) involving more than 1 node (usually 64 MPI
processes distributed on 16 nodes [node01-node16]), we observe a freeze of
the simulation due to an internal error displaying: "error polling LP CQ with
status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1  vendor
error 136 qp_idx 0" (see attached file for full output).



The job hangs: no computation or communication occurs anymore, but no exit or
unloading of the nodes is observed. The job can be killed normally, but then
the nodes concerned do not fully recover. A relaunch of the simulation
usually sustains a couple of iterations (a few minutes of runtime), and then
the job hangs again for similar reasons. The only workaround so far is to
reboot the involved nodes.



Since we didn't find any hints on the web regarding this strange behaviour, I
am wondering if this is a known issue. We actually don't know what causes this
to happen and why. So any hints on where to start investigating, or possible
reasons for this to happen, are welcome.



Ludovic

___
users mailing list
users@lists.open-mpi.org

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread Richard Graham
I just talked with George, who brought me up to speed on this particular  
problem.

I would suggest a couple of things:

-  Look at the HW error counters, and see if you have many retransmits. 
 This would indicate a potential issue with the particular HW in use, such as a 
cable that is not seated well, or some type of similar problem.

-  If you have the ability, reset your cables from the HCA to the 
switch, and see if this addresses the problem.
Also, if you have the ability (e.g., can modify the Open MPI source code), set 
the retransmit count to 0, and see if you see the same issue.  This would just 
speed up reaching the problem, if this is indeed the issue.
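
If the retransmit count in question is the one the openib BTL already exposes
as an MCA parameter (an assumption, but worth checking before editing the
source), it can be lowered at run time, e.g.:

  mpirun --mca btl_openib_ib_retry_count 0 ...
  # btl_openib_ib_retry_count sets the IB QP retry count (valid range 0-7)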

Rich


From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of George 
Bosilca
Sent: Thursday, September 28, 2017 11:04 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Open MPI internal error

John,

On the ULFM mailing list you pointed out, we converged toward a hardware issue. 
Resources associated with the dead process were not correctly freed, and 
follow-up processes on the same setup would inherit issues related to these 
lingering messages. However, keep in mind that the setup was different as we 
were talking about losing a process.

The proposed solution of forcing the timeout to a large value did not fix the
problem; it just delayed it enough for the application to run to completion.

  George.


On Thu, Sep 28, 2017 at 5:17 AM, John Hearns via users 
> wrote:
ps. Before you do the reboot of a compute node, have you run 'ibdiagnet' ?

On 28 September 2017 at 11:17, John Hearns 
> wrote:

Google turns this up:
https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls


On 28 September 2017 at 01:26, Ludovic Raess 
> wrote:

Hi,



we have an issue on our 32-node Linux cluster regarding the usage of Open MPI
in an InfiniBand dual-rail configuration (2 IB ConnectX FDR single-port HCAs,
CentOS 6.6, OFED 3.1, Open MPI 2.0.0, gcc 5.4, CUDA 7).



On long runs (over ~10 days) involving more than 1 node (usually 64 MPI
processes distributed on 16 nodes [node01-node16]), we observe a freeze of
the simulation due to an internal error displaying: "error polling LP CQ with
status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1  vendor
error 136 qp_idx 0" (see attached file for full output).



The job hangs: no computation or communication occurs anymore, but no exit or
unloading of the nodes is observed. The job can be killed normally, but then
the nodes concerned do not fully recover. A relaunch of the simulation
usually sustains a couple of iterations (a few minutes of runtime), and then
the job hangs again for similar reasons. The only workaround so far is to
reboot the involved nodes.



Since we didn't find any hints on the web regarding this strange behaviour, I
am wondering if this is a known issue. We actually don't know what causes this
to happen and why. So any hints on where to start investigating, or possible
reasons for this to happen, are welcome.



Ludovic

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread George Bosilca
John,

On the ULFM mailing list you pointed out, we converged toward a hardware
issue. Resources associated with the dead process were not correctly freed,
and follow-up processes on the same setup would inherit issues related to
these lingering messages. However, keep in mind that the setup was
different as we were talking about losing a process.

The proposed solution of forcing the timeout to a large value did not fix the
problem; it just delayed it enough for the application to run to completion.
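
For context: if the timeout referred to is the openib transport timeout (an
assumption on the parameter involved), the workaround discussed there amounts
to raising an MCA parameter, roughly:

  mpirun --mca btl_openib_ib_timeout 30 ...
  # the value is an exponent: the ACK timeout is 4.096 us * 2^value, so 30 is very large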

  George.


On Thu, Sep 28, 2017 at 5:17 AM, John Hearns via users <
users@lists.open-mpi.org> wrote:

> ps. Before you do the reboot of a compute node, have you run 'ibdiagnet' ?
>
> On 28 September 2017 at 11:17, John Hearns  wrote:
>
>>
>> Google turns this up:
>> https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls
>>
>>
>> On 28 September 2017 at 01:26, Ludovic Raess 
>> wrote:
>>
>>> Hi,
>>>
>>>
>>> we have a issue on our 32 nodes Linux cluster regarding the usage of
>>> Open MPI in a Infiniband dual-rail configuration (2 IB Connect X FDR
>>> single port HCA, Centos 6.6, OFED 3.1, openmpi 2.0.0, gcc 5.4, cuda 7).
>>>
>>>
>>> On long runs (over ~10 days) involving more than 1 node (usually 64 MPI
>>> processes distributed on 16 nodes [node01-node16]​), we observe the freeze
>>> of the simulation due to an internal error displaying: "error polling LP CQ
>>> with status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1
>>>  vendor error 136 qp_idx 0" (see attached file for full output).
>>>
>>>
>>> The job hangs, no computation neither communication occurs anymore, but
>>> no exit neither unload of the nodes is observed. The job can be killed
>>> normally but then the concerned nodes do not fully recover. A relaunch of
>>> the simulation usually sustains a couple of iterations (few minutes
>>> runtime), and then the job hangs again due to similar reasons. The only
>>> workaround so far is to reboot the involved nodes.
>>>
>>>
>>> Since we didn't find any hints on the web regarding this
>>> strange behaviour, I am wondering if this is a known issue. We actually
>>> don't know what causes this to happen and why. So any hints were to start
>>> investigating or possible reasons for this to happen are welcome.​
>>>
>>>
>>> Ludovic
>>>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread John Hearns via users
ps. Before you do the reboot of a compute node, have you run 'ibdiagnet' ?

On 28 September 2017 at 11:17, John Hearns  wrote:

>
> Google turns this up:
> https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls
>
>
> On 28 September 2017 at 01:26, Ludovic Raess 
> wrote:
>
>> Hi,
>>
>>
>> we have a issue on our 32 nodes Linux cluster regarding the usage of Open
>> MPI in a Infiniband dual-rail configuration (2 IB Connect X FDR single
>> port HCA, Centos 6.6, OFED 3.1, openmpi 2.0.0, gcc 5.4, cuda 7).
>>
>>
>> On long runs (over ~10 days) involving more than 1 node (usually 64 MPI
>> processes distributed on 16 nodes [node01-node16]​), we observe the freeze
>> of the simulation due to an internal error displaying: "error polling LP CQ
>> with status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1
>>  vendor error 136 qp_idx 0" (see attached file for full output).
>>
>>
>> The job hangs, no computation neither communication occurs anymore, but
>> no exit neither unload of the nodes is observed. The job can be killed
>> normally but then the concerned nodes do not fully recover. A relaunch of
>> the simulation usually sustains a couple of iterations (few minutes
>> runtime), and then the job hangs again due to similar reasons. The only
>> workaround so far is to reboot the involved nodes.
>>
>>
>> Since we didn't find any hints on the web regarding this
>> strange behaviour, I am wondering if this is a known issue. We actually
>> don't know what causes this to happen and why. So any hints were to start
>> investigating or possible reasons for this to happen are welcome.​
>>
>>
>> Ludovic
>>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Open MPI internal error

2017-09-28 Thread John Hearns via users
Google turns this up:
https://groups.google.com/forum/#!topic/ulfm/OPdsHTXF5ls


On 28 September 2017 at 01:26, Ludovic Raess  wrote:

> Hi,
>
>
> we have a issue on our 32 nodes Linux cluster regarding the usage of Open
> MPI in a Infiniband dual-rail configuration (2 IB Connect X FDR single
> port HCA, Centos 6.6, OFED 3.1, openmpi 2.0.0, gcc 5.4, cuda 7).
>
>
> On long runs (over ~10 days) involving more than 1 node (usually 64 MPI
> processes distributed on 16 nodes [node01-node16]​), we observe the freeze
> of the simulation due to an internal error displaying: "error polling LP CQ
> with status REMOTE ACCESS ERROR status number 10 for wr_id e88c00 opcode 1
>  vendor error 136 qp_idx 0" (see attached file for full output).
>
>
> The job hangs, no computation neither communication occurs anymore, but no
> exit neither unload of the nodes is observed. The job can be killed
> normally but then the concerned nodes do not fully recover. A relaunch of
> the simulation usually sustains a couple of iterations (few minutes
> runtime), and then the job hangs again due to similar reasons. The only
> workaround so far is to reboot the involved nodes.
>
>
> Since we didn't find any hints on the web regarding this
> strange behaviour, I am wondering if this is a known issue. We actually
> don't know what causes this to happen and why. So any hints were to start
> investigating or possible reasons for this to happen are welcome.​
>
>
> Ludovic
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users