Re: [OMPI users] Heterogeneous cluster problem - mixing AMD and Intel nodes

2014-03-21 Thread hsaeed
Victor writes:

> 
> I got 4 x AMD A-10 6800K nodes on loan for a few months and added them to
> my existing Intel nodes.
> 
> All nodes share the relevant directories via NFS. I have OpenMPI 1.6.5
> which was built with Open-MX 1.5.3 support, networked via GbE.
> 
> All nodes run Ubuntu 12.04.
> 
> Problem:
> 
> I can run a job EITHER on 4 x AMD nodes OR on 2 x Intel nodes, but I
> cannot run a job on any combination of an AMD and Intel node, i.e. 1 x
> AMD node + 1 x Intel node = error below.
> 
> The error that I get during job setup is:
> 
> --
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>   Process 1 ([[2229,1],1]) is on host: AMD-Node-1
>   Process 2 ([[2229,1],8]) is on host: Intel-Node-1
>   BTLs attempted: self sm tcp
> Your MPI job is now going to abort; sorry.
> --
> --
> MPI_INIT has failed because at least one MPI process is unreachable
> from another.  This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used.  Your MPI job will now abort.
> You may wish to try to narrow down the problem;
>  * Check the output of ompi_info to see which BTL/MTL plugins are
>    available.
>  * Run your application with MPI_THREAD_SINGLE.
>  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>    if using MTL-based communications) to see exactly which
>    communication plugins were considered and/or discarded.
> --
> [AMD-Node-1:3932] *** An error occurred in MPI_Init
> [AMD-Node-1:3932] *** on a NULL communicator
> [AMD-Node-1:3932] *** Unknown error
> [AMD-Node-1:3932] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
>   Reason:     Before MPI_INIT completed
>   Local host: AMD-Node-1
>   PID:        3932
> --
> 
> 
> What I would like to know is: is it actually difficult (or even impossible)
> to mix AMD and Intel machines in the same cluster and have them run the
> same job, or am I missing something obvious, or not so obvious, for example
> in the communication stack on the Intel nodes?
> 
> I set up the AMD nodes just yesterday, but I used the same OpenMPI and
> Open-MX versions; however, I may have inadvertently done something
> different. I am therefore thinking (hoping) that it is possible to run
> such a heterogeneous cluster and that all I need to do is ensure that all
> OpenMPI modules are correctly installed on all nodes.
> 
> I need the extra 32 GB of RAM that the AMD nodes bring, as I need to
> validate our CFD application, and our additional Intel nodes are still
> not here (ETA 2 weeks).
> 
> Thank you,
> 
> Victor
> 

I hope you can help me solve this problem.

I can compile my helloworld.c program using mpicc, and I have confirmed that 
it runs correctly on another working cluster, so I think the local paths are 
set up correctly and the program itself definitely works.

If I execute mpirun from my master node, using only the master node, 
helloworld executes correctly:

mpirun -n 1 -host master --mca btl sm,openib,self ./helloworldmpi
hello world from process 0 of 1

If I execute mpirun from my master node, using only the worker node, 
helloworld executes correctly:

mpirun -n 1 -host node001 --mca btl sm,openib,self ./helloworldmpi
hello world from process 0 of 1
Now, my problem is that if I try to run helloworld on both nodes, I get an 
error:

mpirun -n 2 -host master,node001 --mca btl openib,self ./helloworldmpi
--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[5228,1],0]) is on host: hsaeed
  Process 2 ([[5228,1],1]) is on 

Re: [OMPI users] Heterogeneous cluster problem - mixing AMD and Intel nodes

2014-03-02 Thread Victor
Thanks for your reply. There are some updates, but it was too late last
night to post them.

I now have the AMD/Intel heterogeneous cluster up and running. The initial
problem was that when I installed OpenMPI on the AMD nodes, the library
paths were set to a different location than on the Intel nodes. I am not
sure why.

In any case, I then followed the suggestion from the FAQ and instead shared
the same OpenMPI install directory with all the nodes via NFS. The job now
runs, so I can confirm that it is indeed possible to run the same job on a
heterogeneous cluster made up of AMD and Intel nodes.

I am using OpenMPI 1.7.4 now.

There is a related problem, though. I am sharing /opt/openmpi-1.7.4 via NFS,
but there does not seem to be a way to tell the nodes where OpenMPI is
located when using non-interactive SSH (key-based login). A non-interactive
SSH session does not appear to source .bash_profile, so I do not know how to
tell the jobs on the nodes where to find OpenMPI other than by starting the
job with /opt/openmpi-1.7.4/bin/mpirun.
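A hedged workaround sketch (the install path is the one above; the hostfile
name and application are placeholders): let mpirun itself export the prefix
to the remote nodes,

/opt/openmpi-1.7.4/bin/mpirun --prefix /opt/openmpi-1.7.4 -n 8 -hostfile hosts.txt ./my_app

With --prefix (or, equivalently, when mpirun is started via its absolute
path), Open MPI sets PATH and LD_LIBRARY_PATH on the remote nodes before
launching its daemons, so nothing needs to be sourced by the non-interactive
shell. Another common option is to export the paths near the top of
~/.bashrc on every node, before any early return for non-interactive shells:

export PATH=/opt/openmpi-1.7.4/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi-1.7.4/lib:$LD_LIBRARY_PATH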

Regarding Open-MX, yes, I will look into that next to see whether the job is
indeed using it. My MCA flag is --mca mx self
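As a hedged aside on the component names (based on the standard Open MPI
1.6/1.7 frameworks, not on anything specific to this setup): Open-MX is
normally selected through either the mx BTL or the mx MTL, e.g.

mpirun --mca btl mx,sm,self ...
mpirun --mca pml cm --mca mtl mx ...

whereas "--mca mx self" sets an MCA parameter literally named "mx" rather
than selecting a BTL or MTL, which would be consistent with the observation
below that only self, sm, and tcp were attempted.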


Re: [OMPI users] Heterogeneous cluster problem - mixing AMD and Intel nodes

2014-03-02 Thread Brice Goglin
What's your mpirun or mpiexec command-line?
The error "BTLs attempted: self sm tcp" says that it didn't even try the
MX BTL (for Open-MX). Did you use the MX MTL instead?
Are you sure that you are actually using Open-MX when you are not mixing AMD
and Intel nodes?
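
A hedged way to check (standard ompi_info usage; nothing specific to this
cluster is assumed): run ompi_info on each node and look for the MX
components,

ompi_info | grep -i mx

If the build really includes Open-MX support, this should list entries such
as "MCA btl: mx ..." and/or "MCA mtl: mx ...".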

Brice



On 02/03/2014 08:06, Victor wrote:
> I got 4 x AMD A-10 6800K nodes on loan for a few months and added them
> to my existing Intel nodes.
>
> All nodes share the relevant directories via NFS. I have OpenMPI 1.6.5
> which was built with Open-MX 1.5.3 support, networked via GbE.
>
> All nodes run Ubuntu 12.04.
>
> Problem:
>
> I can run a job EITHER on 4 x AMD nodes OR on 2 x Intel nodes, but I
> cannot run a job on any combination of an AMD and Intel node, i.e. 1 x
> AMD node + 1 x Intel node = error below.
>
> The error that I get during job setup is:
>
>
> --
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>   Process 1 ([[2229,1],1]) is on host: AMD-Node-1
>   Process 2 ([[2229,1],8]) is on host: Intel-Node-1
>   BTLs attempted: self sm tcp
> Your MPI job is now going to abort; sorry.
> --
> --
> MPI_INIT has failed because at least one MPI process is unreachable
> from another.  This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used.  Your MPI job will now abort.
> You may wish to try to narrow down the problem;
>  * Check the output of ompi_info to see which BTL/MTL plugins are
>available.
>  * Run your application with MPI_THREAD_SINGLE.
>  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>if using MTL-based communications) to see exactly which
>communication plugins were considered and/or discarded.
> --
> [AMD-Node-1:3932] *** An error occurred in MPI_Init
> [AMD-Node-1:3932] *** on a NULL communicator
> [AMD-Node-1:3932] *** Unknown error
> [AMD-Node-1:3932] *** MPI_ERRORS_ARE_FATAL: your MPI job will now
> abort
> --
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
>   Reason: Before MPI_INIT completed
>   Local host: AMD-Node-1
>   PID:3932
> --
>
>
>
> What I would like to know is: is it actually difficult (or even
> impossible) to mix AMD and Intel machines in the same cluster and have
> them run the same job, or am I missing something obvious, or not so
> obvious, for example in the communication stack on the Intel nodes?
>
> I set up the AMD nodes just yesterday, but I used the same OpenMPI and
> Open-MX versions; however, I may have inadvertently done something
> different. I am therefore thinking (hoping) that it is possible to run
> such a heterogeneous cluster and that all I need to do is ensure that
> all OpenMPI modules are correctly installed on all nodes.
>
> I need the extra 32 GB of RAM that the AMD nodes bring, as I need to
> validate our CFD application, and our additional Intel nodes are still
> not here (ETA 2 weeks).
>
> Thank you,
>
> Victor
>
>



[OMPI users] Heterogeneous cluster problem - mixing AMD and Intel nodes

2014-03-02 Thread Victor
I got 4 x AMD A-10 6800K nodes on loan for a few months and added them to
my existing Intel nodes.

All nodes share the relevant directories via NFS. I have OpenMPI 1.6.5
which was built with Open-MX 1.5.3 support, networked via GbE.

All nodes run Ubuntu 12.04.

Problem:

I can run a job EITHER on 4 x AMD nodes OR on 2 x Intel nodes, but I cannot
run a job on any combination of an AMD and Intel node, i.e. 1 x AMD node + 1
x Intel node = error below.

The error that I get during job setup is:

>
> --
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>   Process 1 ([[2229,1],1]) is on host: AMD-Node-1
>   Process 2 ([[2229,1],8]) is on host: Intel-Node-1
>   BTLs attempted: self sm tcp
> Your MPI job is now going to abort; sorry.
> --
> --
> MPI_INIT has failed because at least one MPI process is unreachable
> from another.  This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used.  Your MPI job will now abort.
> You may wish to try to narrow down the problem;
>  * Check the output of ompi_info to see which BTL/MTL plugins are
>available.
>  * Run your application with MPI_THREAD_SINGLE.
>  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>if using MTL-based communications) to see exactly which
>communication plugins were considered and/or discarded.
> --
> [AMD-Node-1:3932] *** An error occurred in MPI_Init
> [AMD-Node-1:3932] *** on a NULL communicator
> [AMD-Node-1:3932] *** Unknown error
> [AMD-Node-1:3932] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
>   Reason: Before MPI_INIT completed
>   Local host: AMD-Node-1
>   PID:3932
> --



What I would like to know is: is it actually difficult (or even impossible)
to mix AMD and Intel machines in the same cluster and have them run the same
job, or am I missing something obvious, or not so obvious, for example in
the communication stack on the Intel nodes?

I set up the AMD nodes just yesterday, but I used the same OpenMPI and
Open-MX versions; however, I may have inadvertently done something
different. I am therefore thinking (hoping) that it is possible to run such
a heterogeneous cluster and that all I need to do is ensure that all OpenMPI
modules are correctly installed on all nodes.
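
A hedged sanity check along those lines (the host names are the ones from
the error output above; adjust to the real node names): confirm that every
node sees the same Open MPI installation, for example

ssh AMD-Node-1 'which mpirun; mpirun --version'
ssh Intel-Node-1 'which mpirun; mpirun --version'

bearing in mind that a non-interactive SSH shell may see a different PATH
than an interactive login. Differing paths or versions between the AMD and
Intel nodes is, per the follow-up earlier in this thread, what the problem
turned out to be.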

I need the extra 32 GB of RAM that the AMD nodes bring, as I need to
validate our CFD application, and our additional Intel nodes are still not
here (ETA 2 weeks).

Thank you,

Victor