I got 4 x AMD A-10 6800K nodes on loan for a few months and added them to
my existing Intel nodes.

All nodes share the relevant directories via NFS. I have OpenMPI 1.6.5
which was build with Open-MX 1.5.3 support networked via GbE.

All nodes run Ubuntu 12.04.

Problem:

I can run a job EITHER on 4 x AMD nodes OR on 2 x Intel nodes, but I cannot
run a job on any combination of an AMD and Intel node, ie. 1 x AMD node + 1
x Intel node = error below.

The error that I get during job setup is:

>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>   Process 1 ([[2229,1],1]) is on host: AMD-Node-1
>   Process 2 ([[2229,1],8]) is on host: Intel-Node-1
>   BTLs attempted: self sm tcp
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is unreachable
> from another.  This *usually* means that an underlying communication
> plugin -- such as a BTL or an MTL -- has either not loaded or not
> allowed itself to be used.  Your MPI job will now abort.
> You may wish to try to narrow down the problem;
>  * Check the output of ompi_info to see which BTL/MTL plugins are
>    available.
>  * Run your application with MPI_THREAD_SINGLE.
>  * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>    if using MTL-based communications) to see exactly which
>    communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> [AMD-Node-1:3932] *** An error occurred in MPI_Init
> [AMD-Node-1:3932] *** on a NULL communicator
> [AMD-Node-1:3932] *** Unknown error
> [AMD-Node-1:3932] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> double check that everything has shut down cleanly.
>   Reason:     Before MPI_INIT completed
>   Local host: AMD-Node-1
>   PID:        3932
> --------------------------------------------------------------------------



What I would like to know is, is it actually difficult (impossible) to mix
AMD and Intel machines in the same cluster and have them run the same job,
or am I missing something obvious, or not so obvious when it comes to the
communication stack on the Intel nodes for example.

I set up the AMD nodes just yesterday, but I used the same OpenMPI and
Open-MX versions, however I may have inadvertently done something
different, so I am thinking (hoping) that it is possible to run such a
heterogeneous cluster, and that all I need to do is ensure that all OpenMPI
modules are correctly installed on all nodes.

I need the extra 32 Gb RAM and the AMD nodes bring as I need to validate
our CFD application, and our additional Intel nodes are still not here (ETA
2 weeks).

Thank you,

Victor

Reply via email to