Ralph,

The issue Siegmar initially reported was:

loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi


Per what you wrote, this should be equivalent to

loki hello_1 111 mpiexec -np 3 --host loki:2,exin:1 hello_1_mpi

and this is what I initially wanted to double-check (but I made a typo in my
reply): a bare host name defaults to one slot, so loki:2,exin should also
provide 2 + 1 = 3 slots.


Anyway, the logs Siegmar posted indicate that the two commands produce the same output:

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
  hello_1_mpi

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------


To me, this is incorrect, since the command line makes 3 slots available.
Also, I am unable to reproduce any of these issues :-(



Siegmar,

Can you please post your configure command line, and try these commands from loki:

mpiexec -np 3 --host loki:2,exin --mca plm_base_verbose 5 hostname
mpiexec -np 1 --host exin --mca plm_base_verbose 5 hostname
mpiexec -np 1 --host exin ldd ./hello_1_mpi
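
It might also be worth re-running the failing command with --display-allocation
added; assuming that option is available in your build (it is a standard
mpirun/mpiexec option in the versions I am familiar with), it prints the slot
count mpiexec computed for each host before mapping any processes:

mpiexec -np 3 --host loki:2,exin --display-allocation hostname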

If Open MPI is not installed on a shared filesystem (NFS for example), please
also double-check that both installs were built from the same source and with
the same options.
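
For example, something along these lines (a rough sketch; the exact labels of
the configure-related lines in the ompi_info output can differ between
versions, and you may need the full path to ompi_info if it is not in the
default PATH of a non-interactive ssh session):

ompi_info --all | grep -i configure
ssh exin ompi_info --all | grep -i configure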


Cheers,

Gilles
On 5/30/2017 10:20 PM, r...@open-mpi.org wrote:
This behavior is as expected. When you specify "-host foo,bar", you have told
us to assign one slot to each of those nodes. Thus, running 3 procs exceeds
the number of slots you assigned.

You can tell it to set the #slots to the #cores it discovers on the node by
using "-host foo:*,bar:*".

I cannot replicate your behavior of "-np 3 -host foo:2,bar:3" running more than
3 procs.
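
To spell out the slot arithmetic behind that default (./a.out below is just a
placeholder for any MPI program):

mpiexec -np 2 --host foo,bar ./a.out     # 1 + 1 = 2 slots, -np 2 fits
mpiexec -np 3 --host foo,bar ./a.out     # still only 2 slots, -np 3 is refused
mpiexec -np 3 --host foo:2,bar ./a.out   # 2 + 1 = 3 slots, -np 3 should fit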


On May 30, 2017, at 5:24 AM, Siegmar Gross 
<siegmar.gr...@informatik.hs-fulda.de> wrote:

Hi Gilles,

what if you run "mpiexec --host loki:1,exin:1 -np 3 hello_1_mpi" ?
I need as many slots as processes, so I use "-np 2".
"mpiexec --host loki,exin -np 2 hello_1_mpi" works as well. The command
breaks if I use at least "-np 3" and distribute the processes across at
least two machines.

loki hello_1 118 mpiexec --host loki:1,exin:1 -np 2 hello_1_mpi
Process 0 of 2 running on loki
Process 1 of 2 running on exin
Now 1 slave tasks are sending greetings.
Greetings from task 1:
  message type:        3
  msg length:          131 characters
  message:
    hostname:          exin
    operating system:  Linux
    release:           4.4.49-92.11-default
    processor:         x86_64
loki hello_1 119



are loki and exin different? (OS, sockets, cores)
Yes, loki is a real machine and exin is a virtual one. "exin" uses a newer
kernel.

loki fd1026 108 uname -a
Linux loki 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016 (2d3e9d4) 
x86_64 x86_64 x86_64 GNU/Linux

loki fd1026 109 ssh exin uname -a
Linux exin 4.4.49-92.11-default #1 SMP Fri Feb 17 08:29:30 UTC 2017 (8f9478a) 
x86_64 x86_64 x86_64 GNU/Linux
loki fd1026 110

The number of sockets and cores is identical, but the processor types are
different as you can see at the end of my previous email. "loki" uses two
"Intel(R) Xeon(R) CPU E5-2620 v3" processors and "exin" two "Intel Core
Processor (Haswell, no TSX)" from QEMU. I can provide a PDF file with both
topologies (89 KB) if you are interested in the output from lstopo. I've
added some runs. Most interesting in my opinion are the last two
"mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi" and
"mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi".
Why does mpiexec create five processes although I've asked for only three
processes? Why do I have to break the program with <Ctrl-c> for the first
of the above commands?



loki hello_1 110 mpiexec --host loki:2,exin:1 -np 3 hello_1_mpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
  hello_1_mpi

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------



loki hello_1 111 mpiexec --host exin:3 -np 3 hello_1_mpi
Process 0 of 3 running on exin
Process 1 of 3 running on exin
Process 2 of 3 running on exin
...



loki hello_1 115 mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi
Process 1 of 3 running on loki
Process 0 of 3 running on loki
Process 2 of 3 running on loki
...

Process 0 of 3 running on exin
Process 1 of 3 running on exin
[exin][[52173,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:794:mca_btl_tcp_endpoint_complete_connect]
 connect() to 193.xxx.xxx.xxx failed: Connection refused (111)

^Cloki hello_1 116




loki hello_1 116 mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi
Process 0 of 3 running on loki
Process 2 of 3 running on loki
Process 1 of 3 running on loki
...
Process 1 of 3 running on exin
Process 0 of 3 running on exin
[exin][[51638,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:590:mca_btl_tcp_endpoint_recv_blocking]
 recv(16, 0/8) failed: Connection reset by peer (104)
[exin:31909] 
../../../../../openmpi-v3.x-201705250239-d5200ea/ompi/mca/pml/ob1/pml_ob1_sendreq.c:191
 FATAL
loki hello_1 117


Do you need anything else?


Kind regards and thank you very much for your help

Siegmar



Cheers,
Gilles
----- Original Message -----
Hi,

I have installed openmpi-v3.x-201705250239-d5200ea on my "SUSE Linux
Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-7.1.0.
Depending on the machine that I use to start my processes, I have
a problem with "--host" for versions "v3.x" and "master", while
everything works as expected with earlier versions.


loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 3 slots
that were requested by the application:
  hello_1_mpi

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------


Everything is ok if I use the same command on "exin".

exin fd1026 107 mpiexec -np 3 --host loki:2,exin hello_1_mpi
Process 0 of 3 running on loki
Process 1 of 3 running on loki
Process 2 of 3 running on exin
...



Everything is also ok if I use openmpi-v2.x-201705260340-58c6b3c on "loki".
loki hello_1 114 which mpiexec
/usr/local/openmpi-2.1.2_64_cc/bin/mpiexec
loki hello_1 115 mpiexec -np 3 --host loki:2,exin hello_1_mpi
Process 0 of 3 running on loki
Process 1 of 3 running on loki
Process 2 of 3 running on exin
...


"exin" is a virtual machine on QEMU so that it uses a slightly
different
processor architecture, e.g., it has no L3 cache but larger L2 caches.

loki fd1026 117 cat /proc/cpuinfo | grep -e "model name" -e "physical id" -e "cpu cores" -e "cache size" | sort | uniq
cache size    : 15360 KB
cpu cores    : 6
model name    : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
physical id    : 0
physical id    : 1


loki fd1026 118 ssh exin cat /proc/cpuinfo | grep -e "model name" -e "physical id" -e "cpu cores" -e "cache size" | sort | uniq
cache size    : 4096 KB
cpu cores    : 6
model name    : Intel Core Processor (Haswell, no TSX)
physical id    : 0
physical id    : 1


Any ideas what's different in the newer versions of Open MPI? Is the new
behavior intended? If "mpiexec -np 3 --host loki:2,exin hello_1_mpi" is
supposed to print my messages in versions "3.x" and "master" as well,
regardless of the machine the programs are started on, I would be grateful
if somebody could fix the problem. Do you need anything else? Thank you
very much for any help in advance.


Kind regards

Siegmar