This behavior is as expected. When you specify "-host foo,bar", you have told 
us to assign one slot to each of those nodes. Running 3 procs therefore 
exceeds the number of slots you assigned.
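
If you actually want, say, two slots on loki and one on exin, you can either 
say "-host loki:2,exin:1" or put the slot counts in a hostfile (the file name 
"myhosts" here is just an illustration):

  cat myhosts
  loki slots=2
  exin slots=1

  mpiexec --hostfile myhosts -np 3 hello_1_mpi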

You can tell it to set the #slots to the #cores it discovers on the node by 
using "-host foo:*,bar:*".
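
For example, assuming loki and exin each have at least two cores available, 
something like this should then allow "-np 3":

  mpiexec --host loki:*,exin:* -np 3 hello_1_mpi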

I cannot replicate your behavior of "-np 3 -host foo:2,bar:3" running more than 
3 procs.
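
For reference, here is a minimal MPI "hello" program along the lines of the 
hello_1_mpi used below; the real source is not shown in this thread, so the 
exact output format is an assumption:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      int rank, size, len;
      char name[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process */
      MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
      MPI_Get_processor_name(name, &len);    /* node the process runs on */
      printf("Process %d of %d running on %s\n", rank, size, name);
      MPI_Finalize();
      return 0;
  }

Each launched process prints exactly one line, so counting the "Process x of 
y" lines tells you how many processes mpiexec actually started.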


> On May 30, 2017, at 5:24 AM, Siegmar Gross 
> <siegmar.gr...@informatik.hs-fulda.de> wrote:
> 
> Hi Gilles,
> 
>> what if you run:
>> mpiexec --host loki:1,exin:1 -np 3 hello_1_mpi
> 
> I need as many slots as processes, which is why I used "-np 2".
> "mpiexec --host loki,exin -np 2 hello_1_mpi" works as well. The command
> breaks if I use at least "-np 3" and distribute the processes across at
> least two machines.
> 
> loki hello_1 118 mpiexec --host loki:1,exin:1 -np 2 hello_1_mpi
> Process 0 of 2 running on loki
> Process 1 of 2 running on exin
> Now 1 slave tasks are sending greetings.
> Greetings from task 1:
>  message type:        3
>  msg length:          131 characters
>  message:
>    hostname:          exin
>    operating system:  Linux
>    release:           4.4.49-92.11-default
>    processor:         x86_64
> loki hello_1 119
> 
> 
> 
>> are loki and exin different ? (os, sockets, core)
> 
> Yes, loki is a real machine and exin is a virtual one. "exin" uses a newer
> kernel.
> 
> loki fd1026 108 uname -a
> Linux loki 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016 (2d3e9d4) 
> x86_64 x86_64 x86_64 GNU/Linux
> 
> loki fd1026 109 ssh exin uname -a
> Linux exin 4.4.49-92.11-default #1 SMP Fri Feb 17 08:29:30 UTC 2017 (8f9478a) 
> x86_64 x86_64 x86_64 GNU/Linux
> loki fd1026 110
> 
> The number of sockets and cores is identical, but the processor types are
> different as you can see at the end of my previous email. "loki" uses two
> "Intel(R) Xeon(R) CPU E5-2620 v3" processors and "exin" two "Intel Core
> Processor (Haswell, no TSX)" from QEMU. I can provide a pdf file with both
> topologies (89 K) if you are interested in the output from lstopo. I've
> added some runs. The most interesting ones, in my opinion, are the last
> two: "mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi" and
> "mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi".
> Why does mpiexec create five processes although I've asked for only three
> processes? Why do I have to break the program with <Ctrl-c> for the first
> of the above commands?
> 
> 
> 
> loki hello_1 110 mpiexec --host loki:2,exin:1 -np 3 hello_1_mpi
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 3 slots
> that were requested by the application:
>  hello_1_mpi
> 
> Either request fewer slots for your application, or make more slots available
> for use.
> --------------------------------------------------------------------------
> 
> 
> 
> loki hello_1 111 mpiexec --host exin:3 -np 3 hello_1_mpi
> Process 0 of 3 running on exin
> Process 1 of 3 running on exin
> Process 2 of 3 running on exin
> ...
> 
> 
> 
> loki hello_1 115 mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi
> Process 1 of 3 running on loki
> Process 0 of 3 running on loki
> Process 2 of 3 running on loki
> ...
> 
> Process 0 of 3 running on exin
> Process 1 of 3 running on exin
> [exin][[52173,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:794:mca_btl_tcp_endpoint_complete_connect]
>  connect() to 193.xxx.xxx.xxx failed: Connection refused (111)
> 
> ^Cloki hello_1 116
> 
> 
> 
> 
> loki hello_1 116 mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi
> Process 0 of 3 running on loki
> Process 2 of 3 running on loki
> Process 1 of 3 running on loki
> ...
> Process 1 of 3 running on exin
> Process 0 of 3 running on exin
> [exin][[51638,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:590:mca_btl_tcp_endpoint_recv_blocking]
>  recv(16, 0/8) failed: Connection reset by peer (104)
> [exin:31909] 
> ../../../../../openmpi-v3.x-201705250239-d5200ea/ompi/mca/pml/ob1/pml_ob1_sendreq.c:191
>  FATAL
> loki hello_1 117
> 
> 
> Do you need anything else?
> 
> 
> Kind regards and thank you very much for your help
> 
> Siegmar
> 
> 
> 
>> Cheers,
>> Gilles
>> ----- Original Message -----
>>> Hi,
>>> 
>>> I have installed openmpi-v3.x-201705250239-d5200ea on my "SUSE Linux
>>> Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-7.1.0.
>>> Depending on the machine that I use to start my processes, I have
>>> a problem with "--host" for versions "v3.x" and "master", while
>>> everything works as expected with earlier versions.
>>> 
>>> 
>>> loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
>>> --------------------------------------------------------------------------
>>> There are not enough slots available in the system to satisfy the 3 slots
>>> that were requested by the application:
>>>    hello_1_mpi
>>> 
>>> Either request fewer slots for your application, or make more slots available
>>> for use.
>>> --------------------------------------------------------------------------
>>> 
>>> 
>>> 
>>> Everything is ok if I use the same command on "exin".
>>> 
>>> exin fd1026 107 mpiexec -np 3 --host loki:2,exin hello_1_mpi
>>> Process 0 of 3 running on loki
>>> Process 1 of 3 running on loki
>>> Process 2 of 3 running on exin
>>> ...
>>> 
>>> 
>>> 
>>> Everything is also ok if I use openmpi-v2.x-201705260340-58c6b3c on "loki".
>>> 
>>> loki hello_1 114 which mpiexec
>>> /usr/local/openmpi-2.1.2_64_cc/bin/mpiexec
>>> loki hello_1 115 mpiexec -np 3 --host loki:2,exin hello_1_mpi
>>> Process 0 of 3 running on loki
>>> Process 1 of 3 running on loki
>>> Process 2 of 3 running on exin
>>> ...
>>> 
>>> 
>>> "exin" is a virtual machine on QEMU so that it uses a slightly
>> different
>>> processor architecture, e.g., it has no L3 cache but larger L2 caches.
>>> 
>>> loki fd1026 117 cat /proc/cpuinfo | grep -e "model name" -e "physical id" -e "cpu cores" -e "cache size" | sort | uniq
>>> cache size    : 15360 KB
>>> cpu cores    : 6
>>> model name    : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>>> physical id    : 0
>>> physical id    : 1
>>> 
>>> 
>>> loki fd1026 118 ssh exin cat /proc/cpuinfo | grep -e "model name" -e "physical id" -e "cpu cores" -e "cache size" | sort | uniq
>>> cache size    : 4096 KB
>>> cpu cores    : 6
>>> model name    : Intel Core Processor (Haswell, no TSX)
>>> physical id    : 0
>>> physical id    : 1
>>> 
>>> 
>>> Any ideas what's different in the newer versions of Open MPI? Is the new
>>> behavior intended? If "mpiexec -np 3 --host loki:2,exin hello_1_mpi" is
>>> supposed to print my messages in versions "3.x" and "master" as well,
>>> regardless of the machine on which the programs are started, I would be
>>> grateful if somebody could fix the problem. Do you need anything else?
>>> Thank you very much for any help in advance.
>>> 
>>> 
>>> Kind regards
>>> 
>>> Siegmar
