This behavior is as-expected. When you specify "-host foo,bar", you have told us to assign one slot to each of those nodes. Thus, running 3 procs exceeds the number of slots you assigned.
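In other words, the launcher derives a slot count from each entry in the -host list: a bare hostname contributes one slot, "name:N" contributes N, and "name:*" contributes as many slots as cores are discovered on that node. A rough model of that counting rule (a hypothetical Python sketch, not Open MPI's actual code; the per-node core counts are supplied by hand here in place of real discovery):

```python
# Hypothetical sketch of the slot-counting rule for a -host list.
# NOT Open MPI's real implementation; it only models the behavior
# described above:
#   "foo"    -> 1 slot (default when no count is given)
#   "foo:4"  -> 4 slots (explicit count)
#   "foo:*"  -> as many slots as cores discovered on that node

def count_slots(host_arg, cores_on_node):
    """cores_on_node: dict mapping hostname -> detected core count."""
    total = 0
    for spec in host_arg.split(","):
        if ":" in spec:
            node, count = spec.split(":", 1)
            # ":*" means use every core discovered on that node
            total += cores_on_node[node] if count == "*" else int(count)
        else:
            total += 1  # bare hostname contributes exactly one slot
    return total

cores = {"foo": 6, "bar": 6}
print(count_slots("foo,bar", cores))      # 2  -> "-np 3" oversubscribes
print(count_slots("foo:*,bar:*", cores))  # 12 -> "-np 3" fits
```

Under this model, "-host foo,bar" yields 2 slots, which is why asking for 3 procs is rejected, while "-host foo:*,bar:*" on six-core nodes yields 12.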
You can tell it to set the #slots to the #cores it discovers on the node by using "-host foo:*,bar:*". I cannot replicate your behavior of "-np 3 -host foo:2,bar:3" running more than 3 procs.

> On May 30, 2017, at 5:24 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:
>
> Hi Gilles,
>
>> what if you ?
>> mpiexec --host loki:1,exin:1 -np 3 hello_1_mpi
>
> I need as many slots as processes so that I use "-np 2".
> "mpiexec --host loki,exin -np 2 hello_1_mpi" works as well. The command
> breaks, if I use at least "-np 3" and distribute the processes across at
> least two machines.
>
> loki hello_1 118 mpiexec --host loki:1,exin:1 -np 2 hello_1_mpi
> Process 0 of 2 running on loki
> Process 1 of 2 running on exin
> Now 1 slave tasks are sending greetings.
> Greetings from task 1:
> message type: 3
> msg length: 131 characters
> message:
> hostname: exin
> operating system: Linux
> release: 4.4.49-92.11-default
> processor: x86_64
> loki hello_1 119
>
>
>> are loki and exin different ? (os, sockets, core)
>
> Yes, loki is a real machine and exin is a virtual one. "exin" uses a newer
> kernel.
>
> loki fd1026 108 uname -a
> Linux loki 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016 (2d3e9d4) x86_64 x86_64 x86_64 GNU/Linux
>
> loki fd1026 109 ssh exin uname -a
> Linux exin 4.4.49-92.11-default #1 SMP Fri Feb 17 08:29:30 UTC 2017 (8f9478a) x86_64 x86_64 x86_64 GNU/Linux
> loki fd1026 110
>
> The number of sockets and cores is identical, but the processor types are
> different as you can see at the end of my previous email. "loki" uses two
> "Intel(R) Xeon(R) CPU E5-2620 v3" processors and "exin" two "Intel Core
> Processor (Haswell, no TSX)" from QEMU. I can provide a pdf file with both
> topologies (89 K) if you are interested in the output from lstopo. I've
> added some runs. Most interesting in my opinion are the last two
> "mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi" and
> "mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi".
> Why does mpiexec create five processes although I've asked for only three
> processes? Why do I have to break the program with <Ctrl-c> for the first
> of the above commands?
>
>
> loki hello_1 110 mpiexec --host loki:2,exin:1 -np 3 hello_1_mpi
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 3 slots
> that were requested by the application:
> hello_1_mpi
>
> Either request fewer slots for your application, or make more slots available
> for use.
> --------------------------------------------------------------------------
>
>
> loki hello_1 111 mpiexec --host exin:3 -np 3 hello_1_mpi
> Process 0 of 3 running on exin
> Process 1 of 3 running on exin
> Process 2 of 3 running on exin
> ...
>
>
> loki hello_1 115 mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi
> Process 1 of 3 running on loki
> Process 0 of 3 running on loki
> Process 2 of 3 running on loki
> ...
> Process 0 of 3 running on exin
> Process 1 of 3 running on exin
> [exin][[52173,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:794:mca_btl_tcp_endpoint_complete_connect]
> connect() to 193.xxx.xxx.xxx failed: Connection refused (111)
>
> ^Cloki hello_1 116
>
>
> loki hello_1 116 mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi
> Process 0 of 3 running on loki
> Process 2 of 3 running on loki
> Process 1 of 3 running on loki
> ...
> Process 1 of 3 running on exin
> Process 0 of 3 running on exin
> [exin][[51638,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:590:mca_btl_tcp_endpoint_recv_blocking]
> recv(16, 0/8) failed: Connection reset by peer (104)
> [exin:31909] ../../../../../openmpi-v3.x-201705250239-d5200ea/ompi/mca/pml/ob1/pml_ob1_sendreq.c:191 FATAL
> loki hello_1 117
>
>
> Do you need anything else?
>
>
> Kind regards and thank you very much for your help
>
> Siegmar
>
>
>> Cheers,
>> Gilles
>>
>> ----- Original Message -----
>>> Hi,
>>>
>>> I have installed openmpi-v3.x-201705250239-d5200ea on my "SUSE Linux
>>> Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-7.1.0.
>>> Depending on the machine that I use to start my processes, I have
>>> a problem with "--host" for versions "v3.x" and "master", while
>>> everything works as expected with earlier versions.
>>>
>>>
>>> loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
>>> --------------------------------------------------------------------------
>>> There are not enough slots available in the system to satisfy the 3 slots
>>> that were requested by the application:
>>> hello_1_mpi
>>>
>>> Either request fewer slots for your application, or make more slots available
>>> for use.
>>> --------------------------------------------------------------------------
>>>
>>>
>>> Everything is ok if I use the same command on "exin".
>>>
>>> exin fd1026 107 mpiexec -np 3 --host loki:2,exin hello_1_mpi
>>> Process 0 of 3 running on loki
>>> Process 1 of 3 running on loki
>>> Process 2 of 3 running on exin
>>> ...
>>>
>>>
>>> Everything is also ok if I use openmpi-v2.x-201705260340-58c6b3c on "loki".
>>>
>>> loki hello_1 114 which mpiexec
>>> /usr/local/openmpi-2.1.2_64_cc/bin/mpiexec
>>> loki hello_1 115 mpiexec -np 3 --host loki:2,exin hello_1_mpi
>>> Process 0 of 3 running on loki
>>> Process 1 of 3 running on loki
>>> Process 2 of 3 running on exin
>>> ...
>>>
>>>
>>> "exin" is a virtual machine on QEMU so that it uses a slightly different
>>> processor architecture, e.g., it has no L3 cache but larger L2 caches.
>>>
>>> loki fd1026 117 cat /proc/cpuinfo | grep -e "model name" -e "physical id" -e "cpu cores" -e "cache size" | sort | uniq
>>> cache size : 15360 KB
>>> cpu cores : 6
>>> model name : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>>> physical id : 0
>>> physical id : 1
>>>
>>>
>>> loki fd1026 118 ssh exin cat /proc/cpuinfo | grep -e "model name" -e "physical id" -e "cpu cores" -e "cache size" | sort | uniq
>>> cache size : 4096 KB
>>> cpu cores : 6
>>> model name : Intel Core Processor (Haswell, no TSX)
>>> physical id : 0
>>> physical id : 1
>>>
>>>
>>> Any ideas what's different in the newer versions of Open MPI? Is the new
>>> behavior intended? I would be grateful, if somebody can fix the problem,
>>> if "mpiexec -np 3 --host loki:2,exin hello_1_mpi" should print my messages
>>> in versions "3.x" and "master" as well, if the programs are started on any
>>> machine. Do you need anything else? Thank you very much for any help in
>>> advance.
>>>
>>>
>>> Kind regards
>>>
>>> Siegmar

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users