Until the fixes pending in the big ORTE update PR are committed, I suggest not wasting time chasing this down. I tested the “patched” version of the 3.x branch, and it works just fine.
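In the meantime, slot counts can also be given in a hostfile instead of on the --host command line; a minimal sketch, using the standard Open MPI hostfile syntax (the file name "myhosts" is a placeholder, and the thread does not confirm whether this path avoids the v3.x regression, so treat it as something to try rather than a verified fix):

    $ cat myhosts
    loki slots=2
    exin slots=1
    $ mpiexec -np 3 --hostfile myhosts hello_1_mpi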
> On May 30, 2017, at 7:43 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Ralph,
>
> the issue Siegmar initially reported was
>
>     loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
>
> per what you wrote, this should be equivalent to
>
>     loki hello_1 111 mpiexec -np 3 --host loki:2,exin:1 hello_1_mpi
>
> and this is what I initially wanted to double-check (but I made a typo in my
> reply).
>
> Anyway, the logs Siegmar posted indicate the two commands produce the same
> output:
>
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 3 slots
> that were requested by the application:
>   hello_1_mpi
>
> Either request fewer slots for your application, or make more slots available
> for use.
> --------------------------------------------------------------------------
>
> To me, this is incorrect, since the command line made 3 slots available.
> Also, I am unable to reproduce any of these issues :-(
>
> Siegmar,
>
> can you please post your configure command line, and try these commands from
> loki:
>
>     mpiexec -np 3 --host loki:2,exin --mca plm_base_verbose 5 hostname
>     mpiexec -np 1 --host exin --mca plm_base_verbose 5 hostname
>     mpiexec -np 1 --host exin ldd ./hello_1_mpi
>
> If Open MPI is not installed on a shared filesystem (NFS, for example),
> please also double-check that both installs were built from the same source
> and with the same options.
>
> Cheers,
>
> Gilles
>
> On 5/30/2017 10:20 PM, r...@open-mpi.org wrote:
>> This behavior is as-expected. When you specify "-host foo,bar", you have
>> told us to assign one slot to each of those nodes. Thus, running 3 procs
>> exceeds the number of slots you assigned.
>>
>> You can tell it to set the #slots to the #cores it discovers on the node
>> by using "-host foo:*,bar:*".
>>
>> I cannot replicate your behavior of "-np 3 -host foo:2,bar:3" running
>> more than 3 procs.
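To make the slot arithmetic Ralph describes above concrete, here is a sketch of the intended semantics (foo and bar are his placeholder host names; this reflects the rules as stated above, not the buggy v3.x behavior under discussion):

    # "--host foo,bar" assigns one slot per listed host: 2 slots, so at most -np 2
    mpiexec -np 2 --host foo,bar hello_1_mpi

    # "--host foo:2,bar:1" assigns 2 + 1 = 3 slots, so -np 3 should be satisfied
    mpiexec -np 3 --host foo:2,bar:1 hello_1_mpi

    # "--host foo:*,bar:*" sets each host's slot count to its discovered core count
    mpiexec -np 6 --host foo:*,bar:* hello_1_mpi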
"loki" uses two >>> "Intel(R) Xeon(R) CPU E5-2620 v3" processors and "exin" two "Intel Core >>> Processor (Haswell, no TSX)" from QEMU. I can provide a pdf file with both >>> topologies (89 K) if you are interested in the output from lstopo. I've >>> added some runs. Most interesting in my opinion are the last two >>> "mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi" and >>> "mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi". >>> Why does mpiexec create five processes although I've asked for only three >>> processes? Why do I have to break the program with <Ctrl-c> for the first >>> of the above commands? >>> >>> >>> >>> loki hello_1 110 mpiexec --host loki:2,exin:1 -np 3 hello_1_mpi >>> -------------------------------------------------------------------------- >>> There are not enough slots available in the system to satisfy the 3 slots >>> that were requested by the application: >>> hello_1_mpi >>> >>> Either request fewer slots for your application, or make more slots >>> available >>> for use. >>> -------------------------------------------------------------------------- >>> >>> >>> >>> loki hello_1 111 mpiexec --host exin:3 -np 3 hello_1_mpi >>> Process 0 of 3 running on exin >>> Process 1 of 3 running on exin >>> Process 2 of 3 running on exin >>> ... >>> >>> >>> >>> loki hello_1 115 mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi >>> Process 1 of 3 running on loki >>> Process 0 of 3 running on loki >>> Process 2 of 3 running on loki >>> ... >>> >>> Process 0 of 3 running on exin >>> Process 1 of 3 running on exin >>> [exin][[52173,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:794:mca_btl_tcp_endpoint_complete_connect] >>> connect() to 193.xxx.xxx.xxx failed: Connection refused (111) >>> >>> ^Cloki hello_1 116 >>> >>> >>> >>> >>> loki hello_1 116 mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi >>> Process 0 of 3 running on loki >>> Process 2 of 3 running on loki >>> Process 1 of 3 running on loki >>> ... >>> Process 1 of 3 running on exin >>> Process 0 of 3 running on exin >>> [exin][[51638,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:590:mca_btl_tcp_endpoint_recv_blocking] >>> recv(16, 0/8) failed: Connection reset by peer (104) >>> [exin:31909] >>> ../../../../../openmpi-v3.x-201705250239-d5200ea/ompi/mca/pml/ob1/pml_ob1_sendreq.c:191 >>> FATAL >>> loki hello_1 117 >>> >>> >>> Do you need anything else? >>> >>> >>> Kind regards and thank you very much for your help >>> >>> Siegmar >>> >>> >>> >>>> Cheers, >>>> Gilles >>>> ----- Original Message ----- >>>>> Hi, >>>>> >>>>> I have installed openmpi-v3.x-201705250239-d5200ea on my "SUSE Linux >>>>> Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-7.1.0. >>>>> Depending on the machine that I use to start my processes, I have >>>>> a problem with "--host" for versions "v3.x" and "master", while >>>>> everything works as expected with earlier versions. >>>>> >>>>> >>>>> loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi >>>>> ---------------------------------------------------------------------- >>>> ---- >>>>> There are not enough slots available in the system to satisfy the 3 >>>> slots >>>>> that were requested by the application: >>>>> hello_1_mpi >>>>> >>>>> Either request fewer slots for your application, or make more slots >>>> available >>>>> for use. >>>>> ---------------------------------------------------------------------- >>>> ---- >>>>> >>>>> >>>>> Everything is ok if I use the same command on "exin". 
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> ----- Original Message -----
>>>>> Hi,
>>>>>
>>>>> I have installed openmpi-v3.x-201705250239-d5200ea on my "SUSE Linux
>>>>> Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-7.1.0.
>>>>> Depending on the machine that I use to start my processes, I have a
>>>>> problem with "--host" for versions "v3.x" and "master", while
>>>>> everything works as expected with earlier versions.
>>>>>
>>>>> loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
>>>>> --------------------------------------------------------------------------
>>>>> There are not enough slots available in the system to satisfy the 3
>>>>> slots that were requested by the application:
>>>>>   hello_1_mpi
>>>>>
>>>>> Either request fewer slots for your application, or make more slots
>>>>> available for use.
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Everything is OK if I use the same command on "exin".
>>>>>
>>>>> exin fd1026 107 mpiexec -np 3 --host loki:2,exin hello_1_mpi
>>>>> Process 0 of 3 running on loki
>>>>> Process 1 of 3 running on loki
>>>>> Process 2 of 3 running on exin
>>>>> ...
>>>>>
>>>>> Everything is also OK if I use openmpi-v2.x-201705260340-58c6b3c on
>>>>> "loki".
>>>>>
>>>>> loki hello_1 114 which mpiexec
>>>>> /usr/local/openmpi-2.1.2_64_cc/bin/mpiexec
>>>>> loki hello_1 115 mpiexec -np 3 --host loki:2,exin hello_1_mpi
>>>>> Process 0 of 3 running on loki
>>>>> Process 1 of 3 running on loki
>>>>> Process 2 of 3 running on exin
>>>>> ...
>>>>>
>>>>> "exin" is a virtual machine on QEMU, so it uses a slightly different
>>>>> processor architecture; e.g., it has no L3 cache but larger L2 caches.
>>>>>
>>>>> loki fd1026 117 cat /proc/cpuinfo | grep -e "model name" -e "physical
>>>>> id" -e "cpu cores" -e "cache size" | sort | uniq
>>>>> cache size  : 15360 KB
>>>>> cpu cores   : 6
>>>>> model name  : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>>>>> physical id : 0
>>>>> physical id : 1
>>>>>
>>>>> loki fd1026 118 ssh exin cat /proc/cpuinfo | grep -e "model name" -e
>>>>> "physical id" -e "cpu cores" -e "cache size" | sort | uniq
>>>>> cache size  : 4096 KB
>>>>> cpu cores   : 6
>>>>> model name  : Intel Core Processor (Haswell, no TSX)
>>>>> physical id : 0
>>>>> physical id : 1
>>>>>
>>>>> Any ideas what's different in the newer versions of Open MPI? Is the
>>>>> new behavior intended? I would be grateful if somebody could fix the
>>>>> problem, so that "mpiexec -np 3 --host loki:2,exin hello_1_mpi" prints
>>>>> my messages in versions "v3.x" and "master" as well, no matter which
>>>>> machine the programs are started on. Do you need anything else? Thank
>>>>> you very much for any help in advance.
>>>>>
>>>>> Kind regards
>>>>>
>>>>> Siegmar

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users