Until the fixes pending in the big ORTE update PR are committed, I suggest not 
wasting time chasing this down. I tested the “patched” version of the 3.x 
branch, and it works just fine.


> On May 30, 2017, at 7:43 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Ralph,
> 
> 
> the issue Siegmar initially reported was
> 
> loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
> 
> 
> per what you wrote, this should be equivalent to
> 
> loki hello_1 111 mpiexec -np 3 --host loki:2,exin:1 hello_1_mpi
> 
> and this is what I initially wanted to double check (but I made a typo in my 
> reply)
> 
> 
> anyway, the logs Siegmar posted indicate the two commands produce the same 
> output
> 
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 3 slots
> that were requested by the application:
>  hello_1_mpi
> 
> Either request fewer slots for your application, or make more slots available
> for use.
> --------------------------------------------------------------------------
> 
> 
> to me, this is incorrect since the command line made 3 slots available.
> also, I am unable to reproduce any of these issues :-(
> 
> 
> 
> Siegmar,
> 
> can you please post your configure command line, and try these commands from 
> loki
> 
> mpiexec -np 3 --host loki:2,exin --mca plm_base_verbose 5 hostname
> mpiexec -np 1 --host exin --mca plm_base_verbose 5 hostname
> mpiexec -np 1 --host exin ldd ./hello_1_mpi
> 
> if Open MPI is not installed on a shared filesystem (NFS for example), please 
> also double check that both installs were built from the same source and with 
> the same options.
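> 
> for example, assuming ompi_info is in the PATH on both nodes, something like
> 
> ompi_info | grep -i configure
> ssh exin ompi_info | grep -i configure
> 
> run from loki should show whether the two builds were configured the same way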
> 
> 
> Cheers,
> 
> Gilles
> On 5/30/2017 10:20 PM, r...@open-mpi.org wrote:
>> This behavior is as expected. When you specify "-host foo,bar", you have 
>> told us to assign one slot to each of those nodes. Thus, running 3 procs 
>> exceeds the number of slots you assigned.
>> 
>> You can tell it to set the #slots to the #cores it discovers on the node by 
>> using "-host foo:*,bar:*".
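>> 
>> Concretely, with your hosts that would mean something like:
>> 
>>   mpiexec -np 3 -host loki,exin hello_1_mpi        # 1+1 = 2 slots, so 3 procs is too many
>>   mpiexec -np 3 -host loki:2,exin:1 hello_1_mpi    # 2+1 = 3 slots, so this should run
>>   mpiexec -np 3 -host loki:*,exin:* hello_1_mpi    # slots = #cores detected on each node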
>> 
>> I cannot replicate your behavior of "-np 3 -host foo:2,bar:3" running more 
>> than 3 procs.
>> 
>> 
>>> On May 30, 2017, at 5:24 AM, Siegmar Gross 
>>> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>>> 
>>> Hi Gilles,
>>> 
>>>> what if you run
>>>> mpiexec --host loki:1,exin:1 -np 3 hello_1_mpi ?
>>> I need as many slots as processes, so I use "-np 2".
>>> "mpiexec --host loki,exin -np 2 hello_1_mpi" works as well. The command
>>> breaks if I use at least "-np 3" and distribute the processes across at
>>> least two machines.
>>> 
>>> loki hello_1 118 mpiexec --host loki:1,exin:1 -np 2 hello_1_mpi
>>> Process 0 of 2 running on loki
>>> Process 1 of 2 running on exin
>>> Now 1 slave tasks are sending greetings.
>>> Greetings from task 1:
>>>  message type:        3
>>>  msg length:          131 characters
>>>  message:
>>>    hostname:          exin
>>>    operating system:  Linux
>>>    release:           4.4.49-92.11-default
>>>    processor:         x86_64
>>> loki hello_1 119
>>> 
>>> 
>>> 
>>>> are loki and exin different? (OS, sockets, cores)
>>> Yes, loki is a real machine and exin is a virtual one. "exin" uses a newer
>>> kernel.
>>> 
>>> loki fd1026 108 uname -a
>>> Linux loki 4.4.38-93-default #1 SMP Wed Dec 14 12:59:43 UTC 2016 (2d3e9d4) 
>>> x86_64 x86_64 x86_64 GNU/Linux
>>> 
>>> loki fd1026 109 ssh exin uname -a
>>> Linux exin 4.4.49-92.11-default #1 SMP Fri Feb 17 08:29:30 UTC 2017 
>>> (8f9478a) x86_64 x86_64 x86_64 GNU/Linux
>>> loki fd1026 110
>>> 
>>> The number of sockets and cores is identical, but the processor types are
>>> different as you can see at the end of my previous email. "loki" uses two
>>> "Intel(R) Xeon(R) CPU E5-2620 v3" processors and "exin" two "Intel Core
>>> Processor (Haswell, no TSX)" from QEMU. I can provide a pdf file with both
>>> topologies (89 K) if you are interested in the output from lstopo. I've
>>> added some runs. The most interesting, in my opinion, are the last two:
>>> "mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi" and
>>> "mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi".
>>> Why does mpiexec create five processes although I've asked for only three
>>> processes? Why do I have to break the program with <Ctrl-c> for the first
>>> of the above commands?
>>> 
>>> 
>>> 
>>> loki hello_1 110 mpiexec --host loki:2,exin:1 -np 3 hello_1_mpi
>>> --------------------------------------------------------------------------
>>> There are not enough slots available in the system to satisfy the 3 slots
>>> that were requested by the application:
>>>  hello_1_mpi
>>> 
>>> Either request fewer slots for your application, or make more slots 
>>> available
>>> for use.
>>> --------------------------------------------------------------------------
>>> 
>>> 
>>> 
>>> loki hello_1 111 mpiexec --host exin:3 -np 3 hello_1_mpi
>>> Process 0 of 3 running on exin
>>> Process 1 of 3 running on exin
>>> Process 2 of 3 running on exin
>>> ...
>>> 
>>> 
>>> 
>>> loki hello_1 115 mpiexec --host exin:2,loki:3 -np 3 hello_1_mpi
>>> Process 1 of 3 running on loki
>>> Process 0 of 3 running on loki
>>> Process 2 of 3 running on loki
>>> ...
>>> 
>>> Process 0 of 3 running on exin
>>> Process 1 of 3 running on exin
>>> [exin][[52173,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:794:mca_btl_tcp_endpoint_complete_connect]
>>>  connect() to 193.xxx.xxx.xxx failed: Connection refused (111)
>>> 
>>> ^Cloki hello_1 116
>>> 
>>> 
>>> 
>>> 
>>> loki hello_1 116 mpiexec -np 3 --host exin:2,loki:3 hello_1_mpi
>>> Process 0 of 3 running on loki
>>> Process 2 of 3 running on loki
>>> Process 1 of 3 running on loki
>>> ...
>>> Process 1 of 3 running on exin
>>> Process 0 of 3 running on exin
>>> [exin][[51638,1],1][../../../../../openmpi-v3.x-201705250239-d5200ea/opal/mca/btl/tcp/btl_tcp_endpoint.c:590:mca_btl_tcp_endpoint_recv_blocking]
>>>  recv(16, 0/8) failed: Connection reset by peer (104)
>>> [exin:31909] 
>>> ../../../../../openmpi-v3.x-201705250239-d5200ea/ompi/mca/pml/ob1/pml_ob1_sendreq.c:191
>>>  FATAL
>>> loki hello_1 117
>>> 
>>> 
>>> Do you need anything else?
>>> 
>>> 
>>> Kind regards and thank you very much for your help
>>> 
>>> Siegmar
>>> 
>>> 
>>> 
>>>> Cheers,
>>>> Gilles
>>>> ----- Original Message -----
>>>>> Hi,
>>>>> 
>>>>> I have installed openmpi-v3.x-201705250239-d5200ea on my "SUSE Linux
>>>>> Enterprise Server 12.2 (x86_64)" with Sun C 5.14 and gcc-7.1.0.
>>>>> Depending on the machine that I use to start my processes, I have
>>>>> a problem with "--host" for versions "v3.x" and "master", while
>>>>> everything works as expected with earlier versions.
>>>>> 
>>>>> 
>>>>> loki hello_1 111 mpiexec -np 3 --host loki:2,exin hello_1_mpi
>>>>> --------------------------------------------------------------------------
>>>>> There are not enough slots available in the system to satisfy the 3 slots
>>>>> that were requested by the application:
>>>>>    hello_1_mpi
>>>>> 
>>>>> Either request fewer slots for your application, or make more slots available
>>>>> for use.
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> 
>>>>> Everything is ok if I use the same command on "exin".
>>>>> 
>>>>> exin fd1026 107 mpiexec -np 3 --host loki:2,exin hello_1_mpi
>>>>> Process 0 of 3 running on loki
>>>>> Process 1 of 3 running on loki
>>>>> Process 2 of 3 running on exin
>>>>> ...
>>>>> 
>>>>> 
>>>>> 
>>>>> Everything is also ok if I use openmpi-v2.x-201705260340-58c6b3c on "loki".
>>>>> 
>>>>> loki hello_1 114 which mpiexec
>>>>> /usr/local/openmpi-2.1.2_64_cc/bin/mpiexec
>>>>> loki hello_1 115 mpiexec -np 3 --host loki:2,exin hello_1_mpi
>>>>> Process 0 of 3 running on loki
>>>>> Process 1 of 3 running on loki
>>>>> Process 2 of 3 running on exin
>>>>> ...
>>>>> 
>>>>> 
>>>>> "exin" is a virtual machine on QEMU so that it uses a slightly
>>>> different
>>>>> processor architecture, e.g., it has no L3 cache but larger L2 caches.
>>>>> 
>>>>> loki fd1026 117 cat /proc/cpuinfo | grep -e "model name" -e "physical id" -e "cpu cores" -e "cache size" | sort | uniq
>>>>> cache size    : 15360 KB
>>>>> cpu cores    : 6
>>>>> model name    : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
>>>>> physical id    : 0
>>>>> physical id    : 1
>>>>> 
>>>>> 
>>>>> loki fd1026 118 ssh exin cat /proc/cpuinfo | grep -e "model name" -e "physical id" -e "cpu cores" -e "cache size" | sort | uniq
>>>>> cache size    : 4096 KB
>>>>> cpu cores    : 6
>>>>> model name    : Intel Core Processor (Haswell, no TSX)
>>>>> physical id    : 0
>>>>> physical id    : 1
>>>>> 
>>>>> 
>>>>> Any ideas what's different in the newer versions of Open MPI? Is the new
>>>>> behavior intended? If "mpiexec -np 3 --host loki:2,exin hello_1_mpi" is
>>>>> supposed to print my messages in versions "3.x" and "master" as well, no
>>>>> matter which machine the programs are started on, I would be grateful if
>>>>> somebody could fix the problem. Do you need anything else? Thank you very
>>>>> much for any help in advance.
>>>>> 
>>>>> 
>>>>> Kind regards
>>>>> 
>>>>> Siegmar

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
