Re: [OMPI users] long initialization

2014-08-28 Thread Timur Ismagilov

In OMPI 1.9a1r32604 I get much better results:
$ time mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
semenov@compiler-2 Distribution, ident: 1.9a1r32604, repo rev: r32604, Aug 26, 
2014 (nightly snapshot tarball), 146)
real 0m4.166s
user 0m0.034s
sys 0m0.079s
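
Note: "time mpirun" measures daemon launch and MPI_Init together. As a point of
reference, here is a minimal sketch of a timed hello program (an illustration
only: the name hello_timed.c is made up and this is not the stock
examples/hello_c.c) that reports the in-process MPI_Init cost separately, so
launch overhead can be told apart from initialization time:

/* hello_timed.c (hypothetical): prints rank/size plus how long MPI_Init took,
 * so mpirun launch overhead can be separated from initialization cost. */
#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);            /* wall-clock time in seconds */
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(int argc, char **argv)
{
    double t0 = now();
    MPI_Init(&argc, &argv);             /* the phase this thread is timing */
    double t1 = now();

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world, I am %d of %d (MPI_Init took %.3f s)\n",
           rank, size, t1 - t0);

    MPI_Finalize();
    return 0;
}

$ mpicc hello_timed.c -o hello_timed && time mpirun -np 1 ./hello_timed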


Thu, 28 Aug 2014 13:10:02 +0400 from Timur Ismagilov:
>I enclose 2 files with the output of the two following commands (OMPI 1.9a1r32570)
>$time mpirun --leave-session-attached -mca oob_base_verbose 100 -np 1 
>./hello_c >& out1.txt 
>(the "Hello, world, I am ..." line is printed)
>real 1m3.952s
>user 0m0.035s
>sys 0m0.107s
>$time mpirun --leave-session-attached -mca oob_base_verbose 100 --mca 
>oob_tcp_if_include ib0 -np 1 ./hello_c >& out2.txt 
>(no "Hello, world, I am ..." line is printed)
>real 0m9.337s
>user 0m0.059s
>sys 0m0.098s
>Wed, 27 Aug 2014 06:31:02 -0700 from Ralph Castain:
>>How bizarre. Please add "--leave-session-attached -mca oob_base_verbose 100" 
>>to your cmd line
>>
>>On Aug 27, 2014, at 4:31 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>When I try to specify the OOB interface with --mca oob_tcp_if_include <interface name from ifconfig>, I always get this error:
>>>$ mpirun  --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>>>--
>>>An ORTE daemon has unexpectedly failed after launch and before
>>>communicating back to mpirun. This could be caused by a number
>>>of factors, including an inability to create a connection back
>>>to mpirun due to a lack of common network interfaces and/or no
>>>route found between them. Please check network connectivity
>>>(including firewalls and network routing requirements).
>>>-
>>>
>>>Earlier, with OMPI 1.8.1, I could not run MPI jobs without "--mca 
>>>oob_tcp_if_include ib0", but now (OMPI 1.9a1) I get the above error with 
>>>this flag.
>>>
>>>Here is the output of ifconfig:
>>>$ ifconfig
>>>eth1 Link encap:Ethernet HWaddr 00:15:17:EE:89:E1  
>>>inet addr:10.0.251.53 Bcast:10.0.251.255 Mask:255.255.255.0
>>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>RX packets:215087433 errors:0 dropped:0 overruns:0 frame:0
>>>TX packets:2648 errors:0 dropped:0 overruns:0 carrier:0
>>>collisions:0 txqueuelen:1000  
>>>RX bytes:26925754883 (25.0 GiB) TX bytes:137971 (134.7 KiB)
>>>Memory:b2c0-b2c2
>>>eth2 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8  
>>>inet addr:10.0.0.4 Bcast:10.0.0.255 Mask:255.255.255.0
>>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>RX packets:4892833125 errors:0 dropped:0 overruns:0 frame:0
>>>TX packets:8708606918 errors:0 dropped:0 overruns:0 carrier:0
>>>collisions:0 txqueuelen:1000  
>>>RX bytes:1823986502132 (1.6 TiB) TX bytes:11957754120037 (10.8 TiB)
>>>eth2.911 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8  
>>>inet addr:93.180.7.38 Bcast:93.180.7.63 Mask:255.255.255.224
>>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>RX packets:3746454225 errors:0 dropped:0 overruns:0 frame:0
>>>TX packets:1131917608 errors:0 dropped:3 overruns:0 carrier:0
>>>collisions:0 txqueuelen:0  
>>>RX bytes:285174723322 (265.5 GiB) TX bytes:11523163526058 (10.4 TiB)
>>>eth3 Link encap:Ethernet HWaddr 00:02:C9:04:73:F9  
>>>inet addr:10.2.251.14 Bcast:10.2.251.255 Mask:255.255.255.0
>>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>RX packets:591156692 errors:0 dropped:56 overruns:56 frame:56
>>>TX packets:679729229 errors:0 dropped:0 overruns:0 carrier:0
>>>collisions:0 txqueuelen:1000  
>>>RX bytes:324195989293 (301.9 GiB) TX bytes:770299202886 (717.3 GiB)
>>>Ifconfig uses the ioctl access method to get the full address information, 
>>>which limits hardware addresses to 8 bytes.
>>>Because Infiniband address has 20 bytes, only the first 8 bytes are 
>>>displayed correctly.
>>>Ifconfig is obsolete! For replacement check ip.
>>>ib0 Link encap:InfiniBand HWaddr 
>>>80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
>>>inet addr:10.128.0.4 Bcast:10.128.255.255 Mask:255.255.0.0
>>>UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
>>>RX packets:10843859 errors:0 dropped:0 overruns:0 frame:0
>>>TX packets:8089839 errors:0 dropped:15 overruns:0 carrier:0
>>>collisions:0 txqueuelen:1024  
>>>RX bytes:939249464 (895.7 MiB) TX bytes:886054008 (845.0 MiB)
>>>lo Link encap:Local Loopback  
>>>inet addr:127.0.0.1 Mask:255.0.0.0
>>>UP LOOPBACK RUNNING MTU:16436 Metric:1
>>>RX packets:31235107 errors:0 dropped:0 overruns:0 frame:0
>>>TX packets:31235107 errors:0 dropped:0 overruns:0 carrier:0
>>>collisions:0 txqueuelen:0  
>>>RX bytes:132750916041 (123.6 GiB) TX bytes:132750916041 (123.6 GiB)
>>>
>>>
>>>
>>>Tue, 26 Aug 2014 09:48:35 -0700 from Ralph Castain < r...@open-mpi.org >:
I think something may be messed up with your installation. I went ahead and 
tested this on a Slurm 2.5.4 cluster, and got the following:

$ time mpirun -np 1 --host bend001 ./hello
Hello, World, I am 0 of 1 

Re: [OMPI users] long initialization

2014-08-28 Thread Timur Ismagilov

I enclose 2 files with the output of the two following commands (OMPI 1.9a1r32570)
$time mpirun --leave-session-attached -mca oob_base_verbose 100 -np 1 ./hello_c 
>& out1.txt 
(the "Hello, world, I am ..." line is printed)
real 1m3.952s
user 0m0.035s
sys 0m0.107s
$time mpirun --leave-session-attached -mca oob_base_verbose 100 --mca 
oob_tcp_if_include ib0 -np 1 ./hello_c >& out2.txt 
(no "Hello, world, I am ..." line is printed)
real 0m9.337s
user 0m0.059s
sys 0m0.098s
Wed, 27 Aug 2014 06:31:02 -0700 from Ralph Castain:
>How bizarre. Please add "--leave-session-attached -mca oob_base_verbose 100" 
>to your cmd line
>
>On Aug 27, 2014, at 4:31 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>When I try to specify the OOB interface with --mca oob_tcp_if_include <interface name from ifconfig>, I always get this error:
>>$ mpirun  --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>>--
>>An ORTE daemon has unexpectedly failed after launch and before
>>communicating back to mpirun. This could be caused by a number
>>of factors, including an inability to create a connection back
>>to mpirun due to a lack of common network interfaces and/or no
>>route found between them. Please check network connectivity
>>(including firewalls and network routing requirements).
>>-
>>
>>Earlier, with OMPI 1.8.1, I could not run MPI jobs without "--mca 
>>oob_tcp_if_include ib0", but now (OMPI 1.9a1) I get the above error with 
>>this flag.
>>
>>Here is the output of ifconfig:
>>$ ifconfig
>>eth1 Link encap:Ethernet HWaddr 00:15:17:EE:89:E1  
>>inet addr:10.0.251.53 Bcast:10.0.251.255 Mask:255.255.255.0
>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>RX packets:215087433 errors:0 dropped:0 overruns:0 frame:0
>>TX packets:2648 errors:0 dropped:0 overruns:0 carrier:0
>>collisions:0 txqueuelen:1000  
>>RX bytes:26925754883 (25.0 GiB) TX bytes:137971 (134.7 KiB)
>>Memory:b2c0-b2c2
>>eth2 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8  
>>inet addr:10.0.0.4 Bcast:10.0.0.255 Mask:255.255.255.0
>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>RX packets:4892833125 errors:0 dropped:0 overruns:0 frame:0
>>TX packets:8708606918 errors:0 dropped:0 overruns:0 carrier:0
>>collisions:0 txqueuelen:1000  
>>RX bytes:1823986502132 (1.6 TiB) TX bytes:11957754120037 (10.8 TiB)
>>eth2.911 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8  
>>inet addr:93.180.7.38 Bcast:93.180.7.63 Mask:255.255.255.224
>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>RX packets:3746454225 errors:0 dropped:0 overruns:0 frame:0
>>TX packets:1131917608 errors:0 dropped:3 overruns:0 carrier:0
>>collisions:0 txqueuelen:0  
>>RX bytes:285174723322 (265.5 GiB) TX bytes:11523163526058 (10.4 TiB)
>>eth3 Link encap:Ethernet HWaddr 00:02:C9:04:73:F9  
>>inet addr:10.2.251.14 Bcast:10.2.251.255 Mask:255.255.255.0
>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>RX packets:591156692 errors:0 dropped:56 overruns:56 frame:56
>>TX packets:679729229 errors:0 dropped:0 overruns:0 carrier:0
>>collisions:0 txqueuelen:1000  
>>RX bytes:324195989293 (301.9 GiB) TX bytes:770299202886 (717.3 GiB)
>>Ifconfig uses the ioctl access method to get the full address information, 
>>which limits hardware addresses to 8 bytes.
>>Because Infiniband address has 20 bytes, only the first 8 bytes are displayed 
>>correctly.
>>Ifconfig is obsolete! For replacement check ip.
>>ib0 Link encap:InfiniBand HWaddr 
>>80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
>>inet addr:10.128.0.4 Bcast:10.128.255.255 Mask:255.255.0.0
>>UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
>>RX packets:10843859 errors:0 dropped:0 overruns:0 frame:0
>>TX packets:8089839 errors:0 dropped:15 overruns:0 carrier:0
>>collisions:0 txqueuelen:1024  
>>RX bytes:939249464 (895.7 MiB) TX bytes:886054008 (845.0 MiB)
>>lo Link encap:Local Loopback  
>>inet addr:127.0.0.1 Mask:255.0.0.0
>>UP LOOPBACK RUNNING MTU:16436 Metric:1
>>RX packets:31235107 errors:0 dropped:0 overruns:0 frame:0
>>TX packets:31235107 errors:0 dropped:0 overruns:0 carrier:0
>>collisions:0 txqueuelen:0  
>>RX bytes:132750916041 (123.6 GiB) TX bytes:132750916041 (123.6 GiB)
>>
>>
>>
>>Tue, 26 Aug 2014 09:48:35 -0700 from Ralph Castain < r...@open-mpi.org >:
>>>I think something may be messed up with your installation. I went ahead and 
>>>tested this on a Slurm 2.5.4 cluster, and got the following:
>>>
>>>$ time mpirun -np 1 --host bend001 ./hello
>>>Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12
>>>
>>>real 0m0.086s
>>>user 0m0.039s
>>>sys 0m0.046s
>>>
>>>$ time mpirun -np 1 --host bend002 ./hello
>>>Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12
>>>
>>>real 0m0.528s
>>>user 0m0.021s
>>>sys 0m0.023s
>>>
>>>Which is what I would have expected. With --host set to the local host, no 
>>>daemons are being launched and so the time is quite short (just spent 
>>>mapping and fork/exec). With --host set to a single remote host, 

Re: [OMPI users] long initialization

2014-08-27 Thread Ralph Castain
How bizarre. Please add "--leave-session-attached -mca oob_base_verbose 100" to 
your cmd line

On Aug 27, 2014, at 4:31 AM, Timur Ismagilov  wrote:

> When I try to specify the OOB interface with --mca oob_tcp_if_include <interface name from ifconfig>, I always get this error:
> 
> $ mpirun  --mca oob_tcp_if_include ib0 -np 1 ./hello_c
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> -
> 
> Earlier, with OMPI 1.8.1, I could not run MPI jobs without "--mca 
> oob_tcp_if_include ib0", but now (OMPI 1.9a1) I get the above error with 
> this flag.
> 
> Here is the output of ifconfig:
> 
> $ ifconfig
> eth1 Link encap:Ethernet HWaddr 00:15:17:EE:89:E1 
> inet addr:10.0.251.53 Bcast:10.0.251.255 Mask:255.255.255.0
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:215087433 errors:0 dropped:0 overruns:0 frame:0
> TX packets:2648 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000 
> RX bytes:26925754883 (25.0 GiB) TX bytes:137971 (134.7 KiB)
> Memory:b2c0-b2c2
> 
> eth2 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8 
> inet addr:10.0.0.4 Bcast:10.0.0.255 Mask:255.255.255.0
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:4892833125 errors:0 dropped:0 overruns:0 frame:0
> TX packets:8708606918 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000 
> RX bytes:1823986502132 (1.6 TiB) TX bytes:11957754120037 (10.8 TiB)
> 
> eth2.911 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8 
> inet addr:93.180.7.38 Bcast:93.180.7.63 Mask:255.255.255.224
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:3746454225 errors:0 dropped:0 overruns:0 frame:0
> TX packets:1131917608 errors:0 dropped:3 overruns:0 carrier:0
> collisions:0 txqueuelen:0 
> RX bytes:285174723322 (265.5 GiB) TX bytes:11523163526058 (10.4 TiB)
> 
> eth3 Link encap:Ethernet HWaddr 00:02:C9:04:73:F9 
> inet addr:10.2.251.14 Bcast:10.2.251.255 Mask:255.255.255.0
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:591156692 errors:0 dropped:56 overruns:56 frame:56
> TX packets:679729229 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:1000 
> RX bytes:324195989293 (301.9 GiB) TX bytes:770299202886 (717.3 GiB)
> 
> Ifconfig uses the ioctl access method to get the full address information, 
> which limits hardware addresses to 8 bytes.
> Because Infiniband address has 20 bytes, only the first 8 bytes are displayed 
> correctly.
> Ifconfig is obsolete! For replacement check ip.
> ib0 Link encap:InfiniBand HWaddr 
> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 
> inet addr:10.128.0.4 Bcast:10.128.255.255 Mask:255.255.0.0
> UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
> RX packets:10843859 errors:0 dropped:0 overruns:0 frame:0
> TX packets:8089839 errors:0 dropped:15 overruns:0 carrier:0
> collisions:0 txqueuelen:1024 
> RX bytes:939249464 (895.7 MiB) TX bytes:886054008 (845.0 MiB)
> 
> lo Link encap:Local Loopback 
> inet addr:127.0.0.1 Mask:255.0.0.0
> UP LOOPBACK RUNNING MTU:16436 Metric:1
> RX packets:31235107 errors:0 dropped:0 overruns:0 frame:0
> TX packets:31235107 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0 
> RX bytes:132750916041 (123.6 GiB) TX bytes:132750916041 (123.6 GiB)
> 
> 
> 
> 
> Tue, 26 Aug 2014 09:48:35 -0700 from Ralph Castain:
> 
> I think something may be messed up with your installation. I went ahead and 
> tested this on a Slurm 2.5.4 cluster, and got the following:
> 
> $ time mpirun -np 1 --host bend001 ./hello
> Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12
> 
> real  0m0.086s
> user  0m0.039s
> sys   0m0.046s
> 
> $ time mpirun -np 1 --host bend002 ./hello
> Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12
> 
> real  0m0.528s
> user  0m0.021s
> sys   0m0.023s
> 
> Which is what I would have expected. With --host set to the local host, no 
> daemons are being launched and so the time is quite short (just spent mapping 
> and fork/exec). With --host set to a single remote host, you have the time it 
> takes Slurm to launch our daemon on the remote host, so you get about half of 
> a second.
> 
> IIRC, you were having some problems with the OOB setup. If you specify the 
> TCP interface to use, does your time come down?
> 
> 
> On Aug 26, 2014, at 8:32 AM, Timur Ismagilov  wrote:
> 
>> I'm using slurm 2.5.6
>> 
>> $salloc -N8 --exclusive -J ompi -p test
>> 
>> $ srun hostname
>> node1-128-21
>> node1-128-24
>> node1-128-22
>> node1-128-26
>> node1-128-27

Re: [OMPI users] long initialization

2014-08-27 Thread Timur Ismagilov

When I try to specify the OOB interface with --mca oob_tcp_if_include <interface name from ifconfig>, I always get this error:
$ mpirun  --mca oob_tcp_if_include ib0 -np 1 ./hello_c
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
-

Earlier, with OMPI 1.8.1, I could not run MPI jobs without "--mca 
oob_tcp_if_include ib0", but now (OMPI 1.9a1) I get the above error with 
this flag.

Here is the output of ifconfig:
$ ifconfig
eth1 Link encap:Ethernet HWaddr 00:15:17:EE:89:E1 
inet addr:10.0.251.53 Bcast:10.0.251.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:215087433 errors:0 dropped:0 overruns:0 frame:0
TX packets:2648 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000 
RX bytes:26925754883 (25.0 GiB) TX bytes:137971 (134.7 KiB)
Memory:b2c0-b2c2
eth2 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8 
inet addr:10.0.0.4 Bcast:10.0.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:4892833125 errors:0 dropped:0 overruns:0 frame:0
TX packets:8708606918 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000 
RX bytes:1823986502132 (1.6 TiB) TX bytes:11957754120037 (10.8 TiB)
eth2.911 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8 
inet addr:93.180.7.38 Bcast:93.180.7.63 Mask:255.255.255.224
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3746454225 errors:0 dropped:0 overruns:0 frame:0
TX packets:1131917608 errors:0 dropped:3 overruns:0 carrier:0
collisions:0 txqueuelen:0 
RX bytes:285174723322 (265.5 GiB) TX bytes:11523163526058 (10.4 TiB)
eth3 Link encap:Ethernet HWaddr 00:02:C9:04:73:F9 
inet addr:10.2.251.14 Bcast:10.2.251.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:591156692 errors:0 dropped:56 overruns:56 frame:56
TX packets:679729229 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000 
RX bytes:324195989293 (301.9 GiB) TX bytes:770299202886 (717.3 GiB)
Ifconfig uses the ioctl access method to get the full address information, 
which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed 
correctly.
Ifconfig is obsolete! For replacement check ip.
ib0 Link encap:InfiniBand HWaddr 
80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 
inet addr:10.128.0.4 Bcast:10.128.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:10843859 errors:0 dropped:0 overruns:0 frame:0
TX packets:8089839 errors:0 dropped:15 overruns:0 carrier:0
collisions:0 txqueuelen:1024 
RX bytes:939249464 (895.7 MiB) TX bytes:886054008 (845.0 MiB)
lo Link encap:Local Loopback 
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:31235107 errors:0 dropped:0 overruns:0 frame:0
TX packets:31235107 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0 
RX bytes:132750916041 (123.6 GiB) TX bytes:132750916041 (123.6 GiB)



Tue, 26 Aug 2014 09:48:35 -0700 from Ralph Castain:
>I think something may be messed up with your installation. I went ahead and 
>tested this on a Slurm 2.5.4 cluster, and got the following:
>
>$ time mpirun -np 1 --host bend001 ./hello
>Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12
>
>real 0m0.086s
>user 0m0.039s
>sys 0m0.046s
>
>$ time mpirun -np 1 --host bend002 ./hello
>Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12
>
>real 0m0.528s
>user 0m0.021s
>sys 0m0.023s
>
>Which is what I would have expected. With --host set to the local host, no 
>daemons are being launched and so the time is quite short (just spent mapping 
>and fork/exec). With --host set to a single remote host, you have the time it 
>takes Slurm to launch our daemon on the remote host, so you get about half of 
>a second.
>
>IIRC, you were having some problems with the OOB setup. If you specify the TCP 
>interface to use, does your time come down?
>
>
>On Aug 26, 2014, at 8:32 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>I'm using slurm 2.5.6
>>
>>$salloc -N8 --exclusive -J ompi -p test
>>$ srun hostname
>>node1-128-21
>>node1-128-24
>>node1-128-22
>>node1-128-26
>>node1-128-27
>>node1-128-20
>>node1-128-25
>>node1-128-23
>>$ time mpirun -np 1 --host node1-128-21 ./hello_c
>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
>>21, 2014 (nightly snapshot tarball), 146)
>>real 1m3.932s
>>user 0m0.035s
>>sys 0m0.072s
>>
>>
>>Tue, 26 Aug 2014 07:03:58 -0700 from Ralph Castain < 

Re: [OMPI users] long initialization

2014-08-26 Thread Ralph Castain
I think something may be messed up with your installation. I went ahead and 
tested this on a Slurm 2.5.4 cluster, and got the following:

$ time mpirun -np 1 --host bend001 ./hello
Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12

real    0m0.086s
user    0m0.039s
sys     0m0.046s

$ time mpirun -np 1 --host bend002 ./hello
Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12

real    0m0.528s
user    0m0.021s
sys     0m0.023s

Which is what I would have expected. With --host set to the local host, no 
daemons are being launched and so the time is quite short (just spent mapping 
and fork/exec). With --host set to a single remote host, you have the time it 
takes Slurm to launch our daemon on the remote host, so you get about half of a 
second.

IIRC, you were having some problems with the OOB setup. If you specify the TCP 
interface to use, does your time come down?
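
As a concrete sketch of "specify the TCP interface" (oob_tcp_if_include and the
ib0 interface name are taken from elsewhere in this thread; the OMPI_MCA_*
environment form is the standard way to set the same MCA parameter and is shown
here only as an assumed alternative):

$ time mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c

or, equivalently:

$ export OMPI_MCA_oob_tcp_if_include=ib0
$ time mpirun -np 1 ./hello_c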


On Aug 26, 2014, at 8:32 AM, Timur Ismagilov  wrote:

> I'm using slurm 2.5.6
> 
> $salloc -N8 --exclusive -J ompi -p test
> 
> $ srun hostname
> node1-128-21
> node1-128-24
> node1-128-22
> node1-128-26
> node1-128-27
> node1-128-20
> node1-128-25
> node1-128-23
> 
> $ time mpirun -np 1 --host node1-128-21 ./hello_c
> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
> semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
> 21, 2014 (nightly snapshot tarball), 146)
> 
> real 1m3.932s
> user 0m0.035s
> sys 0m0.072s
> 
> 
> 
> 
> Tue, 26 Aug 2014 07:03:58 -0700 from Ralph Castain:
> hmmm... what is your allocation like? Do you have a large hostfile, for 
> example?
> 
> if you add a --host argument that contains just the local host, what is the 
> time for that scenario?
> 
> On Aug 26, 2014, at 6:27 AM, Timur Ismagilov  wrote:
> 
>> Hello!
>> Here are my timing results:
>> 
>> $time mpirun -n 1 ./hello_c
>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>> semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
>> 21, 2014 (nightly snapshot tarball), 146)
>> 
>> real 1m3.985s
>> user 0m0.031s
>> sys 0m0.083s
>> 
>> 
>> 
>> 
>> Fri, 22 Aug 2014 07:43:03 -0700 from Ralph Castain:
>> I'm also puzzled by your timing statement - I can't replicate it:
>> 
>> 07:41:43  $ time mpirun -n 1 ./hello_c
>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI rhc@bend001 
>> Distribution, ident: 1.9a1r32577, repo rev: r32577, Unreleased developer 
>> copy, 125)
>> 
>> real 0m0.547s
>> user 0m0.043s
>> sys  0m0.046s
>> 
>> The entire thing ran in 0.5 seconds
>> 
>> 
>> On Aug 22, 2014, at 6:33 AM, Mike Dubman  wrote:
>> 
>>> Hi,
>>> The default delimiter is ";". You can change the delimiter with 
>>> mca_base_env_list_delimiter.
>>> 
>>> 
>>> 
>>> On Fri, Aug 22, 2014 at 2:59 PM, Timur Ismagilov  wrote:
>>> Hello!
>>> If I use the latest nightly snapshot:
>>> $ ompi_info -V
>>> Open MPI v1.9a1r32570
>>> 
>>> In the program hello_c, initialization takes ~1 min.
>>> In OMPI 1.8.2rc4 and earlier it takes ~1 sec (or less).
>>> If I use 
>>> $mpirun  --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' 
>>> --map-by slot:pe=8 -np 1 ./hello_c
>>> I get the error 
>>> config_parser.c:657  MXM  ERROR Invalid value for SHM_KCOPY_MODE: 
>>> 'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect]
>>> but with -x everything works fine (though with a warning):
>>> $mpirun  -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c
>>> WARNING: The mechanism by which environment variables are explicitly
>>> ..
>>> ..
>>> ..
>>> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>>> semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
>>> 21, 2014 (nightly snapshot tarball), 146)
>>> 
>>> 
>>> Thu, 21 Aug 2014 06:26:13 -0700 from Ralph Castain:
>>> Not sure I understand. The problem has been fixed in both the trunk and the 
>>> 1.8 branch now, so you should be able to work with either of those nightly 
>>> builds.
>>> 
>>> On Aug 21, 2014, at 12:02 AM, Timur Ismagilov  wrote:
>>> 
 Do I have any way to run MPI jobs?
 
 
 Wed, 20 Aug 2014 10:48:38 -0700 from Ralph Castain:
 yes, i know - it is cmr'd
 
 On Aug 20, 2014, at 10:26 AM, Mike Dubman  wrote:
 
> btw, we get same error in v1.8 branch as well.
> 
> 
> On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain  wrote:
> It was not yet fixed - but should be now.
> 
> On Aug 20, 2014, at 6:39 AM, Timur Ismagilov  wrote:
> 
>> Hello!
>> 
>> As far as I can see, the bug is fixed, but in Open MPI v1.9a1r32516 I still 
>> have the problem
>> 
>> a)
>> $ mpirun  -np 1 ./hello_c
>> 
>> 

Re: [OMPI users] long initialization

2014-08-26 Thread Timur Ismagilov

Hello!
Here are my timing results:
$time mpirun -n 1 ./hello_c
Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 21, 
2014 (nightly snapshot tarball), 146)
real 1m3.985s
user 0m0.031s
sys 0m0.083s


Fri, 22 Aug 2014 07:43:03 -0700 from Ralph Castain:
>I'm also puzzled by your timing statement - I can't replicate it:
>
>07:41:43  $ time mpirun -n 1 ./hello_c
>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI rhc@bend001 
>Distribution, ident: 1.9a1r32577, repo rev: r32577, Unreleased developer copy, 
>125)
>
>real 0m0.547s
>user 0m0.043s
>sys 0m0.046s
>
>The entire thing ran in 0.5 seconds
>
>
>On Aug 22, 2014, at 6:33 AM, Mike Dubman < mi...@dev.mellanox.co.il > wrote:
>>Hi,
>>The default delimiter is ";". You can change the delimiter with 
>>mca_base_env_list_delimiter.
>>
>>
>>
>>On Fri, Aug 22, 2014 at 2:59 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>wrote:
>>>Hello!
>>>If I use the latest nightly snapshot:
>>>$ ompi_info -V
>>>Open MPI v1.9a1r32570
>>>*  In the program hello_c, initialization takes ~1 min.
>>>In OMPI 1.8.2rc4 and earlier it takes ~1 sec (or less).
>>>*  If I use 
>>>$mpirun  --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' 
>>>--map-by slot:pe=8 -np 1 ./hello_c
>>>I get the error 
>>>config_parser.c:657  MXM  ERROR Invalid value for SHM_KCOPY_MODE: 
>>>'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect]
>>>but with -x everything works fine (though with a warning):
>>>$mpirun  -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c
>>>WARNING: The mechanism by which environment variables are explicitly
>>>..
>>>..
>>>..
>>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>>>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
>>>21, 2014 (nightly snapshot tarball), 146)
>>>Thu, 21 Aug 2014 06:26:13 -0700 from Ralph Castain < r...@open-mpi.org >:
Not sure I understand. The problem has been fixed in both the trunk and the 
1.8 branch now, so you should be able to work with either of those nightly 
builds.

On Aug 21, 2014, at 12:02 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>Do I have any way to run MPI jobs?
>
>
>Wed, 20 Aug 2014 10:48:38 -0700 from Ralph Castain < r...@open-mpi.org >:
>>yes, i know - it is cmr'd
>>
>>On Aug 20, 2014, at 10:26 AM, Mike Dubman < mi...@dev.mellanox.co.il > 
>>wrote:
>>>btw, we get same error in v1.8 branch as well.
>>>
>>>
>>>On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain   < r...@open-mpi.org >   
>>>wrote:
It was not yet fixed - but should be now.

On Aug 20, 2014, at 6:39 AM, Timur Ismagilov < tismagi...@mail.ru > 
wrote:
>Hello!
>
>As far as I can see, the bug is fixed, but in Open MPI v1.9a1r32516 I still 
>have the problem
>
>a)
>$ mpirun  -np 1 ./hello_c
>--
>An ORTE daemon has unexpectedly failed after launch and before
>communicating back to mpirun. This could be caused by a number
>of factors, including an inability to create a connection back
>to mpirun due to a lack of common network interfaces and/or no
>route found between them. Please check network connectivity
>(including firewalls and network routing requirements).
>--
>b)
>$ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>--
>An ORTE daemon has unexpectedly failed after launch and before
>communicating back to mpirun. This could be caused by a number
>of factors, including an inability to create a connection back
>to mpirun due to a lack of common network interfaces and/or no
>route found between them. Please check network connectivity
>(including firewalls and network routing requirements).
>--
>
>c)
>
>$ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca 
>plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 
>-np 1 ./hello_c
>[compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
>[compiler-2:14673] mca:base:select:( plm) Query of component 
>[isolated] set priority to 0
>[compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
>[compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set 
>priority to 10
>[compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
>[compiler-2:14673] mca:base:select:( plm) Query of component 

Re: [OMPI users] long initialization

2014-08-22 Thread Ralph Castain
I'm also puzzled by your timing statement - I can't replicate it:

07:41:43  $ time mpirun -n 1 ./hello_c
Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI rhc@bend001 
Distribution, ident: 1.9a1r32577, repo rev: r32577, Unreleased developer copy, 
125)

real    0m0.547s
user    0m0.043s
sys     0m0.046s

The entire thing ran in 0.5 seconds


On Aug 22, 2014, at 6:33 AM, Mike Dubman  wrote:

> Hi,
> The default delimiter is ";". You can change the delimiter with 
> mca_base_env_list_delimiter.
> 
> 
> 
> On Fri, Aug 22, 2014 at 2:59 PM, Timur Ismagilov  wrote:
> Hello!
> If I use the latest nightly snapshot:
> $ ompi_info -V
> Open MPI v1.9a1r32570
> 
> In the program hello_c, initialization takes ~1 min.
> In OMPI 1.8.2rc4 and earlier it takes ~1 sec (or less).
> If I use 
> $mpirun  --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' 
> --map-by slot:pe=8 -np 1 ./hello_c
> I get the error 
> config_parser.c:657  MXM  ERROR Invalid value for SHM_KCOPY_MODE: 
> 'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect]
> but with -x everything works fine (though with a warning):
> $mpirun  -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c
> WARNING: The mechanism by which environment variables are explicitly
> ..
> ..
> ..
> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
> semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
> 21, 2014 (nightly snapshot tarball), 146)
> 
> 
> Thu, 21 Aug 2014 06:26:13 -0700 from Ralph Castain:
> Not sure I understand. The problem has been fixed in both the trunk and the 
> 1.8 branch now, so you should be able to work with either of those nightly 
> builds.
> 
> On Aug 21, 2014, at 12:02 AM, Timur Ismagilov  wrote:
> 
>> Do I have any way to run MPI jobs?
>> 
>> 
>> Wed, 20 Aug 2014 10:48:38 -0700 from Ralph Castain:
>> yes, i know - it is cmr'd
>> 
>> On Aug 20, 2014, at 10:26 AM, Mike Dubman  wrote:
>> 
>>> btw, we get same error in v1.8 branch as well.
>>> 
>>> 
>>> On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain  wrote:
>>> It was not yet fixed - but should be now.
>>> 
>>> On Aug 20, 2014, at 6:39 AM, Timur Ismagilov  wrote:
>>> 
 Hello!
 
 As far as I can see, the bug is fixed, but in Open MPI v1.9a1r32516 I still have 
 the problem
 
 a)
 $ mpirun  -np 1 ./hello_c
 
 --
 An ORTE daemon has unexpectedly failed after launch and before
 communicating back to mpirun. This could be caused by a number
 of factors, including an inability to create a connection back
 to mpirun due to a lack of common network interfaces and/or no
 route found between them. Please check network connectivity
 (including firewalls and network routing requirements).
 --
 
 b)
 $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
 --
 An ORTE daemon has unexpectedly failed after launch and before
 communicating back to mpirun. This could be caused by a number
 of factors, including an inability to create a connection back
 to mpirun due to a lack of common network interfaces and/or no
 route found between them. Please check network connectivity
 (including firewalls and network routing requirements).
 --
 
 c)
 
 $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca 
 plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 
 ./hello_c
 
 [compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
 [compiler-2:14673] mca:base:select:( plm) Query of component [isolated] 
 set priority to 0
 [compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
 [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set 
 priority to 10
 [compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
 [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set 
 priority to 75
 [compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
 [compiler-2:14673] mca: base: components_register: registering oob 
 components
 [compiler-2:14673] mca: base: components_register: found loaded component 
 tcp
 [compiler-2:14673] mca: base: components_register: component tcp register 
 function successful
 [compiler-2:14673] mca: base: components_open: opening oob components
 [compiler-2:14673] mca: base: components_open: found loaded component tcp
 [compiler-2:14673] mca: base: components_open: component 

Re: [OMPI users] long initialization

2014-08-22 Thread Mike Dubman
Hi,
The default delimiter is ";". You can change the delimiter with
mca_base_env_list_delimiter.
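
For example, the failing command quoted below would become (a sketch: the same
variables, with only the "," swapped for the default ";" delimiter, kept inside
single quotes so the shell does not split the argument):

$ mpirun --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off;OMP_NUM_THREADS=8' \
  --map-by slot:pe=8 -np 1 ./hello_c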



On Fri, Aug 22, 2014 at 2:59 PM, Timur Ismagilov  wrote:

> Hello!
> If I use the latest nightly snapshot:
>
> $ ompi_info -V
> Open MPI v1.9a1r32570
>
>    1. In the program hello_c, initialization takes ~1 min.
>    In OMPI 1.8.2rc4 and earlier it takes ~1 sec (or less).
>    2. If I use
>    $mpirun  --mca mca_base_env_list
>    'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' --map-by slot:pe=8 -np 1
>    ./hello_c
>    I get the error
>    config_parser.c:657  MXM  ERROR Invalid value for SHM_KCOPY_MODE:
>    'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect]
>    but with -x everything works fine (though with a warning):
>$mpirun  -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c
>
>WARNING: The mechanism by which environment variables are explicitly
>..
>..
>..
>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI
>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570,
>Aug 21, 2014 (nightly snapshot tarball), 146)
>
>
> Thu, 21 Aug 2014 06:26:13 -0700 from Ralph Castain:
>
>   Not sure I understand. The problem has been fixed in both the trunk and
> the 1.8 branch now, so you should be able to work with either of those
> nightly builds.
>
> On Aug 21, 2014, at 12:02 AM, Timur Ismagilov wrote:
>
> Do I have any way to run MPI jobs?
>
>
> Wed, 20 Aug 2014 10:48:38 -0700 from Ralph Castain:
>
> yes, i know - it is cmr'd
>
> On Aug 20, 2014, at 10:26 AM, Mike Dubman 
> wrote:
>
> btw, we get same error in v1.8 branch as well.
>
>
> On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain  wrote:
>
> It was not yet fixed - but should be now.
>
> On Aug 20, 2014, at 6:39 AM, Timur Ismagilov  wrote:
>
> Hello!
>
> As far as I can see, the bug is fixed, but in Open MPI v1.9a1r32516 I still have
> the problem
>
> a)
> $ mpirun  -np 1 ./hello_c
>
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --
>
> b)
> $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --
>
> c)
>
> $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca
> plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1
> ./hello_c
>
> [compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
> [compiler-2:14673] mca:base:select:( plm) Query of component [isolated]
> set priority to 0
> [compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
> [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set
> priority to 10
> [compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
> [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set
> priority to 75
> [compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
> [compiler-2:14673] mca: base: components_register: registering oob
> components
> [compiler-2:14673] mca: base: components_register: found loaded component
> tcp
> [compiler-2:14673] mca: base: components_register: component tcp register
> function successful
> [compiler-2:14673] mca: base: components_open: opening oob components
> [compiler-2:14673] mca: base: components_open: found loaded component tcp
> [compiler-2:14673] mca: base: components_open: component tcp open function
> successful
> [compiler-2:14673] mca:oob:select: checking available component tcp
> [compiler-2:14673] mca:oob:select: Querying component [tcp]
> [compiler-2:14673] oob:tcp: component_available called
> [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
> [compiler-2:14673]