[OMPI users] spml_ikrit_np random values

2014-06-05 Thread Timur Ismagilov

Hello!
I am using Open MPI v1.8.1.

$oshmem_info -a --parsable | grep spml_ikrit_np
mca:spml:ikrit:param:spml_ikrit_np:value:1620524368  (always a new value)
mca:spml:ikrit:param:spml_ikrit_np:source:default
mca:spml:ikrit:param:spml_ikrit_np:status:writeable
mca:spml:ikrit:param:spml_ikrit_np:level:9
mca:spml:ikrit:param:spml_ikrit_np:help:[integer] Minimal allowed job's NP to 
activate ikrit
mca:spml:ikrit:param:spml_ikrit_np:deprecated:no
mca:spml:ikrit:param:spml_ikrit_np:type:int
mca:spml:ikrit:param:spml_ikrit_np:disabled:false
Why does spml_ikrit_np get a new value each time?
Regards,
Timur
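(If a stable threshold is wanted instead of the changing default shown above, the parameter can be pinned explicitly; a minimal sketch, using an arbitrary illustrative value of 2, either on the command line or through the MCA environment prefix:)
$ oshrun --mca spml_ikrit_np 2 -np 4 ./hello_oshmem
$ export OMPI_MCA_spml_ikrit_np=2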




Re: [OMPI users] [warn] Epoll ADD(1) on fd 0 failed

2014-06-06 Thread Timur Ismagilov
Sometimes, after termination of a program launched with the command 
"sbatch ... -o myprogram.out .", no "myprogram.out" file is produced. 
Could this be due to the above-mentioned problem?


Thu, 5 Jun 2014 07:45:01 -0700 from Ralph Castain:
>FWIW: support for the --resv-ports option was deprecated and removed on the 
>OMPI side a long time ago.
>
>I'm not familiar enough with "oshrun" to know if it is doing anything unusual 
>- I believe it is just a renaming of our usual "mpirun". I suspect this is 
>some interaction with sbatch, but I'll take a look. I haven't seen that 
>warning. Mike indicated he thought it is due to both slurm and OMPI trying to 
>control stdin/stdout, in which case it shouldn't be happening, but you can 
>safely ignore it.
>
>
>On Jun 5, 2014, at 3:04 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>I use this cmd line:
>>$sbatch -p test --exclusive -N 2 -o hello_oshmem.out -e hello_oshmem.err 
>>shrun_mxm3.0 ./hello_oshmem
>>
>>where script shrun_mxm3.0:
>>$cat shrun_mxm3.0
>>  #!/bin/sh
>>  #srun --resv-ports "$@"
>>  #exit $?
>>  [ x"$TMPDIR" == x"" ] && TMPDIR=/tmp
>>  HOSTFILE=${TMPDIR}/hostfile.${SLURM_JOB_ID}
>>  srun hostname -s|sort|uniq -c|awk '{print $2" slots="$1}' > $HOSTFILE || { 
>>rm -f $HOSTFILE; exit 255; }
>>  
>>LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
>> oshrun -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --hostfile $HOSTFILE "$@"
>>
>>  rc=$?
>>  rm -f $HOSTFILE
>>  exit $rc
>>I configured openmpi using
>>./configure CC=icc CXX=icpc F77=ifort FC=ifort 
>>--prefix=/mnt/data/users/dm2/vol3/semenov/_scratch/openmpi-1.8.1_mxm-3.0 
>>--with-mxm=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/ --with-slurm 
>>--with-platform=contrib/platform/mellanox/optimized
>>
>>
>>Fri, 30 May 2014 07:09:54 -0700 from Ralph Castain < r...@open-mpi.org >:
>>>Can you pass along the cmd line that generated that output, and how OMPI was 
>>>configured?
>>>
>>>On May 30, 2014, at 5:11 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>Hello!
>>>>I am using Open MPI v1.8.1 and slurm 2.5.6.
>>>>I get these messages when I try to run the example program (hello_oshmem.cpp):
>>>>[warn] Epoll ADD(1) on fd 0 failed. Old events were 0; read change was 1 
>>>>(add); write change was 0 (none): Operation not permitted
>>>>[warn] Epoll ADD(4) on fd 1 failed. Old events were 0; read change was 0 
>>>>(none); write change was 1 (add): Operation not permitted
>>>>Hello, world, I am 0 of 2
>>>>Hello, world, I am 1 of 2
>>>>What do these warnings mean?
>>>>I launch this job using sbatch and mpirun with a hostfile (generated from:  
>>>>$srun hostname -s|sort|uniq -c|awk '{print $2" slots="$1}' > $HOSTFILE)
>>>>Regards,
>>>>Timur
>>
>>
>>





[OMPI users] Problem with yoda component in oshmem.

2014-06-06 Thread Timur Ismagilov

Hello!
I am using Open MPI v1.8.1 with the example program hello_oshmem.cpp.
When I set spml_ikrit_np = 1000 (more than 4) and run the task on 4 (2, 1) nodes, 
I get:
in the out file: 
No available spml components were found!
This means that there are no components of this type installed on your
system or all the components reported that they could not be used.
This is a fatal error; your SHMEM process is likely to abort. Check the
output of the "ompi_info" command and ensure that components of this
type are available on your system. You may also wish to check the
value of the "component_path" MCA parameter and ensure that it has at
least one directory that contains valid MCA components
in the err file:
[node1-128-31:05405] SPML ikrit cannot be selected
Regards,
Timur
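(A hedged workaround sketch: either keep spml_ikrit_np at or below the job size, or name an SPML explicitly, assuming the yoda component is installed; the values below are illustrative:)
$ oshmem_info | grep 'MCA spml'            # list the installed SPML components
$ oshrun --mca spml_ikrit_np 4 -np 4 ./hello_oshmem
$ oshrun --mca spml yoda -np 4 ./hello_oshmem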



[OMPI users] openMP and mpi problem

2014-07-02 Thread Timur Ismagilov

Hello!
I have Open MPI 1.9a1r32104 and Open MPI 1.5.5.
I get much better performance with Open MPI 1.5.5 and OpenMP on 8 cores
in the following program:


#define N 1000
int main(int argc, char *argv[]) {
    ...
    MPI_Init(&argc, &argv);
    ...
    for (i = 0; i < N; i++) {
        a[i] = i * 1.0;
        b[i] = i * 2.0;
    }

#pragma omp parallel for shared(a, b, c) private(i)
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }
    ...
    MPI_Finalize();
}
I got on 1 node 
(for i in 1 2 4 8 ; do export OMP_NUM_THREADS=$i; sbatch -p test -t 5 
--exclusive -N 1 -o hybrid-hello_omp$i.out -e hybrid-hello_omp$i.err 
ompi_mxm3.0 ./hybrid-hello; done)
*   open mpi 1.5.5 (Data for node: node1-128-17 Num slots: 8 Max slots: 0): 
*  8 threads 0.014527 sec
*  4 threads 0.016138 sec
*  2 threads 0.018764 sec
*  1 thread   0.029963 sec
*  openmpi 1.9a1r32104 ( node1-128-29: slots=8 max_slots=0 slots_inuse=0 
state=UP ):
*  8 threads 0.035055 sec
*  4 threads 0.029859 sec 
*  2 threads 0.019564 sec  (same as  open mpi 1.5.5 )
*  1 thread   0.028394 sec (same as  open mpi 1.5.5 )
So it looks like Open MPI 1.9 uses only 2 of the 8 cores.

What can I do about this?

$cat ompi_mxm3.0
#!/bin/sh
[ x"$TMPDIR" == x"" ] && TMPDIR=/tmp
HOSTFILE=${TMPDIR}/hostfile.${SLURM_JOB_ID}
srun hostname -s|sort|uniq -c|awk '{print $2" slots="$1}' > $HOSTFILE || { rm 
-f $HOSTFILE; exit 255; }
LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so 
mpirun -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --display-allocation --hostfile 
$HOSTFILE "$@"
rc=$?
rm -f $HOSTFILE
exit $rc

For Open MPI 1.5.5 I remove LD_PRELOAD from the run script.
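(A hedged way to check how many cores each rank actually receives is mpirun's --report-bindings option, e.g.:)
$ mpirun --report-bindings -np 1 ./hybrid-hello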

Re: [OMPI users] openMP and mpi problem

2014-07-03 Thread Timur Ismagilov




When I used --map-by slot:pe=8, I got the same message:

Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation. This can happen if you request a map type
(e.g., loadbalance) and the corresponding mapper was not built.
...

Wed, 2 Jul 2014 07:36:48 -0700 from Ralph Castain:
>Let's keep this on the user list so others with similar issues can find it.
>
>My guess is that the $OMP_NUM_THREADS syntax isn't quite right, so it didn't 
>pick up the actual value there. Since it doesn't hurt to have extra cpus, just 
>set it to 8 for your test case and that should be fine, so adding a little 
>clarity:
>
>--map-by slot:pe=8
>
>I'm not aware of any slurm utility similar to top, but there is no reason you 
>can't just submit this as an interactive job and use top itself, is there?
>
>As for that sbgp warning - you can probably just ignore it. Not sure why that 
>is failing, but it just means that component will disqualify itself. If you 
>want to eliminate it, just add
>
>-mca sbgp ^ibnet
>
>to your cmd line
>
>
>On Jul 2, 2014, at 7:29 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>Thanks, Ralph!
>>With '--map-by :pe=$OMP_NUM_THREADS' I got:
>>--
>>Your job failed to map. Either no mapper was available, or none
>>of the available mappers was able to perform the requested
>>mapping operation. This can happen if you request a map type
>>(e.g., loadbalance) and the corresponding mapper was not built.
>>
>>What does it mean?
>>With '--bind-to socket' everything looks better, but performance is still 
>>worse (though better than it was):
>>*  1 thread 0.028 sec
>>*  2 thread 0.018 sec
>>*  4 thread 0.020 sec 
>>*  8 thread 0.021 sec
>>Is there a utility similar to 'top' that I can use with sbatch?
>>
>>Also, every time I get this message in OMPI 1.9:
>>mca: base: components_register: component sbgp / ibnet register function 
>>failed
>>Is it bad?
>>
>>Regards, 
>>Timur
>>
>>Wed, 2 Jul 2014 05:53:44 -0700 from Ralph Castain < r...@open-mpi.org >:
>>>OMPI started binding by default during the 1.7 series. You should add the 
>>>following to your cmd line:
>>>
>>>--map-by :pe=$OMP_NUM_THREADS
>>>
>>>This will give you a dedicated core for each thread. Alternatively, you 
>>>could instead add
>>>
>>>--bind-to socket
>>>
>>>OMPI 1.5.5 doesn't bind at all unless directed to do so, which is why you 
>>>are getting the difference in behavior.
>>>
>>>
>>>On Jul 2, 2014, at 12:33 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>Hello!
>>>>I have open mpi 1.9a1r32104 and open mpi 1.5.5.
>>>>I have much better performance in Open MPI 1.5.5 with OpenMP on 8 cores
>>>>in  the program:
>>>>
>>>>
>>>>#define N 1000
>>>>
>>>>int main(int argc, char *argv[]) {
>>>>...
>>>>MPI_Init(&argc, &argv);
>>>>...
>>>>for (i = 0; i < N; i++) {
>>>>a[i] = i * 1.0;
>>>>b[i] = i * 2.0;
>>>>}
>>>>
>>>>#pragma omp parallel for shared(a, b, c) private(i)
>>>>for (i = 0; i < N; i++) {
>>>>c[i] = a[i] + b[i];
>>>>}
>>>>.
>>>>MPI_Finalize();
>>>>}
>>>>I got on 1 node  
>>>>(for i in 1 2 4 8 ; do export OMP_NUM_THREADS=$i; sbatch -p test -t 5 
>>>>--exclusive -N 1 -o hybrid-hello_omp$i.out -e hybrid-hello_omp$i.err 
>>>>ompi_mxm3.0 ./hybrid-hello; done)
>>>>
>>>>*   open mpi 1.5.5 (Data for node: node1-128-17 Num slots: 8 Max slots: 
>>>>0): 
>>>>*  8 threads 0.014527 sec
>>>>*  4 threads 0.016138 sec
>>>>*  2 threads 0.018764 sec
>>>>*  1 thread   0.029963 sec
>>>>*  openmpi 1.9a1r32104 ( node1-128-29: slots=8 max_slots=0 slots_inuse=0 
>>>>state=UP ):
>>>>*  8 threads 0.035055 sec
>>>>*  4 threads 0.029859 sec 
>>>>*  2 threads 0.019564 sec  (same as  open mpi 1.5.5 )
>>>>*  1 thread   0.028394 sec (same as   open mpi 1.5.5 )
>>>>So it looks like Open MPI 1.9 uses only 2 of the 8 cores.
>>>>
>>>>What can I do about this?
>>>>
>>>>$cat ompi_mxm3.0
>>>>#!/bin/sh
>>>>[ x"$TMPDIR" == x"" ] && TMPDIR=/tmp
>>>>HOSTFILE=${TMPDIR}/hostfile.${SLURM_JOB_ID}
>>>>srun hostname -s|sort|uniq -c|awk '{print $2" slots="$1}' > $HOSTFILE || { 
>>>>rm -f $HOSTFILE; exit 255; }
>>>>LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
>>>> mpirun -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --display-allocation 
>>>>--hostfile $HOSTFILE "$@"
>>>>rc=$?
>>>>rm -f $HOSTFILE
>>>>exit $rc
>>>>
>>>>For Open MPI 1.5.5 I remove LD_PRELOAD from the run script. 
>>>>___
>>>>users mailing list
>>>>us...@open-mpi.org
>>>>Subscription:   http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>Link to this post:   
>>>>http://www.open-mpi.org/community/lists/users/2014/07/24738.php
>>
>>
>>
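(Combining the advice above, a minimal sketch of a launch line that reserves one core per OpenMP thread, assuming 8 threads per rank:)
$ export OMP_NUM_THREADS=8
$ mpirun --map-by slot:pe=8 --report-bindings -np 1 ./hybrid-hello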




--




Re: [OMPI users] openMP and mpi problem

2014-07-04 Thread Timur Ismagilov
[node1-128-29:21569] mca:rmaps:resilient: cannot perform initial map of job 
[25027,1] - no fault groups
[node1-128-29:21569] mca:rmaps:mindist: job [25027,1] not using mindist mapper
[node1-128-29:21569] mca:rmaps:rr: mapping job [25027,1]
[node1-128-29:21569] AVAILABLE NODES FOR MAPPING:
[node1-128-29:21569] node: node1-128-29 daemon: 0
[node1-128-29:21569] mca:rmaps:rr: mapping no-span by Core for job [25027,1] 
slots 1 num_procs 1
[node1-128-29:21569] mca:rmaps:rr: found 8 Core objects on node node1-128-29
[node1-128-29:21569] mca:rmaps:rr: calculated nprocs 1
[node1-128-29:21569] mca:rmaps:rr: assigning nprocs 1
[node1-128-29:21569] mca:rmaps:rr: assigning proc to object 0
[node1-128-29:21569] mca:rmaps:base: computing vpids by slot for job [25027,1]
[node1-128-29:21569] mca:rmaps:base: assigning rank 0 to node node1-128-29
[node1-128-29:21569] mca:rmaps: compute bindings for job [25027,1] with policy 
CORE
[node1-128-29:21569] mca:rmaps: bindings for job [25027,1] - bind in place
[node1-128-29:21569] mca:rmaps: bind in place for job [25027,1] with bindings 
CORE
[node1-128-29:21569] [[25027,0],0] reset_usage: node node1-128-29 has 1 procs 
on it
[node1-128-29:21569] [[25027,0],0] reset_usage: ignoring proc [[25027,1],0]
[node1-128-29:21569] BINDING PROC [[25027,1],0] TO Core NUMBER 0
[node1-128-29:21569] [[25027,0],0] BOUND PROC [[25027,1],0] TO 0,8[Core:0] on 
node node1-128-29
[node1-128-29:21571] mca: base: components_register: component sbgp / ibnet 
register function failed
Main 21.366504 secs total /1
Computation 21.048671 secs total /1000
[node1-128-29:21569] mca: base: close: unloading component lama
[node1-128-29:21569] mca: base: close: component mindist closed
[node1-128-29:21569] mca: base: close: unloading component mindist
[node1-128-29:21569] mca: base: close: component ppr closed
[node1-128-29:21569] mca: base: close: unloading component ppr
[node1-128-29:21569] mca: base: close: component rank_file closed
[node1-128-29:21569] mca: base: close: unloading component rank_file
[node1-128-29:21569] mca: base: close: component resilient closed
[node1-128-29:21569] mca: base: close: unloading component resilient
[node1-128-29:21569] mca: base: close: component round_robin closed
[node1-128-29:21569] mca: base: close: unloading component round_robin
[node1-128-29:21569] mca: base: close: component seq closed
[node1-128-29:21569] mca: base: close: unloading component seq
[node1-128-29:21569] mca: base: close: component staged closed
[node1-128-29:21569] mca: base: close: unloading component staged
Regards,
Timur.

Thu, 3 Jul 2014 06:10:26 -0700 from Ralph Castain:
>This looks to me like a message from some older version of OMPI. Please check 
>your LD_LIBRARY_PATH and ensure that the 1.9 installation is at the *front* of 
>that list.
>
>Of course, I'm also assuming that you installed the two versions into 
>different locations - yes?
>
>Also, add "--mca rmaps_base_verbose 20" to your cmd line - this will tell us 
>what mappers are being considered.
>
>
>On Jul 3, 2014, at 1:31 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>When I used --map-by slot:pe=8, I got the same message:
>>
>>Your job failed to map. Either no mapper was available, or none
>>of the available mappers was able to perform the requested
>>mapping operation. This can happen if you request a map type
>>(e.g., loadbalance) and the corresponding mapper was not built.
>>...
>>
>>Wed, 2 Jul 2014 07:36:48 -0700 from Ralph Castain < r...@open-mpi.org >:
>>>Let's keep this on the user list so others with similar issues can find it.
>>>
>>>My guess is that the $OMP_NUM_THREADS syntax isn't quite right, so it didn't 
>>>pick up the actual value there. Since it doesn't hurt to have extra cpus, 
>>>just set it to 8 for your test case and that should be fine, so adding a 
>>>little clarity:
>>>
>>>--map-by slot:pe=8
>>>
>>>I'm not aware of any slurm utility similar to top, but there is no reason 
>>>you can't just submit this as an interactive job and use top itself, is 
>>>there?
>>>
>>>As for that sbgp warning - you can probably just ignore it. Not sure why 
>>>that is failing, but it just means that component will disqualify itself. If 
>>>you want to eliminate it, just add
>>>
>>>-mca sbgp ^ibnet
>>>
>>>to your cmd line
>>>
>>>
>>>On Jul 2, 2014, at 7:29 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>Thanks, Ralph!
>>>>With '--map-by :pe=$OMP_NUM_THREADS'  i got:
>>>>--
>>>>Your job failed to map. Either no mapper was available, or none
>>>>of the ava

Re: [OMPI users] openMP and mpi problem

2014-07-04 Thread Timur Ismagilov


1. Intel MPI is located here: /opt/intel/impi/4.1.0/intel64/lib. I have added 
the OMPI path at the start and got the same output.
2. Here is my cmd line:
export OMP_NUM_THREADS=8; export 
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/mnt/data/users/dm2/vol3/semenov/_scratch/openmpi-1.9.0_mxm-3.0/lib;
 sbatch -p test -t 5 --exclusive -N 1 -o ./results/hybrid-hello_omp$i.out -e 
./results/hybrid-hello_omp$i.err ompi_mxm3.0 ./hybrid-hello; done
$ cat ompi_mxm3.0
#!/bin/sh
#srun --resv-ports "$@"
#exit $?
[ x"$TMPDIR" == x"" ] && TMPDIR=/tmp
HOSTFILE=${TMPDIR}/hostfile.${SLURM_JOB_ID}
srun hostname -s|sort|uniq -c|awk '{print $2" slots="$1}' > $HOSTFILE || { rm 
-f $HOSTFILE; exit 255; }
LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so 
mpirun -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --map-by slot:pe=8 --mca 
rmaps_base_verbose 20 --hostfile $HOSTFILE "$@"
rc=$?
rm -f $HOSTFILE
exit $rc
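(A quick hedged sanity check that the Open MPI build, and not the Intel MPI under /opt/intel, is the mpirun actually being picked up:)
$ which mpirun
$ mpirun --version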


Fri, 4 Jul 2014 07:06:34 -0700 from Ralph Castain:
>Hmmm...couple of things here:
>
>1. Intel packages Intel MPI in their compiler, and so there is in fact an 
>mpiexec and MPI libraries in your path before us. I would advise always 
>putting the OMPI path at the start of your path envars to avoid potential 
>conflict
>
>2. I'm having trouble understanding your command line because of all the 
>variable definitions. Could you please tell me what the mpirun cmd line is? I 
>suspect I know the problem, but need to see the actual cmd line to confirm it
>
>Thanks
>Ralph
>
>On Jul 4, 2014, at 1:38 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>There is only one path to mpi lib.
>>echo $LD_LIBRARY_PATH  
>>/opt/intel/composer_xe_2013.2.146/mkl/lib/intel64:/opt/intel/composer_xe_2013.2.146/compiler/lib/intel64:/home/users/semenov/BFD/lib:/home/users/semenov/local/lib:/usr/lib64/:/mnt/data/users/dm2/vol3/semenov/_scratch/openmpi-1.9.0_mxm-3.0/lib
>>
>>This one also looks correct.
>>$ldd hybrid-hello
>>linux-vdso.so.1 => (0x7fff8b983000)
>>libmpi.so.0 => 
>>/mnt/data/users/dm2/vol3/semenov/_scratch/openmpi-1.9.0_mxm-3.0/lib/libmpi.so.0
>> (0x7f58c95cb000)
>>libm.so.6 => /lib64/libm.so.6 (0x00338ac0)
>>libiomp5.so => 
>>/opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libiomp5.so 
>>(0x7f58c92a2000)
>>libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x00338d40)
>>libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00338cc0)
>>libpthread.so.0 => /lib64/libpthread.so.0 (0x00338b80)
>>libc.so.6 => /lib64/libc.so.6 (0x00338b00)
>>libdl.so.2 => /lib64/libdl.so.2 (0x00338b40)
>>libopen-rte.so.0 => 
>>/mnt/data/users/dm2/vol3/semenov/_scratch/openmpi-1.9.0_mxm-3.0/lib/libopen-rte.so.0
>> (0x7f58c9009000)
>>libopen-pal.so.0 => 
>>/mnt/data/users/dm2/vol3/semenov/_scratch/openmpi-1.9.0_mxm-3.0/lib/libopen-pal.so.0
>> (0x7f58c8d05000)
>>libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x7f58c8afb000)
>>librt.so.1 => /lib64/librt.so.1 (0x00338c00)
>>libnsl.so.1 => /lib64/libnsl.so.1 (0x00339380)
>>libutil.so.1 => /lib64/libutil.so.1 (0x00339b60)
>>libimf.so => /opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libimf.so 
>>(0x7f58c863e000)
>>libsvml.so => 
>>/opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libsvml.so 
>>(0x7f58c7c73000)
>>libirng.so => 
>>/opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libirng.so 
>>(0x7f58c7a6b000)
>>libintlc.so.5 => 
>>/opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5 
>>(0x7f58c781d000)
>>/lib64/ld-linux-x86-64.so.2 (0x00338a80)
>>Open MPI 1.5.5 was preinstalled to "/opt/mpi/openmpi-1.5.5-icc/".
>>
>>Here is the output after adding "--mca rmaps_base_verbose 20" and "--map-by 
>>slot:pe=8".
>>*  outfile:
>>--
>>Your job failed to map. Either no mapper was available, or none
>>of the available mappers was able to perform the requested
>>mapping operation. This can happen if you request a map type
>>(e.g., loadbalance) and the corresponding mapper was not built.
>>--
>>*  errfile:
>>[node1-128-29:21477] mca: base: components_register: registering rmaps 
>>components
>>[node1-128-29:21477] mca: base: components_register: found loaded component 
>>lama
>>[node1-128-29:21477] mca:rmaps:lama: Priority 0
>>[node1-128-29:21477] mca:rmaps:lama: Map : NULL

[OMPI users] Salloc and mpirun problem

2014-07-16 Thread Timur Ismagilov

Hello!
I have Open MPI v1.9a1r32142 and slurm 2.5.6.

I cannot use mpirun after salloc:

$salloc -N2 --exclusive -p test -J ompi
$LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so 
mpirun -np 1 hello_c
-
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--
But if I use mpirun in an sbatch script, it works correctly:
$cat ompi_mxm3.0
#!/bin/sh
LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so  
mpirun  -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --map-by slot:pe=8 "$@"

$sbatch -N2  --exclusive -p test -J ompi  ompi_mxm3.0 ./hello_c
Submitted batch job 645039
$cat slurm-645039.out 
[warn] Epoll ADD(1) on fd 0 failed.  Old events were 0; read change was 1 
(add); write change was 0 (none): Operation not permitted
[warn] Epoll ADD(4) on fd 1 failed.  Old events were 0; read change was 0 
(none); write change was 1 (add): Operation not permitted
Hello, world, I am 0 of 2, (Open MPI v1.9a1, package: Open MPI 
semenov@compiler-2 Distribution, ident: 1.9a1r32142, repo rev: r32142, Jul 04, 
2014 (nightly snapshot tarball), 146)
Hello, world, I am 1 of 2, (Open MPI v1.9a1, package: Open MPI 
semenov@compiler-2 Distribution, ident: 1.9a1r32142, repo rev: r32142, Jul 04, 
2014 (nightly snapshot tarball), 146)

Regards,
Timur

Re: [OMPI users] Salloc and mpirun problem

2014-07-16 Thread Timur Ismagilov

Here it is:

$ 
LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so  
mpirun  -x LD_PRELOAD --mca plm_base_verbose 10 --debug-daemons -np 1 hello_c

[access1:29064] mca: base: components_register: registering plm components
[access1:29064] mca: base: components_register: found loaded component isolated
[access1:29064] mca: base: components_register: component isolated has no 
register or open function
[access1:29064] mca: base: components_register: found loaded component rsh
[access1:29064] mca: base: components_register: component rsh register function 
successful
[access1:29064] mca: base: components_register: found loaded component slurm
[access1:29064] mca: base: components_register: component slurm register 
function successful
[access1:29064] mca: base: components_open: opening plm components
[access1:29064] mca: base: components_open: found loaded component isolated
[access1:29064] mca: base: components_open: component isolated open function 
successful
[access1:29064] mca: base: components_open: found loaded component rsh
[access1:29064] mca: base: components_open: component rsh open function 
successful
[access1:29064] mca: base: components_open: found loaded component slurm
[access1:29064] mca: base: components_open: component slurm open function 
successful
[access1:29064] mca:base:select: Auto-selecting plm components
[access1:29064] mca:base:select:(  plm) Querying component [isolated]
[access1:29064] mca:base:select:(  plm) Query of component [isolated] set 
priority to 0
[access1:29064] mca:base:select:(  plm) Querying component [rsh]
[access1:29064] mca:base:select:(  plm) Query of component [rsh] set priority 
to 10
[access1:29064] mca:base:select:(  plm) Querying component [slurm]
[access1:29064] mca:base:select:(  plm) Query of component [slurm] set priority 
to 75
[access1:29064] mca:base:select:(  plm) Selected component [slurm]
[access1:29064] mca: base: close: component isolated closed
[access1:29064] mca: base: close: unloading component isolated
[access1:29064] mca: base: close: component rsh closed
[access1:29064] mca: base: close: unloading component rsh
Daemon was launched on node1-128-17 - beginning to initialize
Daemon was launched on node1-128-18 - beginning to initialize
Daemon [[63607,0],2] checking in as pid 24538 on host node1-128-18
[node1-128-18:24538] [[63607,0],2] orted: up and running - waiting for commands!
Daemon [[63607,0],1] checking in as pid 17192 on host node1-128-17
[node1-128-17:17192] [[63607,0],1] orted: up and running - waiting for commands!
srun: error: node1-128-18: task 1: Exited with exit code 1
srun: Terminating job step 645191.1
srun: error: node1-128-17: task 0: Exited with exit code 1
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--
[access1:29064] [[63607,0],0] orted_cmd: received halt_vm cmd
[access1:29064] mca: base: close: component slurm closed
[access1:29064] mca: base: close: unloading component slurm


Wed, 16 Jul 2014 14:20:33 +0300 from Mike Dubman:
>please add following flags to mpirun "--mca plm_base_verbose 10 
>--debug-daemons" and attach output.
>Thx
>
>
>On Wed, Jul 16, 2014 at 11:12 AM, Timur Ismagilov  < tismagi...@mail.ru > 
>wrote:
>>Hello!
>>I have Open MPI v1.9a1r32142 and slurm 2.5.6.
>>
>>I can not use mpirun after salloc:
>>
>>$salloc -N2 --exclusive -p test -J ompi
>>$LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
>> mpirun -np 1 hello_c
>>-
>>An ORTE daemon has unexpectedly failed after launch and before
>>communicating back to mpirun. This could be caused by a number
>>of factors, including an inability to create a connection back
>>to mpirun due to a lack of common network interfaces and/or no
>>route found between them. Please check network connectivity
>>(including firewalls and network routing requirements).
>>--
>>But if I use mpirun in an sbatch script, it works correctly:
>>$cat ompi_mxm3.0
>>#!/bin/sh
>>LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
>>  mpirun  -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --map-by slot:pe=8 "$@"
>>
>>$sbatch -N2  --exclusive -p test -J ompi  ompi_mxm3.0 ./hel

Re: [OMPI users] Salloc and mpirun problem

2014-07-17 Thread Timur Ismagilov

With Open MPI 1.9a1r32252 (Jul 16, 2014 nightly snapshot tarball) I get this 
output (the same as before?):
$ salloc -N2 --exclusive -p test -J ompi
salloc: Granted job allocation 645686

$LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so 
 mpirun  -mca mca_base_env_list 'LD_PRELOAD'  --mca plm_base_verbose 10 
--debug-daemons -np 1 hello_c
[access1:04312] mca: base: components_register: registering plm components
[access1:04312] mca: base: components_register: found loaded component isolated
[access1:04312] mca: base: components_register: component isolated has no 
register or open function
[access1:04312] mca: base: components_register: found loaded component rsh
[access1:04312] mca: base: components_register: component rsh register function 
successful
[access1:04312] mca: base: components_register: found loaded component slurm
[access1:04312] mca: base: components_register: component slurm register 
function successful
[access1:04312] mca: base: components_open: opening plm components
[access1:04312] mca: base: components_open: found loaded component isolated
[access1:04312] mca: base: components_open: component isolated open function 
successful
[access1:04312] mca: base: components_open: found loaded component rsh
[access1:04312] mca: base: components_open: component rsh open function 
successful
[access1:04312] mca: base: components_open: found loaded component slurm
[access1:04312] mca: base: components_open: component slurm open function 
successful
[access1:04312] mca:base:select: Auto-selecting plm components
[access1:04312] mca:base:select:( plm) Querying component [isolated]
[access1:04312] mca:base:select:( plm) Query of component [isolated] set 
priority to 0
[access1:04312] mca:base:select:( plm) Querying component [rsh]
[access1:04312] mca:base:select:( plm) Query of component [rsh] set priority to 
10
[access1:04312] mca:base:select:( plm) Querying component [slurm]
[access1:04312] mca:base:select:( plm) Query of component [slurm] set priority 
to 75
[access1:04312] mca:base:select:( plm) Selected component [slurm]
[access1:04312] mca: base: close: component isolated closed
[access1:04312] mca: base: close: unloading component isolated
[access1:04312] mca: base: close: component rsh closed
[access1:04312] mca: base: close: unloading component rsh
Daemon was launched on node1-128-09 - beginning to initialize
Daemon was launched on node1-128-15 - beginning to initialize
Daemon [[39207,0],1] checking in as pid 26240 on host node1-128-09
[node1-128-09:26240] [[39207,0],1] orted: up and running - waiting for commands!
Daemon [[39207,0],2] checking in as pid 30129 on host node1-128-15
[node1-128-15:30129] [[39207,0],2] orted: up and running - waiting for commands!
srun: error: node1-128-09: task 0: Exited with exit code 1
srun: Terminating job step 645686.3
srun: error: node1-128-15: task 1: Exited with exit code 1
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--
[access1:04312] [[39207,0],0] orted_cmd: received halt_vm cmd
[access1:04312] mca: base: close: component slurm closed
[access1:04312] mca: base: close: unloading component slurm


Thu, 17 Jul 2014 11:40:24 +0300 from Mike Dubman:
>can you use latest ompi-1.8 from svn/git?
>Ralph - could you please suggest.
>Thx
>
>
>On Wed, Jul 16, 2014 at 2:48 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>Here it is:
>>
>>$ 
>>LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
>>  mpirun  -x LD_PRELOAD --mca plm_base_verbose 10 --debug-daemons -np 1 
>>hello_c
>>
>>[access1:29064] mca: base: components_register: registering plm components
>>[access1:29064] mca: base: components_register: found loaded component 
>>isolated
>>[access1:29064] mca: base: components_register: component isolated has no 
>>register or open function
>>[access1:29064] mca: base: components_register: found loaded component rsh
>>[access1:29064] mca: base: components_register: component rsh register 
>>function successful
>>[access1:29064] mca: base: components_register: found loaded component slurm
>>[access1:29064] mca: base: components_register: component slurm register 
>>function successful
>>[access1:29064] mca: base: components_open: opening plm components
>>[access1:29064] mca: base: components_open: found loaded component isolated
>>[access1:29064] mca: base: components_open: component isolated open 

[OMPI users] Fwd: Re[4]: Salloc and mpirun problem

2014-07-20 Thread Timur Ismagilov

I have the same problem with Open MPI 1.8.1 (Apr 23, 2014).
Does the srun command have an equivalent of mpirun's --map-by parameter, or can I 
change the mapping from the bash environment?
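(On the Open MPI side the --map-by policy can also be supplied from the environment through the MCA prefix; a hedged sketch, assuming the underlying parameter name is rmaps_base_mapping_policy:)
$ export OMPI_MCA_rmaps_base_mapping_policy=slot:pe=8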


 Forwarded message 
From: Timur Ismagilov 
To: Mike Dubman 
Cc: Open MPI Users 
Date: Thu, 17 Jul 2014 16:42:24 +0400
Subject: Re[4]: [OMPI users] Salloc and mpirun problem

With Open MPI 1.9a1r32252 (Jul 16, 2014 nightly snapshot tarball) I get this 
output (the same as before?):
$ salloc -N2 --exclusive -p test -J ompi
salloc: Granted job allocation 645686

$LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so 
 mpirun  -mca mca_base_env_list 'LD_PRELOAD'  --mca plm_base_verbose 10 
--debug-daemons -np 1 hello_c
[access1:04312] mca: base: components_register: registering plm components
[access1:04312] mca: base: components_register: found loaded component isolated
[access1:04312] mca: base: components_register: component isolated has no 
register or open function
[access1:04312] mca: base: components_register: found loaded component rsh
[access1:04312] mca: base: components_register: component rsh register function 
successful
[access1:04312] mca: base: components_register: found loaded component slurm
[access1:04312] mca: base: components_register: component slurm register 
function successful
[access1:04312] mca: base: components_open: opening plm components
[access1:04312] mca: base: components_open: found loaded component isolated
[access1:04312] mca: base: components_open: component isolated open function 
successful
[access1:04312] mca: base: components_open: found loaded component rsh
[access1:04312] mca: base: components_open: component rsh open function 
successful
[access1:04312] mca: base: components_open: found loaded component slurm
[access1:04312] mca: base: components_open: component slurm open function 
successful
[access1:04312] mca:base:select: Auto-selecting plm components
[access1:04312] mca:base:select:( plm) Querying component [isolated]
[access1:04312] mca:base:select:( plm) Query of component [isolated] set 
priority to 0
[access1:04312] mca:base:select:( plm) Querying component [rsh]
[access1:04312] mca:base:select:( plm) Query of component [rsh] set priority to 
10
[access1:04312] mca:base:select:( plm) Querying component [slurm]
[access1:04312] mca:base:select:( plm) Query of component [slurm] set priority 
to 75
[access1:04312] mca:base:select:( plm) Selected component [slurm]
[access1:04312] mca: base: close: component isolated closed
[access1:04312] mca: base: close: unloading component isolated
[access1:04312] mca: base: close: component rsh closed
[access1:04312] mca: base: close: unloading component rsh
Daemon was launched on node1-128-09 - beginning to initialize
Daemon was launched on node1-128-15 - beginning to initialize
Daemon [[39207,0],1] checking in as pid 26240 on host node1-128-09
[node1-128-09:26240] [[39207,0],1] orted: up and running - waiting for commands!
Daemon [[39207,0],2] checking in as pid 30129 on host node1-128-15
[node1-128-15:30129] [[39207,0],2] orted: up and running - waiting for commands!
srun: error: node1-128-09: task 0: Exited with exit code 1
srun: Terminating job step 645686.3
srun: error: node1-128-15: task 1: Exited with exit code 1
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--
[access1:04312] [[39207,0],0] orted_cmd: received halt_vm cmd
[access1:04312] mca: base: close: component slurm closed
[access1:04312] mca: base: close: unloading component slurm


Thu, 17 Jul 2014 11:40:24 +0300 from Mike Dubman:
>can you use latest ompi-1.8 from svn/git?
>Ralph - could you please suggest.
>Thx
>
>
>On Wed, Jul 16, 2014 at 2:48 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>Here it is:
>>
>>$ 
>>LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
>>  mpirun  -x LD_PRELOAD --mca plm_base_verbose 10 --debug-daemons -np 1 
>>hello_c
>>
>>[access1:29064] mca: base: components_register: registering plm components
>>[access1:29064] mca: base: components_register: found loaded component 
>>isolated
>>[access1:29064] mca: base: components_register: component isolated has no 
>>register or open function
>>[access1:29064] mca: base: components_register: found loaded component rsh
>>[access1:29064] mca: base: components_register: component rsh register 
>>function successful
>>[access1:29064] mca: base: components_register: found loaded 

Re: [OMPI users] Fwd: Re[4]: Salloc and mpirun problem

2014-07-20 Thread Timur Ismagilov



 Forwarded message 
From: Timur Ismagilov 
To: Ralph Castain 
Date: Sun, 20 Jul 2014 21:58:41 +0400
Subject: Re[2]: [OMPI users] Fwd: Re[4]: Salloc and mpirun problem

Here it is:
$ salloc -N2 --exclusive -p test -J ompi
salloc: Granted job allocation 647049

$ mpirun -mca mca_base_env_list 'LD_PRELOAD' -mca oob_base_verbose 10 -mca 
rml_base_verbose 10 -np 2 hello_c
[access1:24264] mca: base: components_register: registering oob components
[access1:24264] mca: base: components_register: found loaded component tcp
[access1:24264] mca: base: components_register: component tcp register function 
successful
[access1:24264] mca: base: components_open: opening oob components
[access1:24264] mca: base: components_open: found loaded component tcp
[access1:24264] mca: base: components_open: component tcp open function 
successful
[access1:24264] mca:oob:select: checking available component tcp
[access1:24264] mca:oob:select: Querying component [tcp]
[access1:24264] oob:tcp: component_available called
[access1:24264] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[access1:24264] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
[access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.251.51 to our list of V4 
connections
[access1:24264] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
[access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.0.111 to our list of V4 
connections
[access1:24264] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[access1:24264] [[55095,0],0] oob:tcp:init adding 10.2.251.11 to our list of V4 
connections
[access1:24264] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
[access1:24264] [[55095,0],0] oob:tcp:init adding 10.128.0.1 to our list of V4 
connections
[access1:24264] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
[access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list of V4 
connections
[access1:24264] WORKING INTERFACE 7 KERNEL INDEX 7 FAMILY: V4
[access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list of V4 
connections
[access1:24264] [[55095,0],0] TCP STARTUP
[access1:24264] [[55095,0],0] attempting to bind to IPv4 port 0
[access1:24264] [[55095,0],0] assigned IPv4 port 47756
[access1:24264] mca:oob:select: Adding component to end
[access1:24264] mca:oob:select: Found 1 active transports
[access1:24264] mca: base: components_register: registering rml components
[access1:24264] mca: base: components_register: found loaded component oob
[access1:24264] mca: base: components_register: component oob has no register 
or open function
[access1:24264] mca: base: components_open: opening rml components
[access1:24264] mca: base: components_open: found loaded component oob
[access1:24264] mca: base: components_open: component oob open function 
successful
[access1:24264] orte_rml_base_select: initializing rml component oob
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 30 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 15 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 32 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 33 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 5 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 10 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 12 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 9 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 34 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 2 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 21 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 22 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 45 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 46 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 1 for peer 
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],

Re: [OMPI users] Fwd: Re[4]: Salloc and mpirun problem

2014-07-21 Thread Timur Ismagilov
 component oob
[node1-128-17:14767] [[61806,0],1] TCP SHUTDOWN
[node1-128-17:14767] [[61806,0],1] RELEASING PEER OBJ [[61806,0],0]
[node1-128-17:14767] [[61806,0],1] CLOSING SOCKET 10
[node1-128-17:14767] mca: base: close: component tcp closed
[node1-128-17:14767] mca: base: close: unloading component tcp
srun: error: node1-128-17: task 0: Exited with exit code 1
[node1-128-17:14779] [[65177,0],1] tcp:send_handler called to send to peer 
[[65177,0],0]
[node1-128-17:14779] [[65177,0],1] tcp:send_handler CONNECTING
[node1-128-17:14779] [[65177,0],1]:tcp:complete_connect called for peer 
[[65177,0],0] on socket 10
[node1-128-17:14779] [[65177,0],1]-[[65177,0],0] tcp_peer_complete_connect: 
connection failed: Connection timed out (110)
[node1-128-17:14779] [[65177,0],1] tcp_peer_close for [[65177,0],0] sd 10 state 
CONNECTING
[node1-128-17:14779] [[65177,0],1] tcp:lost connection called for peer 
[[65177,0],0]
[node1-128-17:14779] mca: base: close: component oob closed
[node1-128-17:14779] mca: base: close: unloading component oob
[node1-128-17:14779] [[65177,0],1] TCP SHUTDOWN
[node1-128-17:14779] [[65177,0],1] RELEASING PEER OBJ [[65177,0],0]
[node1-128-17:14779] [[65177,0],1] CLOSING SOCKET 10
[node1-128-17:14779] mca: base: close: component tcp closed
[node1-128-17:14779] mca: base: close: unloading component tcp
[node1-128-18:17849] [[65177,0],2] tcp:send_handler called to send to peer 
[[65177,0],0]
[node1-128-18:17849] [[65177,0],2] tcp:send_handler CONNECTING
[node1-128-18:17849] [[65177,0],2]:tcp:complete_connect called for peer 
[[65177,0],0] on socket 10
[node1-128-18:17849] [[65177,0],2]-[[65177,0],0] tcp_peer_complete_connect: 
connection failed: Connection timed out (110)
[node1-128-18:17849] [[65177,0],2] tcp_peer_close for [[65177,0],0] sd 10 state 
CONNECTING
[node1-128-18:17849] [[65177,0],2] tcp:lost connection called for peer 
[[65177,0],0]
[node1-128-18:17849] mca: base: close: component oob closed
[node1-128-18:17849] mca: base: close: unloading component oob
[node1-128-18:17849] [[65177,0],2] TCP SHUTDOWN
[node1-128-18:17849] [[65177,0],2] RELEASING PEER OBJ [[65177,0],0]
[node1-128-18:17849] [[65177,0],2] CLOSING SOCKET 10
[node1-128-18:17849] mca: base: close: component tcp closed
[node1-128-18:17849] mca: base: close: unloading component tcp
srun: error: node1-128-17: task 0: Exited with exit code 1
srun: Terminating job step 647191.2
srun: error: node1-128-18: task 1: Exited with exit code 1
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--
[compiler-2:30735] [[65177,0],0] orted_cmd: received halt_vm cmd
[compiler-2:30735] mca: base: close: component oob closed
[compiler-2:30735] mca: base: close: unloading component oob
[compiler-2:30735] [[65177,0],0] TCP SHUTDOWN
[compiler-2:30735] mca: base: close: component tcp closed
[compiler-2:30735] mca: base: close: unloading component tcp
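(To answer the NIC question quoted below, a hedged way to list the IPv4 interfaces on the allocated compute nodes from inside the salloc session, assuming the ip tool is present on them:)
$ srun -N2 sh -c 'hostname; ip -4 addr show'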


Sun, 20 Jul 2014 13:11:19 -0700 from Ralph Castain:
>Yeah, we aren't connecting back - is there a firewall running?  You need to 
>leave the "--debug-daemons --mca plm_base_verbose 5" on there as well to see 
>the entire problem.
>
>What you can see here is that mpirun is listening on several interfaces:
>>[access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.251.51 to our list of 
>>V4 connections
>>[access1:24264] [[55095,0],0] oob:tcp:init adding 10.2.251.11 to our list of 
>>V4 connections
>>[access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.0.111 to our list of 
>>V4 connections
>>[access1:24264] [[55095,0],0] oob:tcp:init adding 10.128.0.1 to our list of 
>>V4 connections
>>[access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list of 
>>V4 connections
>
>It looks like you have multiple interfaces connected to the same subnet - this 
>is generally a bad idea. I also saw that the last one in the list shows up 
>twice in the kernel array - not sure why, but is there something special about 
>that NIC?
>
>What do the NICs look like on the remote hosts?
>
>On Jul 20, 2014, at 10:59 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>
>>
>>
>> Forwarded message 
>>From: Timur Ismagilov < tismagi...@mail.ru >
>>To: Ralph Castain < r...@open-mpi.org >
>>Date: Sun, 20 Jul 2014 21:58:41 +0400
>>Subject: Re[2]: [OMPI users] Fwd: Re[4]: Salloc and mpirun problem
>>
>>Here it is:
>>$ salloc -N2 -

Re: [OMPI users] Salloc and mpirun problem

2014-07-23 Thread Timur Ismagilov
Thanks, Ralph!
When I add --mca oob_tcp_if_include ib0 (where ib0 is the InfiniBand interface from 
ifconfig) to mpirun, it starts working correctly!
Why doesn't Open MPI do this by itself?
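(To avoid adding the flag to every launch, the same setting can be made persistent through the per-user MCA parameter file or the environment; a hedged sketch, assuming the default file location:)
$ echo 'oob_tcp_if_include = ib0' >> ~/.openmpi/mca-params.conf
$ export OMPI_MCA_oob_tcp_if_include=ib0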

Tue, 22 Jul 2014 11:26:16 -0700 from Ralph Castain:
>Okay, the problem is that the connection back to mpirun isn't getting thru. We 
>are trying on the 10.0.251.53 address - is that blocked, or should we be using 
>something else? If so, you might want to direct us by adding "-mca 
>oob_tcp_if_include foo", where foo is the interface you want us to use
>
>
>On Jul 20, 2014, at 10:24 PM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>NIC = network interface controller? 
>>There is QDR Infiniband 4x/10G Ethernet/Gigabit Ethernet.  
>>I want to use  QDR Infiniband.
>>Here is a new output:
>>$ mpirun -mca mca_base_env_list 'LD_PRELOAD' --debug-daemons --mca 
>>plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 2 
>>hello_c |tee hello.out
>>Warning: Conflicting CPU frequencies detected, using: 2927.00.
>>[compiler-2:30735] mca:base:select:( plm) Querying component [isolated]
>>[compiler-2:30735] mca:base:select:( plm) Query of component [isolated] set 
>>priority to 0
>>[compiler-2:30735] mca:base:select:( plm) Querying component [rsh]
>>[compiler-2:30735] mca:base:select:( plm) Query of component [rsh] set 
>>priority to 10
>>[compiler-2:30735] mca:base:select:( plm) Querying component [slurm]
>>[compiler-2:30735] mca:base:select:( plm) Query of component [slurm] set 
>>priority to 75
>>[compiler-2:30735] mca:base:select:( plm) Selected component [slurm]
>>[compiler-2:30735] mca: base: components_register: registering oob components
>>[compiler-2:30735] mca: base: components_register: found loaded component tcp
>>[compiler-2:30735] mca: base: components_register: component tcp register 
>>function successful
>>[compiler-2:30735] mca: base: components_open: opening oob components
>>[compiler-2:30735] mca: base: components_open: found loaded component tcp
>>[compiler-2:30735] mca: base: components_open: component tcp open function 
>>successful
>>[compiler-2:30735] mca:oob:select: checking available component tcp
>>[compiler-2:30735] mca:oob:select: Querying component [tcp]
>>[compiler-2:30735] oob:tcp: component_available called
>>[compiler-2:30735] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>[compiler-2:30735] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>[compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.0.251.53 to our list 
>>of V4 connections
>>[compiler-2:30735] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>[compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.0.0.4 to our list of 
>>V4 connections
>>[compiler-2:30735] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>[compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.2.251.14 to our list 
>>of V4 connections
>>[compiler-2:30735] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>[compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.128.0.4 to our list 
>>of V4 connections
>>[compiler-2:30735] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>>[compiler-2:30735] [[65177,0],0] oob:tcp:init adding 93.180.7.38 to our list 
>>of V4 connections
>>[compiler-2:30735] [[65177,0],0] TCP STARTUP
>>[compiler-2:30735] [[65177,0],0] attempting to bind to IPv4 port 0
>>[compiler-2:30735] [[65177,0],0] assigned IPv4 port 49759
>>[compiler-2:30735] mca:oob:select: Adding component to end
>>[compiler-2:30735] mca:oob:select: Found 1 active transports
>>[compiler-2:30735] mca: base: components_register: registering rml components
>>[compiler-2:30735] mca: base: components_register: found loaded component oob
>>[compiler-2:30735] mca: base: components_register: component oob has no 
>>register or open function
>>[compiler-2:30735] mca: base: components_open: opening rml components
>>[compiler-2:30735] mca: base: components_open: found loaded component oob
>>[compiler-2:30735] mca: base: components_open: component oob open function 
>>successful
>>[compiler-2:30735] orte_rml_base_select: initializing rml component oob
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 30 for peer 
>>[[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 15 for peer 
>>[[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 32 for peer 
>>[[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177

[OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-12 Thread Timur Ismagilov

Hello!

I have Open MPI  v1.8.2rc4r32485

When I run hello_c, I get this error message:
$mpirun  -np 2 hello_c

An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).

When I run with --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 
10 -mca rml_base_verbose 10, I get this output:
$mpirun  --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca 
rml_base_verbose 10   -np 2 hello_c
[compiler-2:08780] mca:base:select:( plm) Querying component [isolated]
[compiler-2:08780] mca:base:select:( plm) Query of component [isolated] set 
priority to 0
[compiler-2:08780] mca:base:select:( plm) Querying component [rsh]
[compiler-2:08780] mca:base:select:( plm) Query of component [rsh] set priority 
to 10
[compiler-2:08780] mca:base:select:( plm) Querying component [slurm]
[compiler-2:08780] mca:base:select:( plm) Query of component [slurm] set 
priority to 75
[compiler-2:08780] mca:base:select:( plm) Selected component [slurm]
[compiler-2:08780] mca: base: components_register: registering oob components
[compiler-2:08780] mca: base: components_register: found loaded component tcp
[compiler-2:08780] mca: base: components_register: component tcp register 
function successful
[compiler-2:08780] mca: base: components_open: opening oob components
[compiler-2:08780] mca: base: components_open: found loaded component tcp
[compiler-2:08780] mca: base: components_open: component tcp open function 
successful
[compiler-2:08780] mca:oob:select: checking available component tcp
[compiler-2:08780] mca:oob:select: Querying component [tcp]
[compiler-2:08780] oob:tcp: component_available called
[compiler-2:08780] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[compiler-2:08780] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
[compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.251.53 to our list of 
V4 connections
[compiler-2:08780] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
[compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.0.0.4 to our list of V4 
connections
[compiler-2:08780] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.2.251.14 to our list of 
V4 connections
[compiler-2:08780] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
[compiler-2:08780] [[42202,0],0] oob:tcp:init adding 10.128.0.4 to our list of 
V4 connections
[compiler-2:08780] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
[compiler-2:08780] [[42202,0],0] oob:tcp:init adding 93.180.7.38 to our list of 
V4 connections
[compiler-2:08780] [[42202,0],0] TCP STARTUP
[compiler-2:08780] [[42202,0],0] attempting to bind to IPv4 port 0
[compiler-2:08780] [[42202,0],0] assigned IPv4 port 38420
[compiler-2:08780] mca:oob:select: Adding component to end
[compiler-2:08780] mca:oob:select: Found 1 active transports
[compiler-2:08780] mca: base: components_register: registering rml components
[compiler-2:08780] mca: base: components_register: found loaded component oob
[compiler-2:08780] mca: base: components_register: component oob has no 
register or open function
[compiler-2:08780] mca: base: components_open: opening rml components
[compiler-2:08780] mca: base: components_open: found loaded component oob
[compiler-2:08780] mca: base: components_open: component oob open function 
successful
[compiler-2:08780] orte_rml_base_select: initializing rml component oob
[compiler-2:08780] [[42202,0],0] posting recv
[compiler-2:08780] [[42202,0],0] posting persistent recv on tag 30 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:08780] [[42202,0],0] posting recv
[compiler-2:08780] [[42202,0],0] posting persistent recv on tag 15 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:08780] [[42202,0],0] posting recv
[compiler-2:08780] [[42202,0],0] posting persistent recv on tag 32 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:08780] [[42202,0],0] posting recv
[compiler-2:08780] [[42202,0],0] posting persistent recv on tag 33 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:08780] [[42202,0],0] posting recv
[compiler-2:08780] [[42202,0],0] posting persistent recv on tag 5 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:08780] [[42202,0],0] posting recv
[compiler-2:08780] [[42202,0],0] posting persistent recv on tag 10 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:08780] [[42202,0],0] posting recv
[compiler-2:08780] [[42202,0],0] posting persistent recv on tag 12 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:08780] [[42202,0],0] posting recv
[compiler-2:08780] [[42202,0],0] posting persistent recv on tag 9 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:08780] [[42202,0],0] posting recv
[compiler-2:08780] [[42202,0],0] posting persistent recv on tag 34 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:08780] [[42202,0],0] posting recv
[c

[OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-12 Thread Timur Ismagilov
 on tag 27 for peer 
[[WILDCARD],WILDCARD]
Daemon was launched on node1-128-01 - beginning to initialize
Daemon was launched on node1-128-02 - beginning to initialize
--
WARNING: An invalid value was given for oob_tcp_if_include. This
value will be ignored.
Local host: node1-128-01
Value: "ib0"
Message: Invalid specification (missing "/")
--
--
WARNING: An invalid value was given for oob_tcp_if_include. This
value will be ignored.
Local host: node1-128-02
Value: "ib0"
Message: Invalid specification (missing "/")
--
--
None of the TCP networks specified to be included for out-of-band communications
could be found:
Value given:
Please revise the specification and try again.
--
--
None of the TCP networks specified to be included for out-of-band communications
could be found:
Value given:
Please revise the specification and try again.
--
--
No network interfaces were found for out-of-band communications. We require
at least one available network for out-of-band messaging.
--
--
No network interfaces were found for out-of-band communications. We require
at least one available network for out-of-band messaging.
--
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_oob_base_select failed
--> Returned value (null) (-43) instead of ORTE_SUCCESS
--
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_oob_base_select failed
--> Returned value (null) (-43) instead of ORTE_SUCCESS
--
srun: error: node1-128-02: task 1: Exited with exit code 213
srun: Terminating job step 657300.0
srun: error: node1-128-01: task 0: Exited with exit code 213
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--
[compiler-2:08792] [[42190,0],0] orted_cmd: received halt_vm cmd
[compiler-2:08792] mca: base: close: component oob closed
[compiler-2:08792] mca: base: close: unloading component oob
[compiler-2:08792] [[42190,0],0] TCP SHUTDOWN
[compiler-2:08792] mca: base: close: component tcp closed
[compiler-2:08792] mca: base: close: unloading component tcp

Tue, 12 Aug 2014 16:14:58 +0400 from Timur Ismagilov:
>Hello!
>
>I have Open MPI  v1.8.2rc4r32485
>
>When i run hello_c, I got this error message
>$mpirun  -np 2 hello_c
>
>An ORTE daemon has unexpectedly failed after launch and before
>communicating back to mpirun. This could be caused by a number
>of factors, including an inability to create a connection back
>to mpirun due to a lack of common network interfaces and/or no
>route found between them. Please check network connectivity
>(including firewalls and network routing requirements).
>
>When i run with --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 
>10 -mca rml_base_verbose 10 i got this output:
>
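For what it's worth, the "Invalid specification (missing "/")" warning above suggests that this snapshot parses oob_tcp_if_include as a CIDR-style subnet rather than an interface name. A hypothetical alternative spelling (assuming ib0 carries 10.128.0.0/16, as the ifconfig output later in this thread shows; not verified against this build) would be:

$ mpirun --mca oob_tcp_if_include 10.128.0.0/16 -np 2 ./hello_c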

[OMPI users] mpi+openshmem hybrid

2014-08-14 Thread Timur Ismagilov

Hello!
I use Open MPI v1.9a1r32520.

Can I use hybrid MPI+OpenSHMEM?
Where can I read about it?

I have some problems with a simple program:
#include <stdio.h>
#include "shmem.h"
#include "mpi.h"
int main(int argc, char* argv[])
{
int proc, nproc;
int rank, size, len;
char version[MPI_MAX_LIBRARY_VERSION_STRING];
MPI_Init(&argc, &argv);
start_pes(0);
MPI_Finalize();
return 0;
}

I compile with oshcc; with mpicc I get a compile error.

1. When I run this program with mpirun/oshrun, I get this output:

[1408002416.687274] [node1-130-01:26354:0] proto.c:64 MXM WARN mxm is destroyed 
but still has pending receive requests
[1408002416.687604] [node1-130-01:26355:0] proto.c:64 MXM WARN mxm is destroyed 
but still has pending receive requests

2. If, in the program, I use this code:
start_pes(0);
MPI_Init(&argc, &argv);
MPI_Finalize();

I get this error:
--
Calling MPI_Init or MPI_Init_thread twice is erroneous.
--
[node1-130-01:26469] *** An error occurred in MPI_Init
[node1-130-01:26469] *** reported by process [2397634561,140733193388033]
[node1-130-01:26469] *** on communicator MPI_COMM_WORLD
[node1-130-01:26469] *** MPI_ERR_OTHER: known error not in list
[node1-130-01:26469] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,
[node1-130-01:26469] *** and potentially your MPI job)
[node1-130-01:26468] [[36585,1],0] ORTE_ERROR_LOG: Not found in file 
routed_radix.c at line 395
[node1-130-01:26469] [[36585,1],1] ORTE_ERROR_LOG: Not found in file 
routed_radix.c at line 395
[compiler-2:02175] 1 more process has sent help message help-mpi-errors.txt / 
mpi_errors_are_fatal
[compiler-2:02175] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
help / error messages

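A minimal hybrid sketch of the ordering that avoids the double-initialization error above. It assumes, as the "MPI_Init ... twice" message suggests, that start_pes() already initializes MPI internally in this oshmem build, so MPI_Init has to come first; everything beyond the MPI/SHMEM calls themselves is placeholder code:

#include <stdio.h>
#include "shmem.h"
#include "mpi.h"

int main(int argc, char* argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);   /* initialize MPI first ...                    */
    start_pes(0);             /* ... then the SHMEM layer; the reverse order */
                              /* trips the "called twice" check seen above   */

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("MPI rank %d of %d, SHMEM PE %d of %d\n",
           rank, size, _my_pe(), _num_pes());

    MPI_Finalize();           /* this OpenSHMEM generation has no separate   */
                              /* shmem finalize call                         */
    return 0;
}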

Re: [OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-20 Thread Timur Ismagilov
] [[49095,0],0] posting persistent recv on tag 9 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 34 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 2 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 21 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 22 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 45 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 46 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 1 for peer 
[[WILDCARD],WILDCARD]
[compiler-2:14673] [[49095,0],0] posting recv
[compiler-2:14673] [[49095,0],0] posting persistent recv on tag 27 for peer 
[[WILDCARD],WILDCARD]
Daemon was launched on node1-128-01 - beginning to initialize
--
WARNING: An invalid value was given for oob_tcp_if_include. This
value will be ignored.
Local host: node1-128-01
Value: "ib0"
Message: Invalid specification (missing "/")
--
--
None of the TCP networks specified to be included for out-of-band communications
could be found:
Value given:
Please revise the specification and try again.
--
--
No network interfaces were found for out-of-band communications. We require
at least one available network for out-of-band messaging.
--
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_oob_base_select failed
--> Returned value (null) (-43) instead of ORTE_SUCCESS
--
srun: error: node1-128-01: task 0: Exited with exit code 213
srun: Terminating job step 661215.0
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--
[compiler-2:14673] [[49095,0],0] orted_cmd: received halt_vm cmd
[compiler-2:14673] mca: base: close: component oob closed
[compiler-2:14673] mca: base: close: unloading component oob
[compiler-2:14673] [[49095,0],0] TCP SHUTDOWN
[compiler-2:14673] mca: base: close: component tcp closed
[compiler-2:14673] mca: base: close: unloading component tcp


Tue, 12 Aug 2014 18:33:24 + from "Jeff Squyres (jsquyres)":
>I filed the following ticket:
>
> https://svn.open-mpi.org/trac/ompi/ticket/4857
>
>
>On Aug 12, 2014, at 12:39 PM, Jeff Squyres (jsquyres) < jsquy...@cisco.com > 
>wrote:
>
>> (please keep the users list CC'ed)
>> 
>> We talked about this on the weekly engineering call today.  Ralph has an 
>> idea what is happening -- I need to do a little investigation today and file 
>> a bug.  I'll make sure you're CC'ed on the bug ticket.
>> 
>> 
>> 
>> On Aug 12, 2014, at 12:27 PM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>> 
>>> I don't have this error in OMPI 1.9a1r32252 and OMPI 1.8.1 (with --mca 
>>> oob_tcp_if_include ib0), but in all latest night snapshots i got this error.
>>> 
>>> 
>>> Tue, 12 Aug 2014 13:08:12 + от "Jeff Squyres (jsquyres)" < 
>>> jsquy...@cisco.com >:
>>> Are you running any kind of firewall on the node where mpirun is invoked? 
>>> Open MPI needs to be able to use a

Re: [OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-21 Thread Timur Ismagilov
Do I have any way to run MPI jobs?


Wed, 20 Aug 2014 10:48:38 -0700 from Ralph Castain:
>yes, i know - it is cmr'd
>
>On Aug 20, 2014, at 10:26 AM, Mike Dubman < mi...@dev.mellanox.co.il > wrote:
>>btw, we get same error in v1.8 branch as well.
>>
>>
>>On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain  < r...@open-mpi.org > wrote:
>>>It was not yet fixed - but should be now.
>>>
>>>On Aug 20, 2014, at 6:39 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>Hello!
>>>>
>>>>As i can see, the bug is fixed, but in Open MPI v1.9a1r32516  i still have 
>>>>the problem
>>>>
>>>>a)
>>>>$ mpirun  -np 1 ./hello_c
>>>>--
>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>communicating back to mpirun. This could be caused by a number
>>>>of factors, including an inability to create a connection back
>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>route found between them. Please check network connectivity
>>>>(including firewalls and network routing requirements).
>>>>--
>>>>b)
>>>>$ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>>>>--
>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>communicating back to mpirun. This could be caused by a number
>>>>of factors, including an inability to create a connection back
>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>route found between them. Please check network connectivity
>>>>(including firewalls and network routing requirements).
>>>>--
>>>>
>>>>c)
>>>>
>>>>$ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca plm_base_verbose 
>>>>5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1 ./hello_c
>>>>[compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
>>>>[compiler-2:14673] mca:base:select:( plm) Query of component [isolated] set 
>>>>priority to 0
>>>>[compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
>>>>[compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set 
>>>>priority to 10
>>>>[compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
>>>>[compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set 
>>>>priority to 75
>>>>[compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
>>>>[compiler-2:14673] mca: base: components_register: registering oob 
>>>>components
>>>>[compiler-2:14673] mca: base: components_register: found loaded component 
>>>>tcp
>>>>[compiler-2:14673] mca: base: components_register: component tcp register 
>>>>function successful
>>>>[compiler-2:14673] mca: base: components_open: opening oob components
>>>>[compiler-2:14673] mca: base: components_open: found loaded component tcp
>>>>[compiler-2:14673] mca: base: components_open: component tcp open function 
>>>>successful
>>>>[compiler-2:14673] mca:oob:select: checking available component tcp
>>>>[compiler-2:14673] mca:oob:select: Querying component [tcp]
>>>>[compiler-2:14673] oob:tcp: component_available called
>>>>[compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>>>[compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>>>[compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>>>[compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>>>[compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>>>[compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our list 
>>>>of V4 connections
>>>>[compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>>>>[compiler-2:14673] [[49095,0],0] TCP STARTUP
>>>>[compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
>>>>[compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
>>>>[compiler-2:14673] mca:oob:select: Adding component to end
>>>>[compiler-2:14673] mca:oob:select: Found 1 active transports
>>>>[compiler-2:146

[OMPI users] long initialization

2014-08-22 Thread Timur Ismagilov
 Hello!
If I use the latest nightly snapshot:
$ ompi_info -V
Open MPI v1.9a1r32570
*  In the program hello_c, initialization takes ~1 min.
In OMPI 1.8.2rc4 and earlier it takes ~1 sec (or less).
*  If I use
$mpirun  --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' 
--map-by slot:pe=8 -np 1 ./hello_c
I get the error
config_parser.c:657  MXM  ERROR Invalid value for SHM_KCOPY_MODE: 
'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect]
but with -x everything works fine (though with a warning):
$mpirun  -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c
WARNING: The mechanism by which environment variables are explicitly
..
..
..
Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 21, 
2014 (nightly snapshot tarball), 146)
Thu, 21 Aug 2014 06:26:13 -0700 from Ralph Castain:
>Not sure I understand. The problem has been fixed in both the trunk and the 
>1.8 branch now, so you should be able to work with either of those nightly 
>builds.
>
>On Aug 21, 2014, at 12:02 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>Have i I any opportunity to run mpi jobs?
>>
>>
>>Wed, 20 Aug 2014 10:48:38 -0700 от Ralph Castain < r...@open-mpi.org >:
>>>yes, i know - it is cmr'd
>>>
>>>On Aug 20, 2014, at 10:26 AM, Mike Dubman < mi...@dev.mellanox.co.il > wrote:
>>>>btw, we get same error in v1.8 branch as well.
>>>>
>>>>
>>>>On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain   < r...@open-mpi.org >   
>>>>wrote:
>>>>>It was not yet fixed - but should be now.
>>>>>
>>>>>On Aug 20, 2014, at 6:39 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>>>Hello!
>>>>>>
>>>>>>As i can see, the bug is fixed, but in Open MPI v1.9a1r32516  i still 
>>>>>>have the problem
>>>>>>
>>>>>>a)
>>>>>>$ mpirun  -np 1 ./hello_c
>>>>>>--
>>>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>>>communicating back to mpirun. This could be caused by a number
>>>>>>of factors, including an inability to create a connection back
>>>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>>>route found between them. Please check network connectivity
>>>>>>(including firewalls and network routing requirements).
>>>>>>--
>>>>>>b)
>>>>>>$ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>>>>>>--
>>>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>>>communicating back to mpirun. This could be caused by a number
>>>>>>of factors, including an inability to create a connection back
>>>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>>>route found between them. Please check network connectivity
>>>>>>(including firewalls and network routing requirements).
>>>>>>--
>>>>>>
>>>>>>c)
>>>>>>
>>>>>>$ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca 
>>>>>>plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 
>>>>>>1 ./hello_c
>>>>>>[compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
>>>>>>[compiler-2:14673] mca:base:select:( plm) Query of component [isolated] 
>>>>>>set priority to 0
>>>>>>[compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
>>>>>>[compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set 
>>>>>>priority to 10
>>>>>>[compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
>>>>>>[compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set 
>>>>>>priority to 75
>>>>>>[compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
>>>>>>[compiler-2:14673] mca: base: components_register: registering oob 
>>>>>>components
>>>>>>[compiler-2:14673] mca: base: components_re

Re: [OMPI users] long initialization

2014-08-26 Thread Timur Ismagilov

Hello!
Here are my timing results:
$time mpirun -n 1 ./hello_c
Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 21, 
2014 (nightly snapshot tarball), 146)
real 1m3.985s
user 0m0.031s
sys 0m0.083s


Fri, 22 Aug 2014 07:43:03 -0700 from Ralph Castain:
>I'm also puzzled by your timing statement - I can't replicate it:
>
>07:41:43  $ time mpirun -n 1 ./hello_c
>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI rhc@bend001 
>Distribution, ident: 1.9a1r32577, repo rev: r32577, Unreleased developer copy, 
>125)
>
>real 0m0.547s
>user 0m0.043s
>sys 0m0.046s
>
>The entire thing ran in 0.5 seconds
>
>
>On Aug 22, 2014, at 6:33 AM, Mike Dubman < mi...@dev.mellanox.co.il > wrote:
>>Hi,
>>The default delimiter is ";" . You can change delimiter with 
>>mca_base_env_list_delimiter.
>>
>>
>>
>>On Fri, Aug 22, 2014 at 2:59 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>wrote:
>>>Hello!
>>>If i use latest night snapshot:
>>>$ ompi_info -V
>>>Open MPI v1.9a1r32570
>>>*  In programm hello_c initialization takes ~1 min
>>>In ompi 1.8.2rc4 and ealier it takes ~1 sec(or less)
>>>*  if i use 
>>>$mpirun  --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' 
>>>--map-by slot:pe=8 -np 1 ./hello_c
>>>i got error 
>>>config_parser.c:657  MXM  ERROR Invalid value for SHM_KCOPY_MODE: 
>>>'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect]
>>>but with -x all works fine (but with warn)
>>>$mpirun  -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c
>>>WARNING: The mechanism by which environment variables are explicitly
>>>..
>>>..
>>>..
>>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>>>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
>>>21, 2014 (nightly snapshot tarball), 146)
>>>Thu, 21 Aug 2014 06:26:13 -0700 от Ralph Castain < r...@open-mpi.org >:
>>>>Not sure I understand. The problem has been fixed in both the trunk and the 
>>>>1.8 branch now, so you should be able to work with either of those nightly 
>>>>builds.
>>>>
>>>>On Aug 21, 2014, at 12:02 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>>Have i I any opportunity to run mpi jobs?
>>>>>
>>>>>
>>>>>Wed, 20 Aug 2014 10:48:38 -0700 от Ralph Castain < r...@open-mpi.org >:
>>>>>>yes, i know - it is cmr'd
>>>>>>
>>>>>>On Aug 20, 2014, at 10:26 AM, Mike Dubman < mi...@dev.mellanox.co.il > 
>>>>>>wrote:
>>>>>>>btw, we get same error in v1.8 branch as well.
>>>>>>>
>>>>>>>
>>>>>>>On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain   < r...@open-mpi.org >   
>>>>>>>wrote:
>>>>>>>>It was not yet fixed - but should be now.
>>>>>>>>
>>>>>>>>On Aug 20, 2014, at 6:39 AM, Timur Ismagilov < tismagi...@mail.ru > 
>>>>>>>>wrote:
>>>>>>>>>Hello!
>>>>>>>>>
>>>>>>>>>As i can see, the bug is fixed, but in Open MPI v1.9a1r32516  i still 
>>>>>>>>>have the problem
>>>>>>>>>
>>>>>>>>>a)
>>>>>>>>>$ mpirun  -np 1 ./hello_c
>>>>>>>>>--
>>>>>>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>>>>>>communicating back to mpirun. This could be caused by a number
>>>>>>>>>of factors, including an inability to create a connection back
>>>>>>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>>>>>>route found between them. Please check network connectivity
>>>>>>>>>(including firewalls and network routing requirements).
>>>>>>>>>--
>>>>>>>>>b)
>>>>>>>>>$ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>>>>>>>>>---
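For reference, given the note earlier in this thread that the default mca_base_env_list delimiter is ";" (changeable via mca_base_env_list_delimiter), the command that failed in the original report would presumably be written as follows (a sketch, not re-tested here):

$ mpirun --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off;OMP_NUM_THREADS=8' --map-by slot:pe=8 -np 1 ./hello_c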

Re: [OMPI users] long initialization

2014-08-26 Thread Timur Ismagilov

I'm using slurm 2.5.6

$salloc -N8 --exclusive -J ompi -p test
$ srun hostname
node1-128-21
node1-128-24
node1-128-22
node1-128-26
node1-128-27
node1-128-20
node1-128-25
node1-128-23
$ time mpirun -np 1 --host node1-128-21 ./hello_c
Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 21, 
2014 (nightly snapshot tarball), 146)
real 1m3.932s
user 0m0.035s
sys 0m0.072s


Tue, 26 Aug 2014 07:03:58 -0700 from Ralph Castain:
>hmmmwhat is your allocation like? do you have a large hostfile, for 
>example?
>
>if you add a --host argument that contains just the local host, what is the 
>time for that scenario?
>
>On Aug 26, 2014, at 6:27 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>Hello!
>>Here is my time results:
>>$time mpirun -n 1 ./hello_c
>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
>>21, 2014 (nightly snapshot tarball), 146)
>>real 1m3.985s
>>user 0m0.031s
>>sys 0m0.083s
>>
>>
>>Fri, 22 Aug 2014 07:43:03 -0700 от Ralph Castain < r...@open-mpi.org >:
>>>I'm also puzzled by your timing statement - I can't replicate it:
>>>
>>>07:41:43    $ time mpirun -n 1 ./hello_c
>>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI rhc@bend001 
>>>Distribution, ident: 1.9a1r32577, repo rev: r32577, Unreleased developer 
>>>copy, 125)
>>>
>>>real 0m0.547s
>>>user 0m0.043s
>>>sys 0m0.046s
>>>
>>>The entire thing ran in 0.5 seconds
>>>
>>>
>>>On Aug 22, 2014, at 6:33 AM, Mike Dubman < mi...@dev.mellanox.co.il > wrote:
>>>>Hi,
>>>>The default delimiter is ";" . You can change delimiter with 
>>>>mca_base_env_list_delimiter.
>>>>
>>>>
>>>>
>>>>On Fri, Aug 22, 2014 at 2:59 PM, Timur Ismagilov   < tismagi...@mail.ru >   
>>>>wrote:
>>>>>Hello!
>>>>>If i use latest night snapshot:
>>>>>$ ompi_info -V
>>>>>Open MPI v1.9a1r32570
>>>>>*  In programm hello_c initialization takes ~1 min
>>>>>In ompi 1.8.2rc4 and ealier it takes ~1 sec(or less)
>>>>>*  if i use 
>>>>>$mpirun  --mca mca_base_env_list 
>>>>>'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' --map-by slot:pe=8 -np 1 
>>>>>./hello_c
>>>>>i got error 
>>>>>config_parser.c:657  MXM  ERROR Invalid value for SHM_KCOPY_MODE: 
>>>>>'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect]
>>>>>but with -x all works fine (but with warn)
>>>>>$mpirun  -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c
>>>>>WARNING: The mechanism by which environment variables are explicitly
>>>>>..
>>>>>..
>>>>>..
>>>>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>>>>>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug 
>>>>>21, 2014 (nightly snapshot tarball), 146)
>>>>>Thu, 21 Aug 2014 06:26:13 -0700 от Ralph Castain < r...@open-mpi.org >:
>>>>>>Not sure I understand. The problem has been fixed in both the trunk and 
>>>>>>the 1.8 branch now, so you should be able to work with either of those 
>>>>>>nightly builds.
>>>>>>
>>>>>>On Aug 21, 2014, at 12:02 AM, Timur Ismagilov < tismagi...@mail.ru > 
>>>>>>wrote:
>>>>>>>Have i I any opportunity to run mpi jobs?
>>>>>>>
>>>>>>>
>>>>>>>Wed, 20 Aug 2014 10:48:38 -0700 от Ralph Castain < r...@open-mpi.org >:
>>>>>>>>yes, i know - it is cmr'd
>>>>>>>>
>>>>>>>>On Aug 20, 2014, at 10:26 AM, Mike Dubman < mi...@dev.mellanox.co.il > 
>>>>>>>>wrote:
>>>>>>>>>btw, we get same error in v1.8 branch as well.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain   < r...@open-mpi.org > 
>>>>>>>>>  wrote:
>>>>>>>>>>It was not yet fixed - but should be now.
>>>>>>>>>>
>>>>>>>>>>On Aug 20, 2014,

Re: [OMPI users] long initialization

2014-08-27 Thread Timur Ismagilov

When I try to specify the OOB interface with --mca oob_tcp_if_include <interface from ifconfig>, I always get this error:
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
-

Earlier, with OMPI 1.8.1, I could not run MPI jobs without "--mca oob_tcp_if_include ib0"...
but now (OMPI 1.9a1) with this flag I get the above error.

Here is the output of ifconfig:
$ ifconfig
eth1 Link encap:Ethernet HWaddr 00:15:17:EE:89:E1 
inet addr:10.0.251.53 Bcast:10.0.251.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:215087433 errors:0 dropped:0 overruns:0 frame:0
TX packets:2648 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000 
RX bytes:26925754883 (25.0 GiB) TX bytes:137971 (134.7 KiB)
Memory:b2c0-b2c2
eth2 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8 
inet addr:10.0.0.4 Bcast:10.0.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:4892833125 errors:0 dropped:0 overruns:0 frame:0
TX packets:8708606918 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000 
RX bytes:1823986502132 (1.6 TiB) TX bytes:11957754120037 (10.8 TiB)
eth2.911 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8 
inet addr:93.180.7.38 Bcast:93.180.7.63 Mask:255.255.255.224
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3746454225 errors:0 dropped:0 overruns:0 frame:0
TX packets:1131917608 errors:0 dropped:3 overruns:0 carrier:0
collisions:0 txqueuelen:0 
RX bytes:285174723322 (265.5 GiB) TX bytes:11523163526058 (10.4 TiB)
eth3 Link encap:Ethernet HWaddr 00:02:C9:04:73:F9 
inet addr:10.2.251.14 Bcast:10.2.251.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:591156692 errors:0 dropped:56 overruns:56 frame:56
TX packets:679729229 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000 
RX bytes:324195989293 (301.9 GiB) TX bytes:770299202886 (717.3 GiB)
Ifconfig uses the ioctl access method to get the full address information, 
which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed 
correctly.
Ifconfig is obsolete! For replacement check ip.
ib0 Link encap:InfiniBand HWaddr 
80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 
inet addr:10.128.0.4 Bcast:10.128.255.255 Mask:255.255.0.0
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:10843859 errors:0 dropped:0 overruns:0 frame:0
TX packets:8089839 errors:0 dropped:15 overruns:0 carrier:0
collisions:0 txqueuelen:1024 
RX bytes:939249464 (895.7 MiB) TX bytes:886054008 (845.0 MiB)
lo Link encap:Local Loopback 
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:31235107 errors:0 dropped:0 overruns:0 frame:0
TX packets:31235107 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0 
RX bytes:132750916041 (123.6 GiB) TX bytes:132750916041 (123.6 GiB)



Tue, 26 Aug 2014 09:48:35 -0700 from Ralph Castain:
>I think something may be messed up with your installation. I went ahead and 
>tested this on a Slurm 2.5.4 cluster, and got the following:
>
>$ time mpirun -np 1 --host bend001 ./hello
>Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12
>
>real 0m0.086s
>user 0m0.039s
>sys 0m0.046s
>
>$ time mpirun -np 1 --host bend002 ./hello
>Hello, World, I am 0 of 1 [0 local peers]: get_cpubind: 0 bitmap 0,12
>
>real 0m0.528s
>user 0m0.021s
>sys 0m0.023s
>
>Which is what I would have expected. With --host set to the local host, no 
>daemons are being launched and so the time is quite short (just spent mapping 
>and fork/exec). With --host set to a single remote host, you have the time it 
>takes Slurm to launch our daemon on the remote host, so you get about half of 
>a second.
>
>IIRC, you were having some problems with the OOB setup. If you specify the TCP 
>interface to use, does your time come down?
>
>
>On Aug 26, 2014, at 8:32 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>I'm using slurm 2.5.6
>>
>>$salloc -N8 --exclusive -J ompi -p test
>>$ srun hostname
>>node1-128-21
>>node1-128-24
>>node1-128-22
>>node1-128-26
>>node1-128-27
>>node1-128-20
>>node1-128-25
>>node1-128-23
>>$ time mpirun -np 1 --host node1-128-21 ./hello_c
>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
>>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo r

Re: [OMPI users] long initialization

2014-08-28 Thread Timur Ismagilov

I enclose 2 files with the output of the two following commands (OMPI 1.9a1r32570):
$time mpirun --leave-session-attached -mca oob_base_verbose 100 -np 1 ./hello_c 
>& out1.txt 
(Hello, world, I am )
real 1m3.952s
user 0m0.035s
sys 0m0.107s
$time mpirun --leave-session-attached -mca oob_base_verbose 100 --mca 
oob_tcp_if_include ib0 -np 1 ./hello_c >& out2.txt 
(no "Hello, world, I am" output)
real 0m9.337s
user 0m0.059s
sys 0m0.098s
Wed, 27 Aug 2014 06:31:02 -0700 from Ralph Castain:
>How bizarre. Please add "--leave-session-attached -mca oob_base_verbose 100" 
>to your cmd line
>
>On Aug 27, 2014, at 4:31 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>When i try to specify oob with --mca oob_tcp_if_include >from ifconfig>, i alwase get error:
>>$ mpirun  --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>>--
>>An ORTE daemon has unexpectedly failed after launch and before
>>communicating back to mpirun. This could be caused by a number
>>of factors, including an inability to create a connection back
>>to mpirun due to a lack of common network interfaces and/or no
>>route found between them. Please check network connectivity
>>(including firewalls and network routing requirements).
>>-
>>
>>Earlier, in ompi 1.8.1, I can not run mpi jobs without " --mca 
>>oob_tcp_if_include ib0 "... but now(ompi 1.9.a1) with this flag i get above 
>>error.
>>
>>Here is an output of ifconfig
>>$ ifconfig
>>eth1 Link encap:Ethernet HWaddr 00:15:17:EE:89:E1  
>>inet addr:10.0.251.53 Bcast:10.0.251.255 Mask:255.255.255.0
>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>RX packets:215087433 errors:0 dropped:0 overruns:0 frame:0
>>TX packets:2648 errors:0 dropped:0 overruns:0 carrier:0
>>collisions:0 txqueuelen:1000  
>>RX bytes:26925754883 (25.0 GiB) TX bytes:137971 (134.7 KiB)
>>Memory:b2c0-b2c2
>>eth2 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8  
>>inet addr:10.0.0.4 Bcast:10.0.0.255 Mask:255.255.255.0
>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>RX packets:4892833125 errors:0 dropped:0 overruns:0 frame:0
>>TX packets:8708606918 errors:0 dropped:0 overruns:0 carrier:0
>>collisions:0 txqueuelen:1000  
>>RX bytes:1823986502132 (1.6 TiB) TX bytes:11957754120037 (10.8 TiB)
>>eth2.911 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8  
>>inet addr:93.180.7.38 Bcast:93.180.7.63 Mask:255.255.255.224
>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>RX packets:3746454225 errors:0 dropped:0 overruns:0 frame:0
>>TX packets:1131917608 errors:0 dropped:3 overruns:0 carrier:0
>>collisions:0 txqueuelen:0  
>>RX bytes:285174723322 (265.5 GiB) TX bytes:11523163526058 (10.4 TiB)
>>eth3 Link encap:Ethernet HWaddr 00:02:C9:04:73:F9  
>>inet addr:10.2.251.14 Bcast:10.2.251.255 Mask:255.255.255.0
>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>RX packets:591156692 errors:0 dropped:56 overruns:56 frame:56
>>TX packets:679729229 errors:0 dropped:0 overruns:0 carrier:0
>>collisions:0 txqueuelen:1000  
>>RX bytes:324195989293 (301.9 GiB) TX bytes:770299202886 (717.3 GiB)
>>Ifconfig uses the ioctl access method to get the full address information, 
>>which limits hardware addresses to 8 bytes.
>>Because Infiniband address has 20 bytes, only the first 8 bytes are displayed 
>>correctly.
>>Ifconfig is obsolete! For replacement check ip.
>>ib0 Link encap:InfiniBand HWaddr 
>>80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
>>inet addr:10.128.0.4 Bcast:10.128.255.255 Mask:255.255.0.0
>>UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
>>RX packets:10843859 errors:0 dropped:0 overruns:0 frame:0
>>TX packets:8089839 errors:0 dropped:15 overruns:0 carrier:0
>>collisions:0 txqueuelen:1024  
>>RX bytes:939249464 (895.7 MiB) TX bytes:886054008 (845.0 MiB)
>>lo Link encap:Local Loopback  
>>inet addr:127.0.0.1 Mask:255.0.0.0
>>UP LOOPBACK RUNNING MTU:16436 Metric:1
>>RX packets:31235107 errors:0 dropped:0 overruns:0 frame:0
>>TX packets:31235107 errors:0 dropped:0 overruns:0 carrier:0
>>collisions:0 txqueuelen:0  
>>RX bytes:132750916041 (123.6 GiB) TX bytes:132750916041 (123.6 GiB)
>>
>>
>>
>>Tue, 26 Aug 2014 09:48:35 -0700 от Ralph Castain < r...@open-mpi.org >:
>>>I think something may be messed up with your installation. I went ahead and 
>>>tested this on a Slurm 2.5.4 cluster, and got the following:
>>>
>>>$ time mpirun -np 1 --host bend001 ./hello
>>>Hello, World,

Re: [OMPI users] long initialization

2014-08-28 Thread Timur Ismagilov

In OMPI 1.9a1r32604 I get much better results:
$ time mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI 
semenov@compiler-2 Distribution, ident: 1.9a1r32604, repo rev: r32604, Aug 26, 
2014 (nightly snapshot tarball), 146)
real 0m4.166s
user 0m0.034s
sys 0m0.079s


Thu, 28 Aug 2014 13:10:02 +0400 from Timur Ismagilov:
>I enclosure 2 files with output of two foloowing commands (OMPI 1.9a1r32570)
>$time mpirun --leave-session-attached -mca oob_base_verbose 100 -np 1 
>./hello_c >& out1.txt 
>(Hello, world, I am )
>real 1m3.952s
>user 0m0.035s
>sys 0m0.107s
>$time mpirun --leave-session-attached -mca oob_base_verbose 100 --mca 
>oob_tcp_if_include ib0 -np 1 ./hello_c >& out2.txt 
>(no Hello, word, I am )
>real 0m9.337s
>user 0m0.059s
>sys 0m0.098s
>Wed, 27 Aug 2014 06:31:02 -0700 от Ralph Castain :
>>How bizarre. Please add "--leave-session-attached -mca oob_base_verbose 100" 
>>to your cmd line
>>
>>On Aug 27, 2014, at 4:31 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>When i try to specify oob with --mca oob_tcp_if_include >>from ifconfig>, i alwase get error:
>>>$ mpirun  --mca oob_tcp_if_include ib0 -np 1 ./hello_c
>>>--
>>>An ORTE daemon has unexpectedly failed after launch and before
>>>communicating back to mpirun. This could be caused by a number
>>>of factors, including an inability to create a connection back
>>>to mpirun due to a lack of common network interfaces and/or no
>>>route found between them. Please check network connectivity
>>>(including firewalls and network routing requirements).
>>>-
>>>
>>>Earlier, in ompi 1.8.1, I can not run mpi jobs without " --mca 
>>>oob_tcp_if_include ib0 "... but now(ompi 1.9.a1) with this flag i get above 
>>>error.
>>>
>>>Here is an output of ifconfig
>>>$ ifconfig
>>>eth1 Link encap:Ethernet HWaddr 00:15:17:EE:89:E1  
>>>inet addr:10.0.251.53 Bcast:10.0.251.255 Mask:255.255.255.0
>>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>RX packets:215087433 errors:0 dropped:0 overruns:0 frame:0
>>>TX packets:2648 errors:0 dropped:0 overruns:0 carrier:0
>>>collisions:0 txqueuelen:1000  
>>>RX bytes:26925754883 (25.0 GiB) TX bytes:137971 (134.7 KiB)
>>>Memory:b2c0-b2c2
>>>eth2 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8  
>>>inet addr:10.0.0.4 Bcast:10.0.0.255 Mask:255.255.255.0
>>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>RX packets:4892833125 errors:0 dropped:0 overruns:0 frame:0
>>>TX packets:8708606918 errors:0 dropped:0 overruns:0 carrier:0
>>>collisions:0 txqueuelen:1000  
>>>RX bytes:1823986502132 (1.6 TiB) TX bytes:11957754120037 (10.8 TiB)
>>>eth2.911 Link encap:Ethernet HWaddr 00:02:C9:04:73:F8  
>>>inet addr:93.180.7.38 Bcast:93.180.7.63 Mask:255.255.255.224
>>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>RX packets:3746454225 errors:0 dropped:0 overruns:0 frame:0
>>>TX packets:1131917608 errors:0 dropped:3 overruns:0 carrier:0
>>>collisions:0 txqueuelen:0  
>>>RX bytes:285174723322 (265.5 GiB) TX bytes:11523163526058 (10.4 TiB)
>>>eth3 Link encap:Ethernet HWaddr 00:02:C9:04:73:F9  
>>>inet addr:10.2.251.14 Bcast:10.2.251.255 Mask:255.255.255.0
>>>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>>>RX packets:591156692 errors:0 dropped:56 overruns:56 frame:56
>>>TX packets:679729229 errors:0 dropped:0 overruns:0 carrier:0
>>>collisions:0 txqueuelen:1000  
>>>RX bytes:324195989293 (301.9 GiB) TX bytes:770299202886 (717.3 GiB)
>>>Ifconfig uses the ioctl access method to get the full address information, 
>>>which limits hardware addresses to 8 bytes.
>>>Because Infiniband address has 20 bytes, only the first 8 bytes are 
>>>displayed correctly.
>>>Ifconfig is obsolete! For replacement check ip.
>>>ib0 Link encap:InfiniBand HWaddr 
>>>80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00  
>>>inet addr:10.128.0.4 Bcast:10.128.255.255 Mask:255.255.0.0
>>>UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
>>>RX packets:10843859 errors:0 dropped:0 overruns:0 frame:0
>>>TX packets:8089839 errors:0 dropped:15 overruns:0 carrier:0
>>>collisions:0 txqueuelen:1024  
>>>RX bytes:939249464 (895.7 MiB) TX bytes:886054008 (845.0 MiB)
>>>lo Link encap:Local Loopback  
>

[OMPI users] open shmem optimization

2014-08-29 Thread Timur Ismagilov

Hello!
What parameters can I tune to increase performance (scalability) for my app
(an all-to-all pattern with message size = constant/nnodes)?
I can read this FAQ for MPI, but does it also apply to SHMEM?
I have 2 programs doing the same thing (with the same input): each node sends
messages (message size = constant/nnodes) to a random set of nodes (the same
set in prg1 and prg2):
*  prg1: with MPI_Isend, MPI_Irecv and MPI_Waitall
*  prg2: with shmem_put and shmem_barrier_all (see the sketch below)
On 1, 2, 4, 8, 16, 32 nodes they have the same performance (scalability);
on 64, 128, 256 nodes the SHMEM program stops scaling, but at 512 nodes the
SHMEM program gets much better performance than MPI.
nodes    prg1 (MPI)    prg2 (SHMEM)
         perf units    perf units
1        30            30
2        50            53
4        75            85
8        110           130
16       180           200
32       310           350
64       500           400   (strange)
128      830           400   (strange)
256      1350          600   (strange)
512      1770          2350  (wow!)

In ScalableSHMEM (OMPI 1.6.5?) I get the same scalability for both programs.
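For illustration, a minimal sketch of the second program's communication step; the buffer sizes, the data pattern and the peer selection below are placeholders, not the original code:

#include <string.h>
#include "shmem.h"

/* Every PE puts its block into a symmetric buffer on each peer,
   then a global barrier completes and orders all the puts. */
int main(void)
{
    int me, npes, peer;
    size_t msg_size;
    char *src, *dst;

    start_pes(0);
    me   = _my_pe();
    npes = _num_pes();

    msg_size = (1024 * 1024) / npes;           /* constant / nnodes (placeholder constant) */
    src = (char *) shmalloc(msg_size);
    dst = (char *) shmalloc(msg_size * npes);  /* symmetric landing buffer */
    memset(src, me & 0xff, msg_size);

    for (peer = 0; peer < npes; peer++) {      /* stand-in for the random peer set */
        shmem_putmem(dst + (size_t) me * msg_size, src, msg_size, peer);
    }
    shmem_barrier_all();                       /* all puts are complete after this */

    shfree(dst);
    shfree(src);
    return 0;
}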


[OMPI users] shmalloc error with >=512 mb

2014-11-17 Thread Timur Ismagilov

Hello!
Why does shmalloc return NULL when I try to allocate 512 MB?
When I try to allocate 256 MB, everything is fine.
I use Open MPI/SHMEM v1.8.4 rc1 (v1.8.3-202-gb568b6e).

Program:
#include <stdio.h>
#include <shmem.h>
int main(int argc, char **argv)
{
int *src;
start_pes(0);
int length = 1024*1024*512;
src = (int*) shmalloc(length);
  if (src == NULL) {
    printf("can not allocate src: size = %dMb\n ", length/(1024*1024));
  }
return 0;
}

Command:
$oshrun -np 1 ./example_shmem
can not allocate src: size = 512Mb

How can I increase the memory available to shmalloc?
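shmalloc allocates from the OpenSHMEM symmetric heap, which has a fixed default size, so a 512 MB request can exceed it even when the node has plenty of memory. A sketch of one way to enlarge the heap at launch time, assuming this oshmem build honors the SHMEM_SYMMETRIC_HEAP_SIZE environment variable (the exact knob name and accepted size syntax may differ between releases, so treat it as an assumption to verify):

$ oshrun -x SHMEM_SYMMETRIC_HEAP_SIZE=1G -np 1 ./example_shmem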



[OMPI users] MXM problem

2015-05-25 Thread Timur Ismagilov

Hello!

I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
OFED-1.5.4.1;
CentOS release 6.2;
infiniband 4x FDR



I have two problems:
1. I cannot use MXM:
1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca 
plm_rsh_no_tree_spawn 1 -np 4 ./hello
--  
     
A requested component was not found, or was unable to be opened.  This  
     
means that this component is either not installed or is unable to be
     
used on your system (e.g., sometimes this means that shared libraries   
     
that the component requires are unable to be found/loaded).  Note that  
     
Open MPI stopped checking at the first component that it did not find.  
     

     
Host:  node14   
     
Framework: pml  
     
Component: yalla
     
--  
     
*** An error occurred in MPI_Init   
     
--  
     
It looks like MPI_INIT failed for some reason; your parallel process is 
     
likely to abort.  There are many reasons that a parallel process can
     
fail during MPI_INIT; some of which are due to configuration or environment 
     
problems.  This failure appears to be an internal failure; here's some  
     
additional information (which may only be relevant to an Open MPI   
     
developer): 
     

     
  mca_pml_base_open() failed
     
  --> Returned "Not found" (-13) instead of "Success" (0)   
     
--  
     
*** on a NULL communicator  
     
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
     
***    and potentially your MPI job)
     
*** An error occurred in MPI_Init   
     
[node28:102377] Local abort before MPI_INIT completed successfully; not able to 
aggregate error messages,
 and not able to guarantee that all other processes were killed!
     
*** on a NULL communicator  
     
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
     
***    and potentially your MPI job)
     
[node29:105600] Local abort before MPI_INIT completed successfully; not able to 
aggregate error messages,
 and not able to guarantee that all other processes were killed!
     
*** An error occurred in MPI_Init   
     
*** on a NULL communicator  
     
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
     
***    and potentially your MPI job)
     
[node5:102409] Local abort before MPI_INIT completed successfully; not able to 
aggregate error messages, 
and not able to guarantee that all other processes were killed! 
     
*** An error occurred in MPI_Init   
     
*** on a NULL communicator  
     
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
     
***    and potentially your MPI job)
     
[

Re: [OMPI users] MXM problem

2015-05-25 Thread Timur Ismagilov
 I can password-less ssh to all nodes:
base$ ssh node1
node1$ssh node2
Last login: Mon May 25 18:41:23 
node2$ssh node3
Last login: Mon May 25 16:25:01
node3$ssh node4
Last login: Mon May 25 16:27:04
node4$

Is this correct?

In ompi-1.9 I do not have the no-tree-spawn problem.


Monday, May 25, 2015, 9:04 -07:00 from Ralph Castain:
>I can’t speak to the mxm problem, but the no-tree-spawn issue indicates that 
>you don’t have password-less ssh authorized between the compute nodes
>
>
>>On May 25, 2015, at 8:55 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>Hello!
>>
>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>>OFED-1.5.4.1;
>>CentOS release 6.2;
>>infiniband 4x FDR
>>
>>
>>
>>I have two problems:
>>1. I can not use mxm :
>>1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 -mca 
>>plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>--
>>   
>>A requested component was not found, or was unable to be opened.  This
>>   
>>means that this component is either not installed or is unable to be  
>>   
>>used on your system (e.g., sometimes this means that shared libraries 
>>   
>>that the component requires are unable to be found/loaded).  Note that
>>   
>>Open MPI stopped checking at the first component that it did not find.
>>   
>>  
>>   
>>Host:  node14 
>>   
>>Framework: pml
>>   
>>Component: yalla  
>>   
>>--
>>   
>>*** An error occurred in MPI_Init 
>>   
>>--
>>   
>>It looks like MPI_INIT failed for some reason; your parallel process is   
>>   
>>likely to abort.  There are many reasons that a parallel process can  
>>   
>>fail during MPI_INIT; some of which are due to configuration or environment   
>>   
>>problems.  This failure appears to be an internal failure; here's some
>>   
>>additional information (which may only be relevant to an Open MPI 
>>   
>>developer):   
>>   
>>  
>>   
>>  mca_pml_base_open() failed  
>>   
>>  --> Returned "Not found" (-13) instead of "Success" (0) 
>>   
>>--
>>   
>>*** on a NULL communicator
>>   
>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,  
>>   
>>***    and potentially your MPI job)  
>>   
>>*** An error occurred in MPI_Init 
>>   
>>[node28:102377] Local abort before MPI_INIT completed successfully; not able 
>>to aggregate error messages,
>> and not able to guarantee that all other processes were killed!  
>>   
>>*** on a NULL communicator
>>   
>>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,  
>>   
>>***    and potentially your MPI job)  
>>   
>

Re: [OMPI users] MXM problem

2015-05-25 Thread Timur Ismagilov

Hi, Mike,
here is what I have:
$ echo $LD_LIBRARY_PATH | tr ":" "\n"
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
   
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
     
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
   
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
 +intel compiler paths

$ echo $OPAL_PREFIX 
    
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8

I don't use LD_PRELOAD.

In the attached file (ompi_info.out) you will find the output of the ompi_info -l 9 command.

P.S.
node1 $ ./mxm_perftest
node2 $ ./mxm_perftest node1 -t send_lat
[1432568685.067067] [node151:87372:0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or directory. Won't use 
knem.          (I don't have knem)
[1432568685.069699] [node151:87372:0]  ib_dev.c:531  MXM  WARN  skipping 
device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox device 
(???)
Failed to create endpoint: No such device

$  ibv_devinfo     
hca_id: mlx4_0  
    transport:  InfiniBand (0)  
    fw_ver: 2.10.600    
    node_guid:  0002:c903:00a1:13b0     
    sys_image_guid: 0002:c903:00a1:13b3     
    vendor_id:  0x02c9  
    vendor_part_id: 4099    
    hw_ver: 0x0     
    board_id:   MT_1090120019   
    phys_port_cnt:  2   
    port:   1   
    state:  PORT_ACTIVE (4) 
    max_mtu:    4096 (5)    
    active_mtu: 4096 (5)    
    sm_lid: 1   
    port_lid:   83  
    port_lmc:   0x00    
    
    port:   2   
    state:  PORT_DOWN (1)   
    max_mtu:    4096 (5)    
    active_mtu: 4096 (5)    
    sm_lid: 0   
    port_lid:   0   
    port_lmc:   0x00    

Best regards,
Timur.


Monday, May 25, 2015, 19:39 +03:00 from Mike Dubman:
>Hi Timur,
>seems that yalla component was not found in your OMPI tree.
>can it be that your mpirun is not from hpcx? Can you please check 
>LD_LIBRARY_PATH,PATH, LD_PRELOAD and OPAL_PREFIX that it is pointing to the 
>right mpirun?
>
>Also, could you please check that yalla is present in the ompi_info -l 9 
>output?
>
>Thanks
>
>On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>I can password-less ssh to all nodes:
>>base$ ssh node1
>>node1$ssh node2
>>Last login: Mon May 25 18:41:23 
>>node2$ssh node3
>>Last login: Mon May 25 16:25:01
>>node3$ssh node4
>>Last login: Mon May 25 16:27:04
>>node4$
>>
>>Is this correct?
>>
>>In ompi-1.9 i do not have no-tree-spawn problem.
>>
>>
>>Понедельник, 25 мая 2015, 9:04 -07:00 от Ralph Castain < r...@open-mpi.org >:
>>
>>>I can’t speak to the mxm problem, but the no-tree-spawn issue indicates that 
>>>you don’t have password-less ssh authorized between the compute nodes
>>>
>>>
>>>>On May 25, 2015, at 8:55 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>Hello!
>>>>
>>>>I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
>>>>OFED-1.5.4.1;
>>>>CentOS release 6.2;
>>>>infiniband 4x FDR
>>>>
>>>>
>>>>
>>>>I have two problems:
>>>>1. I can not use mxm :
>>>>1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29 
>>>>-mca plm_rsh_no_tree_spawn 1 -np 4 ./hello
>>>>--  
>>>>     

[OMPI users] Fwd: Re[4]: MXM problem

2015-05-25 Thread Timur Ismagilov
 I did as you said, but got an error:

node1$ export MXM_IB_PORTS=mlx4_0:1
node1$  ./mxm_perftest  
  
Waiting for connection...   
     
Accepted connection from 10.65.0.253
     
[1432576262.370195] [node153:35388:0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or directory. Won't use 
knem.     
Failed to create endpoint: No such device   
     

node2$ export MXM_IB_PORTS=mlx4_0:1
node2$ ./mxm_perftest node1  -t send_lat
   
[1432576262.367523] [node158:99366:0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or directory. Won't use 
knem.
Failed to create endpoint: No such device




Monday, May 25, 2015, 20:31 +03:00 from Mike Dubman:
>scif is a OFA device from Intel.
>can you please select export MXM_IB_PORTS=mlx4_0:1 explicitly and retry
>
>On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>Hi, Mike,
>>that is what i have:
>>$ echo $LD_LIBRARY_PATH | tr ":" "\n"
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
>>   
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
>>     
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
>>   
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
>> +intel compiler paths
>>
>>$ echo $OPAL_PREFIX   
>>  
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>>
>>I don't use LD_PRELOAD.
>>
>>In the attached file(ompi_info.out) you will find the output of ompi_info -l 
>>9  command.
>>
>>P.S . 
>>node1 $ ./mxm_perftest
>>node2 $  ./mxm_perftest node1  -t send_lat
>>[1432568685.067067] [node151:87372:0] shm.c:65   MXM  WARN  Could not 
>>open the KNEM device file $t /dev/knem : No such file or directory. Won't use 
>>knem.          ( I don't have knem)
>>[1432568685.069699] [node151:87372:0]  ib_dev.c:531  MXM  WARN  skipping 
>>device scif0 (vendor_id/par$_id = 0x8086/0x0) - not a Mellanox device         
>>                       (???)
>>Failed to create endpoint: No such device
>>
>>$  ibv_devinfo     
>>hca_id: mlx4_0  
>>    transport:  InfiniBand (0)  
>>    fw_ver: 2.10.600    
>>    node_guid:  0002:c903:00a1:13b0     
>>    sys_image_guid: 0002:c903:00a1:13b3     
>>    vendor_id:  0x02c9  
>>    vendor_part_id: 4099    
>>    hw_ver: 0x0     
>>    board_id:   MT_1090120019   
>>    phys_port_cnt:  2   
>>    port:   1   
>>    state:  PORT_ACTIVE (4) 
>>    max_mtu:    4096 (5)    
>>    active_mtu: 4096 (5)    
>>    sm_lid: 1   
>>    port_lid:   83  
>>    port_lmc:   0x00    
>>    
>>    port:   2   
>>    state:  PORT_DOWN (1)   
>>    max_mtu:    4096 (5)    
>>    active_mtu: 4096 (5)    
>>    sm_lid: 0   
>>    port_lid:   0   
>>    port_lmc:   0x00    
>>
>>Best regards,
>>Timur.
>>
>>
>>Понедельник, 25 мая 2015, 19:39 +03:00 от Mike Dubman < 
>>mi...@dev.mellanox.co.il >:
>>

Re: [OMPI users] MXM problem

2015-05-26 Thread Timur Ismagilov
upport  MXM ?

NOTE: Please note that the 'yalla' pml is available only from Open MPI v1.9 and 
above
...
But here we have (or do we?) yalla in ompi 1.8.5.
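A quick way to check which mpirun the shell actually resolves and whether that install ships the yalla pml at all (a sketch; the component list in the ompi_info output varies by build):

$ which mpirun
$ $HPCX_MPI_DIR/bin/ompi_info | grep -i yalla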



Tuesday, May 26, 2015, 9:53 +03:00 from Mike Dubman:
>Hi Timur,
>
>Here it goes:
>
>wget  
>ftp://bgate.mellanox.com/hpc/hpcx/custom/v1.3/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64.tbz
>
>Please let me know if it works for you and will add 1.5.4.1 mofed to the 
>default distribution list.
>
>M
>
>
>On Mon, May 25, 2015 at 9:38 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>Thanks a lot .
>>
>>Понедельник, 25 мая 2015, 21:28 +03:00 от Mike Dubman < 
>>mi...@dev.mellanox.co.il >:
>>
>>>will send u the link tomorrow.
>>>
>>>On Mon, May 25, 2015 at 9:15 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>wrote:
>>>>Where can i find MXM for ofed 1.5.4.1?
>>>>
>>>>
>>>>Понедельник, 25 мая 2015, 21:11 +03:00 от Mike Dubman < 
>>>>mi...@dev.mellanox.co.il >:
>>>>
>>>>>btw, the ofed on your system is 1.5.4.1 while HPCx in use is for ofed 1.5.3
>>>>>
>>>>>seems like ABI issue between ofed versions
>>>>>
>>>>>On Mon, May 25, 2015 at 8:59 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>>>wrote:
>>>>>>I did as you said, but got an error:
>>>>>>
>>>>>>node1$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>node1$  ./mxm_perftest
>>>>>>    
>>>>>>Waiting for connection... 
>>>>>>   
>>>>>>Accepted connection from 10.65.0.253  
>>>>>>   
>>>>>>[1432576262.370195] [node153:35388:0] shm.c:65   MXM  WARN  Could 
>>>>>>not open the KNEM device file at /dev/knem : No such file or directory. 
>>>>>>Won't use knem.     
>>>>>>Failed to create endpoint: No such device 
>>>>>>   
>>>>>>
>>>>>>node2$ export MXM_IB_PORTS=mlx4_0:1
>>>>>>node2$ ./mxm_perftest node1  -t send_lat  
>>>>>>     
>>>>>>[1432576262.367523] [node158:99366:0] shm.c:65   MXM  WARN  Could 
>>>>>>not open the KNEM device file at /dev/knem : No such file or directory. 
>>>>>>Won't use knem.
>>>>>>Failed to create endpoint: No such device
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>Понедельник, 25 мая 2015, 20:31 +03:00 от Mike Dubman < 
>>>>>>mi...@dev.mellanox.co.il >:
>>>>>>>scif is a OFA device from Intel.
>>>>>>>can you please select export MXM_IB_PORTS=mlx4_0:1 explicitly and retry
>>>>>>>
>>>>>>>On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>>>>>wrote:
>>>>>>>>Hi, Mike,
>>>>>>>>that is what i have:
>>>>>>>>$ echo $LD_LIBRARY_PATH | tr ":" "\n"
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
>>>>>>>>   
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
>>>>>>>>     
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
>>>>>>>>   
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
>>>>>>>> +intel compiler paths
>>>>>>>>
>>>>>>>>$ echo $OPAL_PREFIX 
>>>>>>>>    
>>>>>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>>>>>>>>
>>>>>>>>I don't use LD_PRELOAD.
>>>>>>>>
>

Re: [OMPI users] MXM problem

2015-05-26 Thread Timur Ismagilov

It does not work on a single node:

1) host: $  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x 
MXM_SHM_KCOPY_MODE=off -host node5 -mca pml yalla -x MXM_TLS=ud,self,shm 
--prefix $HPCX_MPI_DIR -mca plm_base_verbose 5  -mca oob_base_verbose 10 -mca 
rml_base_verbose 10 --debug-daemons  -np 1 ./hello &>  yalla.out    
 
2) host: $  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x 
MXM_SHM_KCOPY_MODE=off -host node5  --mca pml cm --mca mtl mxm --prefix 
$HPCX_MPI_DIR -mca plm_base_verbose 5  -mca oob_base_verbose 10 -mca 
rml_base_verbose 10 --debug-daemons -np 1 ./hello &>  cm_mxm.out

I've attached the  yalla.out and  cm_mxm.out to this email.



Tuesday, May 26, 2015, 11:54 +03:00 from Mike Dubman:
>does it work from single node?
>could you please run with opts below and attach output?
>
> -mca plm_base_verbose 5  -mca oob_base_verbose 10 -mca rml_base_verbose 10 
>--debug-daemons
>
>On Tue, May 26, 2015 at 11:38 AM, Timur Ismagilov  < tismagi...@mail.ru > 
>wrote:
>>1. mxm_perf_test - OK.
>>2. no_tree_spawn  - OK.
>>3. ompi yalla and "--mca pml cm --mca mtl mxm" still  does not  work (I use 
>>prebuild ompi-1.8.5 from hpcx-v1.3.330)
>>3.a) host:$  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x 
>>MXM_SHM_KCOPY_MODE=off -host node5,node153  --mca pml cm --mca mtl mxm 
>>--prefix $HPCX_MPI_DIR ./hello
>>--
>>   
>>A requested component was not found, or was unable to be opened.  This
>>   
>>means that this component is either not installed or is unable to be  
>>   
>>used on your system (e.g., sometimes this means that shared libraries 
>>   
>>that the component requires are unable to be found/loaded).  Note that
>>   
>>Open MPI stopped checking at the first component that it did not find.
>>   
>>  
>>   
>>Host:  node153
>>   
>>Framework: mtl
>>   
>>Component: mxm
>>   
>>--
>>   
>>[node5:113560] PML cm cannot be selected  
>>   
>>--
>>   
>>No available pml components were found!   
>>   
>>  
>>   
>>This means that there are no components of this type installed on your
>>   
>>system or all the components reported that they could not be used.
>>   
>>  
>>   
>>This is a fatal error; your MPI process is likely to abort.  Check the
>>   
>>output of the "ompi_info" command and ensure that components of this  
>>   
>>type are available on your system.  You may also wish to check the
>>   
>>value of the "component_path" MCA parameter and ensure that it has at 
>>   
>>least one directory that contains valid MCA components.   
>>   
>>--
>>   
>>[node153:0] PML cm cannot be selected 
>>   
>>---   
>>   
>>Primary job  terminated normally, but 1 process returned  
>>   
>>a non-zero exit code.. Per user-direction, the jo

Re: [OMPI users] MXM problem

2015-05-28 Thread Timur Ismagilov

I'm sorry for the delay.

Here it is:
(I used a 5 min time limit)
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8/bin/mpirun
 -x 
LD_PRELOAD=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-
  redhat6.2-x86_64/mxm/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data -x 
MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off --mca pml yalla --hostfile 
hostlist ./hello 1> hello_debugMXM_n-2_ppn-2.out  
2>hello_debugMXM_n-2_ppn-2.err  
P.S.
yalla works fine with a rebuilt ompi: --with-mxm=$HPCX_MXM_DIR






Tuesday, May 26, 2015, 16:22 +03:00 from Alina Sklarevich 
:
>Hi Timur,
>
>HPCX has a debug version of MXM. Can you please add the following to your 
>command line with pml yalla in order to use it and attach the output? 
>"-x LD_PRELOAD=$HPCX_MXM_DIR/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data"
>
>Also, could you please attach the entire output of 
>"$HPCX_MPI_DIR/bin/ompi_info -a" 
>
>Thank you,
>Alina. 
>
>On Tue, May 26, 2015 at 3:39 PM, Mike Dubman  < mi...@dev.mellanox.co.il > 
>wrote:
>>Alina - could you please take a look?
>>Thx
>>
>>
>>-- Forwarded message --
>>From:  Timur Ismagilov < tismagi...@mail.ru >
>>Date: Tue, May 26, 2015 at 12:40 PM
>>Subject: Re[12]: [OMPI users] MXM problem
>>To: Open MPI Users < us...@open-mpi.org >
>>Cc: Mike Dubman < mi...@dev.mellanox.co.il >
>>
>>
>>It does not work for single node:
>>
>>1) host: $  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x 
>>MXM_SHM_KCOPY_MODE=off -host node5 -mca pml yalla -x MXM_TLS=ud,self,shm 
>>--prefix $HPCX_MPI_DIR -mca plm_base_verbose 5  -mca oob_base_verbose 10 -mca 
>>rml_base_verbose 10 --debug-daemons  -np 1 ./hello &>  yalla.out  
>>   
>>2) host: $  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x 
>>MXM_SHM_KCOPY_MODE=off -host node5  --mca pml cm --mca mtl mxm --prefix 
>>$HPCX_MPI_DIR -mca plm_base_verbose 5  -mca oob_base_verbose 10 -mca 
>>rml_base_verbose 10 --debug-daemons -np 1 ./hello &>  cm_mxm.out
>>
>>I've attached the  yalla.out and  cm_mxm.out to this email.
>>
>>
>>
>>Tuesday, May 26, 2015, 11:54 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il 
>>>:
>>>does it work from single node?
>>>could you please run with opts below and attach output?
>>>
>>> -mca plm_base_verbose 5  -mca oob_base_verbose 10 -mca rml_base_verbose 10 
>>>--debug-daemons
>>>
>>>On Tue, May 26, 2015 at 11:38 AM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>wrote:
>>>>1. mxm_perf_test - OK.
>>>>2. no_tree_spawn  - OK.
>>>>3. ompi yalla and "--mca pml cm --mca mtl mxm" still  does not  work (I use 
>>>>prebuild ompi-1.8.5 from hpcx-v1.3.330)
>>>>3.a) host:$  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x 
>>>>MXM_SHM_KCOPY_MODE=off -host node5,node153  --mca pml cm --mca mtl mxm 
>>>>--prefix $HPCX_MPI_DIR ./hello
>>>>--
>>>>A requested component was not found, or was unable to be opened.  This
>>>>means that this component is either not installed or is unable to be
>>>>used on your system (e.g., sometimes this means that shared libraries
>>>>that the component requires are unable to be found/loaded).  Note that
>>>>Open MPI stopped checking at the first component that it did not find.
>>>>
>>>>Host:  node153
>>>>Framework: mtl
>>>>Component: mxm
>>>>--
>>>>[node5:113560] PML cm cannot be selected

Re: [OMPI users] MXM problem

2015-05-28 Thread Timur Ismagilov

Is it normal to have to rebuild Open MPI from HPCX?
Why don't the prebuilt binaries work?



Thursday, May 28, 2015, 14:01 +03:00 from Alina Sklarevich 
:
>Thank you for this info.
>
>If 'yalla' now works for you, is there anything that is still wrong?
>
>Thanks,
>Alina.
>
>On Thu, May 28, 2015 at 10:21 AM, Timur Ismagilov  < tismagi...@mail.ru > 
>wrote:
>>I'm sorry for the delay .
>>
>>Here it is:
>>( I used 5 min  time limit )
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8/bin/mpirun
>> -x 
>>LD_PRELOAD=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-
>>  redhat6.2-x86_64/mxm/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data -x 
>>MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off --mca pml yalla --hostfile 
>>hostlist ./hello 1> hello_debugMXM_n-2_ppn-2.out  
>>2>hello_debugMXM_n-2_ppn-2.err  
>>P.S.
>>yalla works fine with a rebuilt ompi: --with-mxm=$HPCX_MXM_DIR
>>
>>
>>
>>
>>
>>
>>Tuesday, May 26, 2015, 16:22 +03:00 from Alina Sklarevich < 
>>ali...@dev.mellanox.co.il >:
>>>Hi Timur,
>>>
>>>HPCX has a debug version of MXM. Can you please add the following to your 
>>>command line with pml yalla in order to use it and attach the output? 
>>>"-x LD_PRELOAD=$HPCX_MXM_DIR/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data"
>>>
>>>Also, could you please attach the entire output of 
>>>"$HPCX_MPI_DIR/bin/ompi_info -a" 
>>>
>>>Thank you,
>>>Alina. 
>>>
>>>On Tue, May 26, 2015 at 3:39 PM, Mike Dubman  < mi...@dev.mellanox.co.il > 
>>>wrote:
>>>>Alina - could you please take a look?
>>>>Thx
>>>>
>>>>
>>>>-- Forwarded message --
>>>>From:  Timur Ismagilov < tismagi...@mail.ru >
>>>>Date: Tue, May 26, 2015 at 12:40 PM
>>>>Subject: Re[12]: [OMPI users] MXM problem
>>>>To: Open MPI Users < us...@open-mpi.org >
>>>>Cc: Mike Dubman < mi...@dev.mellanox.co.il >
>>>>
>>>>
>>>>It does not work for single node:
>>>>
>>>>1) host: $  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x 
>>>>MXM_SHM_KCOPY_MODE=off -host node5 -mca pml yalla -x MXM_TLS=ud,self,shm 
>>>>--prefix $HPCX_MPI_DIR -mca plm_base_verbose 5  -mca oob_base_verbose 10 
>>>>-mca rml_base_verbose 10 --debug-daemons  -np 1 ./hello &>  yalla.out   
>>>>  
>>>>2) host: $  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x 
>>>>MXM_SHM_KCOPY_MODE=off -host node5  --mca pml cm --mca mtl mxm --prefix 
>>>>$HPCX_MPI_DIR -mca plm_base_verbose 5  -mca oob_base_verbose 10 -mca 
>>>>rml_base_verbose 10 --debug-daemons -np 1 ./hello &>  cm_mxm.out
>>>>
>>>>I've attached the  yalla.out and  cm_mxm.out to this email.
>>>>
>>>>
>>>>
>>>>Tuesday, May 26, 2015, 11:54 +03:00 from Mike Dubman < 
>>>>mi...@dev.mellanox.co.il >:
>>>>>does it work from single node?
>>>>>could you please run with opts below and attach output?
>>>>>
>>>>> -mca plm_base_verbose 5  -mca oob_base_verbose 10 -mca rml_base_verbose 
>>>>>10 --debug-daemons
>>>>>
>>>>>On Tue, May 26, 2015 at 11:38 AM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>>>wrote:
>>>>>>1. mxm_perf_test - OK.
>>>>>>2. no_tree_spawn  - OK.
>>>>>>3. ompi yalla and "--mca pml cm --mca mtl mxm" still  does not  work (I 
>>>>>>use prebuild ompi-1.8.5 from hpcx-v1.3.330)
>>>>>>3.a) host:$  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x 
>>>>>>MXM_SHM_KCOPY_MODE=off -host node5,node153  --mca pml cm --mca mtl mxm 
>>>>>>--prefix $HPCX_MPI_DIR ./hello
>>>>>>--
>>>>>>A requested component was not found, or was unable to be opened.  This
>>>>>>means that this component is either not installed or is unable to be
>>>>>>used on your system (e.g., sometimes this means that shared libraries

Re: [OMPI users] Fwd[2]: OMPI yalla vs impi

2015-06-02 Thread Timur Ismagilov

Hi, Mike!
I have Intel MPI v4.1.2 (below: impi).
I built ompi 1.8.5 with MXM and hcoll (below: ompi_yalla).
I built ompi 1.8.5 without MXM and hcoll (below: ompi_clear).
I ran the OSU p2p benchmark osu_mbw_mr with these MPIs.
You can find the benchmark results in the attached file (mvs10p_mpi.xls, sheet 
osu_mbr_mr).

On 64 nodes (and 1024 MPI processes) ompi_yalla gets about 2x worse performance 
than ompi_clear.
Does MXM with yalla reduce p2p performance compared with ompi_clear (and impi)?
Am I doing something wrong?
P.S. My colleague Alexander Semenov is in CC
Best regards,
Timur

Thursday, May 28, 2015, 20:02 +03:00 from Mike Dubman :
>it is not apples-to-apples comparison.
>
>yalla/mxm is point-to-point library, it is not collective library.
>collective algorithm happens on top of yalla.
>
>Intel collective algorithm for a2a is better than OMPI built-in collective 
>algorithm.
>
>To see benefit of yalla - you should run p2p benchmarks (osu_lat/bw/bibw/mr)
>
>
>On Thu, May 28, 2015 at 7:35 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>I compare ompi-1.8.5 (hpcx-1.3.3-icc) with impi v 4.1.4.
>>
>>I build ompi with MXM but without HCOLL and without  knem (I work on it). 
>>Configure options are:
>> ./configure  --prefix=my_prefix   
>>--with-mxm=path/to/hpcx/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/mxm   
>>--with-platform=contrib/platform/mellanox/optimized
>>
>>As a result of the IMB-MPI1 Alltoall test, I got disappointing results: for 
>>most message sizes on 64 nodes and 16 processes per node, impi is much (~40%) 
>>better.
>>
>>You can look at the results in the file "mvs10p_mpi.xlsx", I attach it. 
>>System configuration is also there.
>>
>>What do you think about? Is there any way to improve ompi yalla performance 
>>results?
>>
>>I attach the output of  "IMB-MPI1 Alltoall" for yalla and impi.
>>
>>P.S. My colleague Alexander Semenov is in CC
>>
>>Best regards,
>>Timur
>
>
>-- 
>
>Kind Regards,
>
>M.




mvs10p_mpi.xlsx
Description: MS-Excel 2007 spreadsheet


Re: [OMPI users] OMPI yalla vs impi

2015-06-03 Thread Timur Ismagilov

1. Here is my ompi_yalla command line:
$HPCX_MPI_DIR/bin/mpirun -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx4_0:1 -x 
MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off --mca pml yalla --hostfile 
hostlist $@
echo $HPCX_MPI_DIR
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-fca-v1.8.5
This MPI was configured with: --with-mxm=/path/to/mxm --with-hcoll=/path/to/hcoll 
--with-platform=contrib/platform/mellanox/optimized 
--prefix=/path/to/ompi-mellanox-fca-v1.8.5

ompi_clear command line:
$HPCX_MPI_DIR/bin/mpirun --hostfile hostlist $@
echo $HPCX_MPI_DIR
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-clear-v1.8.5
This MPI was configured with: --with-platform=contrib/platform/mellanox/optimized 
--prefix=/path/to/ompi-clear-v1.8.5
2. I will run ompi_yalla with "-x MXM_TLS=self,shm,rc" and I will send you 
results in a few days.
3. I have already run ompi_yalla without hcoll in the IMB Alltoall test; hcoll 
provides a performance boost of about 10%. You can find these results in 
mvs10p_mpi.xls, sheet IMB_MPI1 Alltoall. (A rough sketch of that kind of 
Alltoall timing loop follows below.)


Wednesday, June 3, 2015, 10:29 +03:00 from Alina Sklarevich 
:
>Hello Timur,
>
>I will review your results and try to reproduce them in our lab.
>
>You are using an old OFED - OFED-1.5.4.1 and we suspect that this may be 
>causing the performance issues you are seeing.
>
>In the meantime, could you please:
>
>1. send us the exact command lines that you were running when you got these 
>results?
>
>2. add the following to the command line that you are running with 'pml yalla' 
>and attach the results?
>"-x MXM_TLS=self,shm,rc"
>
>3. run your command line with yalla and without hcoll?
>
>Thanks,
>Alina.
>
>
>
>On Tue, Jun 2, 2015 at 4:56 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>Hi, Mike!
>>I have impi v 4.1.2 (- impi)
>>I build ompi 1.8.5 with MXM and hcoll (- ompi_yalla)
>>I build ompi 1.8.5 without MXM and hcoll (- ompi_clear)
>>I start osu p2p: osu_mbr_mr test with this MPIs.
>>You can find the result of benchmark in attached file(mvs10p_mpi.xls: list 
>>osu_mbr_mr)
>>
>>On 64 nodes (and 1024 mpi processes) ompi_yalla get 2 time worse perf than 
>>ompi_clear.
>>Is mxm with yalla  reduces performance in p2p  compared with ompi_clear(and 
>>impi)?
>>Am  I  doing something wrong?
>>P.S. My colleague Alexander Semenov is in CC
>>Best regards,
>>Timur
>>
>>Thursday, May 28, 2015, 20:02 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il 
>>>:
>>>it is not apples-to-apples comparison.
>>>
>>>yalla/mxm is point-to-point library, it is not collective library.
>>>collective algorithm happens on top of yalla.
>>>
>>>Intel collective algorithm for a2a is better than OMPI built-in collective 
>>>algorithm.
>>>
>>>To see benefit of yalla - you should run p2p benchmarks (osu_lat/bw/bibw/mr)
>>>
>>>
>>>On Thu, May 28, 2015 at 7:35 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>wrote:
>>>>I compare ompi-1.8.5 (hpcx-1.3.3-icc) with impi v 4.1.4.
>>>>
>>>>I build ompi with MXM but without HCOLL and without  knem (I work on it). 
>>>>Configure options are:
>>>> ./configure  --prefix=my_prefix   
>>>>--with-mxm=path/to/hpcx/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/mxm 
>>>>  --with-platform=contrib/platform/mellanox/optimized
>>>>
>>>>As a result of the IMB-MPI1 Alltoall test, I have got disappointing  
>>>>results: for the most message sizes on 64 nodes and 16 processes per  node 
>>>>impi is much (~40%) better.
>>>>
>>>>You can look at the results in the file "mvs10p_mpi.xlsx", I attach it. 
>>>>System configuration is also there.
>>>>
>>>>What do you think about? Is there any way to improve ompi yalla performance 
>>>>results?
>>>>
>>>>I attach the output of  "IMB-MPI1 Alltoall" for yalla and impi.
>>>>
>>>>P.S. My colleague Alexander Semenov is in CC
>>>>
>>>>Best regards,
>>>>Timur
>>>
>>>
>>>-- 
>>>
>>>Kind Regards,
>>>
>>>M.
>>
>>
>>
>>___
>>users mailing list
>>us...@open-mpi.org
>>Subscription:  http://www.open-mpi.org/mailman/listinfo.cgi/users
>>Link to this post:  
>>http://www.open-mpi.org/community/lists/users/2015/06/27029.php
>





mvs10p_mpi.xlsx
Description: MS-Excel 2007 spreadsheet


Re: [OMPI users] Fwd[2]: OMPI yalla vs impi

2015-06-04 Thread Timur Ismagilov

Hello, Alina.
1. Here is my ompi_yalla command line:
$HPCX_MPI_DIR/bin/mpirun -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx4_0:1 -x 
MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off --mca pml yalla --hostfile 
hostlist $@
echo $HPCX_MPI_DIR
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-fca-v1.8.5
This MPI was configured with: --with-mxm=/path/to/mxm --with-hcoll=/path/to/hcoll 
--with-platform=contrib/platform/mellanox/optimized 
--prefix=/path/to/ompi-mellanox-fca-v1.8.5

ompi_clear command line:
$HPCX_MPI_DIR/bin/mpirun --hostfile hostlist $@
echo $HPCX_MPI_DIR
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-clear-v1.8.5
This MPI was configured with: --with-platform=contrib/platform/mellanox/optimized 
--prefix=/path/to/ompi-clear-v1.8.5
2. When I run osu_mbw_mr with "-x MXM_TLS=self,shm,rc", it fails with a 
segmentation fault:
the stdout log is in the attached file osu_mbw_mr_n-2_ppn-16.out;
the stderr log is in the attached file osu_mbw_mr_n-2_ppn-16.err;
cmd line:
$HPCX_MPI_DIR/bin/mpirun -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx4_0:1 -x 
MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off --mca pml yalla -x 
MXM_TLS=self,shm,rc --hostfile hostlist osu_mbw_mr -v -r=0
I have changed WINDOW_SIZES in osu_mbw_mr.c (a simplified sketch of the 
windowed loop these sizes control follows after this list):
#define WINDOW_SIZES {8, 16, 32, 64, 128, 256, 512, 1024}
3. I added the results of running osu_mbw_mr with yalla and without hcoll on 32 
and 64 nodes (512 and 1024 MPI procs) to mvs10p_mpi.xls, sheet osu_mbr_mr.
The results are about 20 percent lower than the old results (with hcoll).
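
For reference, WINDOW_SIZES controls how many non-blocking sends each sender 
posts per pair before waiting for the whole window. The simplified 2-rank 
sketch below shows that windowed pattern; the message size, iteration count 
and the short ack step are my assumptions for illustration, this is not the 
actual OSU benchmark source.

  /* windowed_bw.c - simplified 2-rank sketch of the osu_mbw_mr windowed
   * pattern: the sender posts WINDOW non-blocking sends of SIZE bytes,
   * then waits for the whole window; bandwidth ~ SIZE*WINDOW*ITERS/time.
   * Illustrative only, not the actual OSU benchmark code. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define SIZE   (1 << 20)   /* 1 MiB messages (assumed) */
  #define WINDOW 64          /* one of the WINDOW_SIZES entries above */
  #define ITERS  100

  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Request req[WINDOW];
      char *buf = malloc(SIZE);

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      if (size != 2) {
          if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
          MPI_Abort(MPI_COMM_WORLD, 1);
      }

      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      for (int it = 0; it < ITERS; it++) {
          /* post a full window of non-blocking sends/recvs, then wait */
          for (int i = 0; i < WINDOW; i++) {
              if (rank == 0)
                  MPI_Isend(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[i]);
              else
                  MPI_Irecv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[i]);
          }
          MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
          /* osu_mbw_mr ends each window with a short ack back to the sender */
          if (rank == 0)
              MPI_Recv(buf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          else
              MPI_Send(buf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
      }
      if (rank == 0)
          printf("bandwidth: %.2f MB/s\n",
                 (double)SIZE * WINDOW * ITERS / (MPI_Wtime() - t0) / 1e6);

      free(buf);
      MPI_Finalize();
      return 0;
  }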



Wednesday, June 3, 2015, 10:29 +03:00 from Alina Sklarevich 
:
>Hello Timur,
>
>I will review your results and try to reproduce them in our lab.
>
>You are using an old OFED - OFED-1.5.4.1 and we suspect that this may be 
>causing the performance issues you are seeing.
>
>In the meantime, could you please:
>
>1. send us the exact command lines that you were running when you got these 
>results?
>
>2. add the following to the command line that you are running with 'pml yalla' 
>and attach the results?
>"-x MXM_TLS=self,shm,rc"
>
>3. run your command line with yalla and without hcoll?
>
>Thanks,
>Alina.
>
>
>
>On Tue, Jun 2, 2015 at 4:56 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>Hi, Mike!
>>I have impi v 4.1.2 (- impi)
>>I build ompi 1.8.5 with MXM and hcoll (- ompi_yalla)
>>I build ompi 1.8.5 without MXM and hcoll (- ompi_clear)
>>I start osu p2p: osu_mbr_mr test with this MPIs.
>>You can find the result of benchmark in attached file(mvs10p_mpi.xls: list 
>>osu_mbr_mr)
>>
>>On 64 nodes (and 1024 mpi processes) ompi_yalla get 2 time worse perf than 
>>ompi_clear.
>>Is mxm with yalla  reduces performance in p2p  compared with ompi_clear(and 
>>impi)?
>>Am  I  doing something wrong?
>>P.S. My colleague Alexander Semenov is in CC
>>Best regards,
>>Timur
>>
>>Thursday, May 28, 2015, 20:02 +03:00 from Mike Dubman < mi...@dev.mellanox.co.il 
>>>:
>>>it is not apples-to-apples comparison.
>>>
>>>yalla/mxm is point-to-point library, it is not collective library.
>>>collective algorithm happens on top of yalla.
>>>
>>>Intel collective algorithm for a2a is better than OMPI built-in collective 
>>>algorithm.
>>>
>>>To see benefit of yalla - you should run p2p benchmarks (osu_lat/bw/bibw/mr)
>>>
>>>
>>>On Thu, May 28, 2015 at 7:35 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>wrote:
>>>>I compare ompi-1.8.5 (hpcx-1.3.3-icc) with impi v 4.1.4.
>>>>
>>>>I build ompi with MXM but without HCOLL and without  knem (I work on it). 
>>>>Configure options are:
>>>> ./configure  --prefix=my_prefix   
>>>>--with-mxm=path/to/hpcx/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/mxm 
>>>>  --with-platform=contrib/platform/mellanox/optimized
>>>>
>>>>As a result of the IMB-MPI1 Alltoall test, I have got disappointing  
>>>>results: for the most message sizes on 64 nodes and 16 processes per  node 
>>>>impi is much (~40%) better.
>>>>
>>>>You can look at the results in the file "mvs10p_mpi.xlsx", I attach it. 
>>>>System configuration is also there.
>>>>
>>>>What do you think about? Is there any way to improve ompi yalla performance 
>>>>results?
>>>>
>>>>I attach the output of  "IMB-MPI1 Alltoall" for yalla and impi.
>>>>
>>>>P.S. My colleague Alexander Semenov is in CC
>>>>
>>>>Best regards,
>>>>Timur
>>>
>>>
>>>-- 
>>>
>>>Kind Regards,
>>>
>>>M.
>>
>>
>>
>>___
>>users mailing list
>>us...@open-mpi.org
>>Subscription:  http://www.open-mpi.org/mailman/listinfo.cgi/users
>>Link to this post:  
>>http://www.open-mpi.org/community/lists/users/2015/06/27029.php
>





osu_mbw_mr_n-2_ppn-16.out
Description: Binary data


osu_mbw_mr_n-2_ppn-16.err
Description: Binary data


mvs10p_mpi.xlsx
Description: MS-Excel 2007 spreadsheet


Re: [OMPI users] Fwd[2]: OMPI yalla vs impi

2015-06-16 Thread Timur Ismagilov
 Hello, Alina!

If I used --map-by node I would get only intra-node communication in osu_mbw_mr, 
so I use --map-by core instead.

I have 2 nodes, each node has 2 sockets with 8 cores per socket.

When I run osu_mbw_mr on 2 nodes with 32 MPI procs (see the commands below), I 
expect this test to report the unidirectional bandwidth of a 4xFDR link.

With IntelMPI I get 6367 MB/s.
With ompi_yalla I get about 3744 MB/s (the problem: this is half of the impi result).
With OpenMPI without MXM (ompi_clear) I get 6321 MB/s.

How can I increase yalla results?

IntelMPI cmd:
/opt/software/intel/impi/4.1.0.030/intel64/bin/mpiexec.hydra  -machinefile 
machines.pYAvuK -n 32 -binding domain=core  
../osu_impi/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr -v -r=0

ompi_yalla cmd:
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-fca-v1.8.5/bin/mpirun
  -report-bindings -display-map -mca coll_hcoll_enable 1 -x  
HCOLL_MAIN_IB=mlx4_0:1 -x MXM_IB_PORTS=mlx4_0:1 -x  MXM_SHM_KCOPY_MODE=off 
--mca pml yalla --map-by core --bind-to core  --hostfile hostlist  
../osu_ompi_hcoll/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr -v  -r=0

ompi_clear cmd:
/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-clear-v1.8.5/bin/mpirun
  -report-bindings -display-map --hostfile hostlist --map-by core  --bind-to 
core  ../osu_ompi_clear/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr -v  
-r=0

I have attached the output files to this email:
ompi_clear.out, ompi_clear.err - contain the ompi_clear results
ompi_yalla.out, ompi_yalla.err - contain the ompi_yalla results
impi.out, impi.err - contain the Intel MPI results

Best regards,
Timur

Sunday, June 7, 2015, 16:11 +03:00 from Alina Sklarevich 
:
>Hi Timur,
>
>After running the osu_mbw_mr benchmark in our lab, we observed that the 
>binding policy made a difference in performance.
>Can you please rerun your ompi tests with the following added to your command 
>line? (one of them in each run)
>
>1. --map-by node --bind-to socket
>2. --map-by node --bind-to core
>
>Please attach your results.
>
>Thank you,
>Alina.
>
>On Thu, Jun 4, 2015 at 6:53 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>Hello, Alina.
>>1. Here is my 
>>ompi_yalla command line:
>>$HPCX_MPI_DIR/bin/mpirun -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx4_0:1 
>>-x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off --mca pml yalla --hostfile 
>>hostlist $@
>>echo $HPCX_MPI_DIR 
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/
>> ompi-mellanox-fca-v1.8.5
>>This mpi was configured with: --with-mxm=/path/to/mxm 
>>--with-hcoll=/path/to/hcoll 
>>--with-platform=contrib/platform/mellanox/optimized --prefix=/path/to/ 
>>ompi-mellanox-fca-v1.8.5
>>ompi_clear command line:
>>HPCX_MPI_DIR/bin/mpirun  --hostfile hostlist $@
>>echo $HPCX_MPI_DIR 
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/
>> ompi-clear-v1.8.5
>>This mpi was configured with: 
>>--with-platform=contrib/platform/mellanox/optimized --prefix=/path/to 
>>/ompi-clear-v1.8.5
>>2. When i run osu_mbr_mr with key "-x MXM_TLS=self,shm,rc" . It fails with 
>>segmentation fault : 
>>stdout log is in attached file osu_mbr_mr_n-2_ppn-16.out; 
>>stderr log is in attached file osu_mbr_mr_n-2_ppn-16.err;
>>cmd line:
>>$HPCX_MPI_DIR/bin/mpirun -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx4_0:1 
>>-x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off --mca pml yalla -x 
>>MXM_TLS=self,shm,rc --hostfile hostlist osu_mbw_mr -v -r=0
>>osu_mbw_mr.c
>>I have changed WINDOW_SIZES in osu_mbw_mr.c:
>>#define WINDOW_SIZES {8, 16, 32, 64,  128, 256, 512, 1024 }  
>>3. I add results of running osu_mbw_mr with yalla and without hcoll on 32 and 
>>64 nodes (512 and 1024 mpi procs
>>) to  mvs10p_mpi.xls : list osu_mbr_mr.
>>The results are 20 percents smaller than old results (with hcoll).
>>
>>
>>
>>Wednesday, June 3, 2015, 10:29 +03:00 from Alina Sklarevich < 
>>ali...@dev.mellanox.co.il >:
>>>Hello Timur,
>>>
>>>I will review your results and try to reproduce them in our lab.
>>>
>>>You are using an old OFED - OFED-1.5.4.1 and we suspect that this may be 
>>>causing the performance issues you are seeing.
>>>
>>>In the meantime, could you please:
>>>
>>>1. send us the exact command lines that you were running when you got these 
>>>results?
>>>
>>>2. add the following to the command line that you are running with 'pml 
>>>yalla' and attach the results?
>>>"-x MXM_TLS=s

Re: [OMPI users] Fwd[2]: OMPI yalla vs impi

2015-06-16 Thread Timur Ismagilov

With '--bind-to socket' I get the same result as with '--bind-to core': 3813 MB/s.
I have attached the ompi_yalla_socket.out and ompi_yalla_socket.err files to 
this email.


Tuesday, June 16, 2015, 18:15 +03:00 from Alina Sklarevich 
:
>Hi Timur,
>
>Can you please try running your  ompi_yalla cmd with ' --bind-to socket' 
>(instead of binding to core) and check if it affects the results?
>We saw that it made a difference on the performance in our lab so that's why I 
>asked you to try the same.
>
>Thanks,
>Alina.
>
>On Tue, Jun 16, 2015 at 5:53 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>Hello, Alina!
>>
>>If I use  --map-by node I will get only intranode communications on 
>>osu_mbw_mr. I use --map-by core instead.
>>
>>I have 2 nodes, each node has 2 sockets with 8 cores per socket.
>>
>>When I run osu_mbw_mr on 2 nodes with 32 MPI procs (command see below), I  
>>expect to see the unidirectional bandwidth of 4xFDR  link as a result  of 
>>this test.
>>
>>With IntelMPI I get 6367 MB/s, 
>>With ompi_yalla I get about 3744 MB/s (problem: it is a half of impi result)
>>With openmpi without mxm (ompi_clear) I get 6321 MB/s.
>>
>>How can I increase yalla results?
>>
>>IntelMPI cmd:
>>/opt/software/intel/impi/ 4.1.0.030/intel64/bin/mpiexec.hydra   -machinefile 
>>machines.pYAvuK -n 32 -binding domain=core  
>>../osu_impi/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr -v -r=0
>>
>>ompi_yalla cmd:
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-fca-v1.8.5/bin/mpirun
>>  -report-bindings -display-map -mca coll_hcoll_enable 1 -x  
>>HCOLL_MAIN_IB=mlx4_0:1 -x MXM_IB_PORTS=mlx4_0:1 -x  
>>MXM_SHM_KCOPY_MODE=off --mca pml yalla --map-by core --bind-to core  
>>--hostfile hostlist  
>>../osu_ompi_hcoll/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr -v  -r=0
>>
>>ompi_clear cmd:
>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-clear-v1.8.5/bin/mpirun
>>  -report-bindings -display-map --hostfile hostlist --map-by core  --bind-to 
>>core  ../osu_ompi_clear/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr -v  
>>-r=0
>>
>>I have attached output files to this letter:
>>ompi_clear.out, ompi_clear.err - contains ompi_clear results
>>ompi_yalla.out, ompi_yalla.err - contains ompi_yalla results
>>impi.out, impi.err - contains intel MPI results
>>
>>Best regards,
>>Timur
>>
>>Sunday, June 7, 2015, 16:11 +03:00 from Alina Sklarevich < 
>>ali...@dev.mellanox.co.il >:
>>>Hi Timur,
>>>
>>>After running the osu_mbw_mr benchmark in our lab, we observed that the 
>>>binding policy made a difference in performance.
>>>Can you please rerun your ompi tests with the following added to your 
>>>command line? (one of them in each run)
>>>
>>>1. --map-by node --bind-to socket
>>>2. --map-by node --bind-to core
>>>
>>>Please attach your results.
>>>
>>>Thank you,
>>>Alina.
>>>
>>>On Thu, Jun 4, 2015 at 6:53 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>wrote:
>>>>Hello, Alina.
>>>>1. Here is my 
>>>>ompi_yalla command line:
>>>>$HPCX_MPI_DIR/bin/mpirun -mca coll_hcoll_enable 1 -x HCOLL_MAIN_IB=mlx4_0:1 
>>>>-x MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off --mca pml yalla 
>>>>--hostfile hostlist $@
>>>>echo $HPCX_MPI_DIR 
>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/
>>>> ompi-mellanox-fca-v1.8.5
>>>>This mpi was configured with: --with-mxm=/path/to/mxm 
>>>>--with-hcoll=/path/to/hcoll 
>>>>--with-platform=contrib/platform/mellanox/optimized --prefix=/path/to/ 
>>>>ompi-mellanox-fca-v1.8.5
>>>>ompi_clear command line:
>>>>HPCX_MPI_DIR/bin/mpirun  --hostfile hostlist $@
>>>>echo $HPCX_MPI_DIR 
>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/
>>>> ompi-clear-v1.8.5
>>>>This mpi was configured with: 
>>>>--with-platform=contrib/platform/mellanox/optimized --prefix=/path/to 
>>>>/ompi-clear-v1.8.5
>>>>2. When i run osu_mbr_mr with key "-x MXM_TLS=self,shm,rc" . It fails with 
>>>>segmentation fault : 
>>>>stdout log is in attached file osu_mbr_mr_n-2_ppn-16.out; 
>>>>stderr log is in attached file osu_mbr_mr_n-2_ppn-16.e

Re: [OMPI users] Fwd[2]: OMPI yalla vs impi

2015-06-19 Thread Timur Ismagilov

Hello, Alina!

I use "OSU MPI Multiple Bandwidth / Message Rate Test v4.4.1". 
I downloaded it from the website: http://mvapich.cse.ohio-state.edu/benchmarks/
I have attached "osu_mbw_mr.c" to this letter.
Best regards,
Timur

Thursday, June 18, 2015, 18:23 +03:00 from Alina Sklarevich 
:
>Hi Timur,
>
>Can you please tell me which osu version you are using?
>Unless it is from HPCX, please attach the source file of osu_mbw_mr.c you are 
>using.
>
>Thank you,
>Alina.
>
>On Tue, Jun 16, 2015 at 7:10 PM, Timur Ismagilov  < tismagi...@mail.ru > wrote:
>>I'm sorry, I forgot to attach the results.
>>With '--bind-to socket' I get the same result as with '--bind-to core': 3813 
>>MB/s.
>>I have attached the ompi_yalla_socket.out and ompi_yalla_socket.err files to 
>>this email.
>>
>>Best regards,
>>Timur
>>
>>
>>Tuesday, June 16, 2015, 18:15 +03:00 from Alina Sklarevich < 
>>ali...@dev.mellanox.co.il >:
>>>Hi Timur,
>>>
>>>Can you please try running your  ompi_yalla cmd with ' --bind-to socket' 
>>>(instead of binding to core) and check if it affects the results?
>>>We saw that it made a difference on the performance in our lab so that's why 
>>>I asked you to try the same.
>>>
>>>Thanks,
>>>Alina.
>>>
>>>On Tue, Jun 16, 2015 at 5:53 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>wrote:
>>>>Hello, Alina!
>>>>
>>>>If I use  --map-by node I will get only intranode communications on 
>>>>osu_mbw_mr. I use --map-by core instead.
>>>>
>>>>I have 2 nodes, each node has 2 sockets with 8 cores per socket.
>>>>
>>>>When I run osu_mbw_mr on 2 nodes with 32 MPI procs (command see below), I  
>>>>expect to see the unidirectional bandwidth of 4xFDR  link as a result  of 
>>>>this test.
>>>>
>>>>With IntelMPI I get 6367 MB/s, 
>>>>With ompi_yalla I get about 3744 MB/s (problem: it is a half of impi result)
>>>>With openmpi without mxm (ompi_clear) I get 6321 MB/s.
>>>>
>>>>How can I increase yalla results?
>>>>
>>>>IntelMPI cmd:
>>>>/opt/software/intel/impi/ 4.1.0.030/intel64/bin/mpiexec.hydra   
>>>>-machinefile machines.pYAvuK -n 32 -binding domain=core  
>>>>../osu_impi/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr -v -r=0
>>>>
>>>>ompi_yalla cmd:
>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-fca-v1.8.5/bin/mpirun
>>>>  -report-bindings -display-map -mca coll_hcoll_enable 1 -x  
>>>>HCOLL_MAIN_IB=mlx4_0:1 -x MXM_IB_PORTS=mlx4_0:1 -x  
>>>>MXM_SHM_KCOPY_MODE=off --mca pml yalla --map-by core --bind-to core  
>>>>--hostfile hostlist  
>>>>../osu_ompi_hcoll/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr -v  -r=0
>>>>
>>>>ompi_clear cmd:
>>>>/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-clear-v1.8.5/bin/mpirun
>>>>  -report-bindings -display-map --hostfile hostlist --map-by core  
>>>>--bind-to core  
>>>>../osu_ompi_clear/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr -v  -r=0
>>>>
>>>>I have attached output files to this letter:
>>>>ompi_clear.out, ompi_clear.err - contains ompi_clear results
>>>>ompi_yalla.out, ompi_yalla.err - contains ompi_yalla results
>>>>impi.out, impi.err - contains intel MPI results
>>>>
>>>>Best regards,
>>>>Timur
>>>>
>>>>Sunday, June 7, 2015, 16:11 +03:00 from Alina Sklarevich < 
>>>>ali...@dev.mellanox.co.il >:
>>>>>Hi Timur,
>>>>>
>>>>>After running the osu_mbw_mr benchmark in our lab, we observed that the 
>>>>>binding policy made a difference on the performance.
>>>>>Can you please rerun your ompi tests with the following added to your 
>>>>>command line? (one of them in each run)
>>>>>
>>>>>1. --map-by node --bind-to socket
>>>>>2. --map-by node --bind-to core
>>>>>
>>>>>Please attach your results.
>>>>>
>>>>>Thank you,
>>>>>Alina.
>>>>>
>>>>>On Thu, Jun 4, 2015 at 6:53 PM, Timur Ismagilov  < tismagi...@mail.ru > 
>>>>>wrote:
>>>>>>Hello, Alina.
>>>>>>1. Here