Re: [OMPI users] COLL-ML ATTENTION

2018-07-05 Thread Deva
Mellanox HCOLL requires a valid IPoIB setup to use IB MCAST capabilities.
You can disable the MCAST features with:
-x HCOLL_ENABLE_MCAST_ALL=0
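
For example, added to an ordinary mpirun command line (a minimal sketch; the rank
count and the ./a.out binary are just placeholders for your own job):

  mpirun -np 2 -x HCOLL_ENABLE_MCAST_ALL=0 ./a.out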

On Wed, Jul 4, 2018 at 7:00 PM, larkym via users 
wrote:

> Good evening,
>
> Can someone help me understand the following error I am getting?
>
> [coll_ml_mca.c:471:hmca_coll_ml_register_params] COLL-ML ATTENTION:
> Available IPoIB interface was not found MCAST capability will be disabled.
>
> I am currently using the Open MPI 2.0 that comes with Mellanox. I am running
> CentOS 7.x. It is multihomed.
>
> Is MPI possibly using one of my NICs that does not support RoCE?
>
>
> Sent from my Verizon, Samsung Galaxy smartphone
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>



-- 


-Devendar
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-19 Thread Deva
Hi Martin

MXM's default transport is UD (MXM_TLS=ud,shm,self), which is scalable when
running large applications. RC (MXM_TLS=rc,shm,self) is recommended for
microbenchmarks and very small-scale applications.

Yes, the max seg size setting is too small.
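
If you want to raise it rather than work around it, something like the following
should work on most Linux systems (a sketch; the value is illustrative, needs root,
and has to be applied on every node):

  # allow larger System V shared memory segments (value is in bytes)
  sysctl -w kernel.shmmax=8589934592
  # make the change persistent across reboots
  echo "kernel.shmmax=8589934592" >> /etc/sysctl.conf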

Did you check any message-rate benchmarks (like osu_mbw_mr) with MXM?
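
For example (a sketch assuming the OSU micro-benchmarks are built against this
Open MPI and two nodes are available; placement options and the binary path are
placeholders):

  mpirun -np 2 --map-by node -mca pml yalla \
         -x MXM_RDMA_PORTS=mlx4_0:1 -x MXM_TLS=rc,shm,self ./osu_mbw_mr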

A virtualization environment will add some overhead. See a performance comparison
with MVAPICH here:
http://mvapich.cse.ohio-state.edu/performance/v-pt_to_pt/





On Fri, Aug 19, 2016 at 3:11 PM, Audet, Martin 
wrote:

> Hi Devendar,
>
> Thank you for your answer.
>
> Setting MXM_TLS=rc,shm,self does improve the speed of MXM (both latency
> and bandwidth):
>
> without MXM_TLS
>
> comm     lat_min    bw_max     bw_max
>          pingpong   pingpong   sendrecv
>          (us)       (MB/s)     (MB/s)
> ----------------------------------------
> openib   1.79       5827.93    11552.4
> mxm      2.23       5191.77    8201.76
> yalla    2.18       5200.55    8109.48
>
>
> with MXM_TLS=rc,shm,self
>
> comm     lat_min    bw_max     bw_max
>          pingpong   pingpong   sendrecv
>          (us)       (MB/s)     (MB/s)
> ----------------------------------------
> openib   1.79       6021.83    11529
> mxm      1.78       5936.92    11168.5
> yalla    1.78       5944.86    11375
>
>
> Note 1: MXM_RDMA_PORTS=mlx4_0:1 and the MCA parameter
> btl_openib_include_if=mlx4_0 were set in both cases.
>
> Note 2: The bandwidths reported are not very accurate. Bandwidth results
> can easily vary by 7% from one run to another.
>
> We see that the performance of MXM is now very similar to the performance
> of openib for these IMB tests.
>
> However an error is now reported a few times when MXM_TLS is set:
>
> sys.c:468  MXM  ERROR A new segment was to be created and size < SHMMIN or
> size > SHMMAX, or the new segment was to be created. A segment with given
> key existed, but size is greater than the size of that segment. Please
> check limits by 'ipcs -l'.
>
> "ipcs -l" reports among other things that:
>
>   max seg size (kbytes) = 32768
>
> By the way, is it too small ?
>
>
> Now if we run /opt/mellanox/mxm/mxm_perftest we get:
>
>                                         without   with
>                                         MXM_TLS   MXM_TLS
>   --------------------------------------------------------
>   avg send_lat (us)                     1.626     1.321
>
>   avg send_bw -s 400 (MB/s)             5219.51   5514.04
>   avg bidir send_bw -s 400 -b (MB/s)    5283.13   5514.45
>
> Note: the -b for bidirectional bandwidth doesn't seem to affect the result.
>
> Again, it is an improvement both in terms of latency and bandwidth.
>
> However, a warning is reported on the server side when MXM_TLS is set and
> the send_lat test is run:
>
> icb_ep.c:287   MXM  WARN  The min value for CIB_RX_QUEUE_LEN is 2048.
>
> Note: setting the undocumented env variable MXM_CIB_RX_QUEUE_LEN=2048
> removes the warning but doesn't affect the send latency.
>
>
> * * *
>
> So now the results are better: MXM performs as well as the regular openib
> in terms of latency and bandwidth (I didn't check the overlap capability
> though). But I'm not really impressed. I was expecting MXM (especially when
> used by yalla) to be a little better than openib. Also, the latency of
> openib, mxm and yalla at 1.8 us seems too high. With a configuration
> like ours, we should get something closer to 1 us.
>
> Does anyone have an idea?
>
> Don't forget that this cluster uses LXC containers with SR-IOV enabled for
> the Infiniband adapter.
>
> Martin Audet
>
>
> > Hi Martin,
> >
> > Can you check if it is any better with  "-x MXM_TLS=rc,shm,self" ?
> >
> > -Devendar
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>



-- 


-Devendar
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] An equivalent to btl_openib_include_if when MXM over Infiniband ?

2016-08-19 Thread Deva
Hi Martin,

Can you check if it is any better with "-x MXM_TLS=rc,shm,self"?
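
For example, combined with the MXM_RDMA_PORTS setting you are already using (a
sketch; the rank count, placement and benchmark binary are placeholders):

  mpirun -np 2 --map-by node -mca pml yalla \
         -x MXM_RDMA_PORTS=mlx4_0:1 -x MXM_TLS=rc,shm,self ./IMB-MPI1 PingPong Sendrecv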

-Devendar


On Tue, Aug 16, 2016 at 11:28 AM, Audet, Martin  wrote:

> Hi Josh,
>
> Thanks for your reply. I did try setting MXM_RDMA_PORTS=mlx4_0:1 for all
> my MPI processes
> and it did improve performance but the performance I obtain isn't
> completely satisfying.
>
> When I use IMB 4.1 pingpong and sendrecv benchmarks between two nodes I
> get using
> Open MPI 1.10.3:
>
> without MXM_RDMA_PORTS
>
>   comm     lat_min    bw_max     bw_max
>            pingpong   pingpong   sendrecv
>            (us)       (MB/s)     (MB/s)
>   ----------------------------------------
>   openib   1.79       5947.07    11534
>   mxm      2.51       5166.96    8079.18
>   yalla    2.47       5167.29    8278.15
>
>
> with MXM_RDMA_PORTS=mlx4_0:1
>
>   comm     lat_min    bw_max     bw_max
>            pingpong   pingpong   sendrecv
>            (us)       (MB/s)     (MB/s)
>   ----------------------------------------
>   openib   1.79       5827.93    11552.4
>   mxm      2.23       5191.77    8201.76
>   yalla    2.18       5200.55    8109.48
>
>
> openib means: pml=ob1        btl=openib,vader,self   btl_openib_include_if=mlx4_0
> mxm    means: pml=cm,ob1     mtl=mxm                 btl=vader,self
> yalla  means: pml=yalla,ob1  btl=vader,self
>
> lspci reports for our FDR Infiniband HCA:
>   Infiniband Controller: Mellanox Technologies MT27500 Family [ConnectX-3]
>
> and 16 lines like:
>   Infiniband Controller: Mellanox Technologies MT27500/MT27520 Family
> [ConnectX-3/ConnectX-3 Pro Virtual Function]
>
> the nodes use two octacore Xeon E5-2650v2 Ivybridge-EP 2.67 GHz sockets
>
> ofed_info reports that mxm version is 3.4.3cce223-0.32200
>
> As you can see the results are not very good. I would expect mxm and yalla
> to perform better than openib both in terms of latency and bandwidth (note:
> sendrecv bandwidth is full duplex). I would expect the yalla latency to be
> around 1.1 us, as shown here:
> https://www.open-mpi.org/papers/sc-2014/Open-MPI-SC14-BOF.pdf (page 33).
>
> I also ran mxm_perftest (located in /opt/mellanox/bin) and it reports the
> following
> latency between two nodes:
>
> without MXM_RDMA_PORTS             1.92 us
> with    MXM_RDMA_PORTS=mlx4_0:1    1.65 us
>
> Again I think we can expect a better latency with our configuration. 1.65
> us is not a
> very good result.
>
> Note however that the 0.27 us (1.92 - 1.65 = 0.27) reduction in raw mxm
> latency corresponds to the reduction in the Open MPI latencies observed
> above with mxm (2.51 - 2.23 = 0.28) and yalla (2.47 - 2.18 = 0.29).
>
> Another detail: everything is run inside LXC containers. Also SR-IOV is
> probably used.
>
> Does anyone have any idea what's wrong with our cluster?
>
> Martin Audet
>
>
> > Hi, Martin
> >
> > The environment variable:
> >
> > MXM_RDMA_PORTS=device:port
> >
> > is what you're looking for. You can specify a device/port pair on your
> OMPI
> > command line like:
> >
> > mpirun -np 2 ... -x MXM_RDMA_PORTS=mlx4_0:1 ...
> >
> >
> > Best,
> >
> > Josh
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>



-- 


-Devendar
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread Deva
Do you have "--disable-dlopen" in your configure options? This might force
coll_ml to be loaded first even with "-mca coll ^ml".
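
A quick way to check is to look at how your Open MPI was configured (a sketch,
assuming ompi_info from the same installation is in your PATH; the exact label may
vary slightly between versions):

  ompi_info | grep -i "configure command"

With --disable-dlopen the components are built into the main libraries, so you would
also see no mca_*.so files under <prefix>/lib/openmpi/.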

The next HPCX release is expected by the end of August.

-Devendar

On Wed, Aug 12, 2015 at 3:30 PM, David Shrader <dshra...@lanl.gov> wrote:

> I remember seeing those, but forgot about them. I am curious, though, why
> using '-mca coll ^ml' wouldn't work for me.
>
> We'll watch for the next HPCX release. Is there an ETA on when that
> release may happen? Thank you for the help!
> David
>
>
> On 08/12/2015 04:04 PM, Deva wrote:
>
> David,
>
> This is because of an hcoll symbol conflict with the ml coll module inside OMPI.
> HCOLL is derived from the ml module. This issue is fixed in the hcoll library and
> the fix will be available in the next HPCX release.
>
> Some earlier discussion on this issue:
> http://www.open-mpi.org/community/lists/users/2015/06/27154.php
> http://www.open-mpi.org/community/lists/devel/2015/06/17562.php
>
> -Devendar
>
> On Wed, Aug 12, 2015 at 2:52 PM, David Shrader <dshra...@lanl.gov> wrote:
>
>> Interesting... the seg faults went away:
>>
>> [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
>> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>> [1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could
>> not open the KNEM device file at /dev/knem : No such file or direc
>> tory. Won't use knem.
>> [1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could
>> not open the KNEM device file at /dev/knem : No such file or direc
>> tory. Won't use knem.
>> 0: Running on host zo-fe1.lanl.gov
>> 0: We have 2 processors
>> 0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty
>>
>> This implies to me that some other library is being used instead of
>> /usr/lib64/libhcoll.so, but I am not sure how that could be...
>>
>> Thanks,
>> David
>>
>> On 08/12/2015 03:30 PM, Deva wrote:
>>
>> Hi David,
>>
>> I tried the same tarball on OFED-1.5.4.1 and I could not reproduce the
>> issue.  Can you do one more quick test, setting LD_PRELOAD to the hcoll lib?
>>
>> $ LD_PRELOAD=<path-to-libhcoll.so> mpirun -n 2 -mca coll ^ml ./a.out
>>
>> -Devendar
>>
>> On Wed, Aug 12, 2015 at 12:52 PM, David Shrader < <dshra...@lanl.gov>
>> dshra...@lanl.gov> wrote:
>>
>>> The admin that rolled the hcoll rpm that we're using (and got it in
>>> system space) said that she got it from
>>> hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
>>>
>>> Thanks,
>>> David
>>>
>>>
>>> On 08/12/2015 10:51 AM, Deva wrote:
>>>
>>> From where did you grab this HCOLL lib?  MOFED or HPCX? what version?
>>>
>>> On Wed, Aug 12, 2015 at 9:47 AM, David Shrader < <dshra...@lanl.gov>
>>> dshra...@lanl.gov> wrote:
>>>
>>>> Hey Devendar,
>>>>
>>>> It looks like I still get the error:
>>>>
>>>> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
>>>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>>>> [1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN
>>>>  Could not open the KNEM device file at /dev/knem : No such file or direc
>>>> tory. Won't use knem.
>>>> [1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN
>>>>  Could not open the KNEM device file at /dev/knem : No such file or direc
>>>> tory. Won't use knem.
>>>> [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
>>>> [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
>>>>  backtrace 
>>>> 2 0x00056cdc mxm_handle_error()
>>>>  
>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
>>>> pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>>>
>>>> 3 0x00056e4c mxm_error_signal_handler()
>>>>  
>>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
>>>> ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>>>
>>>> 4 0x000326a0 killpg()  ??:0
>>>> 5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>>> 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>>> 7 0x00032ee3 hmca_col

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread Deva
David,

This is because of an hcoll symbol conflict with the ml coll module inside OMPI.
HCOLL is derived from the ml module. This issue is fixed in the hcoll library and
the fix will be available in the next HPCX release.

Some earlier discussion on this issue:
http://www.open-mpi.org/community/lists/users/2015/06/27154.php
http://www.open-mpi.org/community/lists/devel/2015/06/17562.php
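
If you want to confirm the clash on your install, one rough check is to look for
exported symbols that are defined in both libraries (a sketch; the paths below are
assumptions for a system-space hcoll and an OMPI build with dlopen enabled):

  nm -D --defined-only /usr/lib64/libhcoll.so | awk '{print $3}' | sort > /tmp/hcoll.syms
  nm -D --defined-only <ompi-prefix>/lib/openmpi/mca_coll_ml.so | awk '{print $3}' | sort > /tmp/ml.syms
  comm -12 /tmp/hcoll.syms /tmp/ml.syms   # names listed here are defined in both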

-Devendar

On Wed, Aug 12, 2015 at 2:52 PM, David Shrader <dshra...@lanl.gov> wrote:

> Interesting... the seg faults went away:
>
> [dshrader@zo-fe1 tests]$ export LD_PRELOAD=/usr/lib64/libhcoll.so
> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
> [1439416182.732720] [zo-fe1:14690:0] shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> [1439416182.733640] [zo-fe1:14689:0] shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> 0: Running on host zo-fe1.lanl.gov
> 0: We have 2 processors
> 0: Hello 1! Processor 1 on host zo-fe1.lanl.gov reporting for duty
>
> This implies to me that some other library is being used instead of
> /usr/lib64/libhcoll.so, but I am not sure how that could be...
>
> Thanks,
> David
>
> On 08/12/2015 03:30 PM, Deva wrote:
>
> Hi David,
>
> I tried the same tarball on OFED-1.5.4.1 and I could not reproduce the issue.
> Can you do one more quick test, setting LD_PRELOAD to the hcoll lib?
>
> $ LD_PRELOAD=<path-to-libhcoll.so> mpirun -n 2 -mca coll ^ml ./a.out
>
> -Devendar
>
> On Wed, Aug 12, 2015 at 12:52 PM, David Shrader <dshra...@lanl.gov> wrote:
>
>> The admin that rolled the hcoll rpm that we're using (and got it in
>> system space) said that she got it from
>> hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64.tar.
>>
>> Thanks,
>> David
>>
>>
>> On 08/12/2015 10:51 AM, Deva wrote:
>>
>> From where did you grab this HCOLL lib?  MOFED or HPCX? what version?
>>
>> On Wed, Aug 12, 2015 at 9:47 AM, David Shrader < <dshra...@lanl.gov>
>> dshra...@lanl.gov> wrote:
>>
>>> Hey Devendar,
>>>
>>> It looks like I still get the error:
>>>
>>> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
>>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>>> [1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could
>>> not open the KNEM device file at /dev/knem : No such file or direc
>>> tory. Won't use knem.
>>> [1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could
>>> not open the KNEM device file at /dev/knem : No such file or direc
>>> tory. Won't use knem.
>>> [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
>>> [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
>>>  backtrace 
>>> 2 0x00056cdc mxm_handle_error()
>>>  
>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
>>> pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>>
>>> 3 0x00056e4c mxm_error_signal_handler()
>>>  
>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
>>> ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>>>
>>> 4 0x000326a0 killpg()  ??:0
>>> 5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
>>> 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
>>> 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()
>>>  coll_ml_module.c:0
>>> 8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
>>> 9 0x0006ace9 hcoll_create_context()  ??:0
>>> 10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
>>> 11 0x000f684e mca_coll_base_comm_select()  ??:0
>>> 12 0x00073fc4 ompi_mpi_init()  ??:0
>>> 13 0x00092ea0 PMPI_Init()  ??:0
>>> 14 0x004009b6 main()  ??:0
>>> 15 0x0001ed5d __libc_start_main()  ??:0
>>> 16 0x004008c9 _start()  ??:0
>>> ===
>>>  backtrace 
>>> 2 0x00056cdc mxm_handle_error()
>>>  
>>> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
>>> pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>>>
>>> 3 0x00056e4c mxm_error_signal_handler()

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread Deva
From where did you grab this HCOLL lib? MOFED or HPCX? What version?

On Wed, Aug 12, 2015 at 9:47 AM, David Shrader <dshra...@lanl.gov> wrote:

> Hey Devendar,
>
> It looks like I still get the error:
>
> [dshrader@zo-fe1 tests]$ mpirun -n 2 -mca coll ^ml ./a.out
> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
> [1439397957.351764] [zo-fe1:14678:0] shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> [1439397957.352704] [zo-fe1:14677:0] shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> [zo-fe1:14677:0] Caught signal 11 (Segmentation fault)
> [zo-fe1:14678:0] Caught signal 11 (Segmentation fault)
>  backtrace 
> 2 0x00056cdc mxm_handle_error()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
> pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>
> 3 0x00056e4c mxm_error_signal_handler()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
> ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>
> 4 0x000326a0 killpg()  ??:0
> 5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
> 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
> 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()
>  coll_ml_module.c:0
> 8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
> 9 0x0006ace9 hcoll_create_context()  ??:0
> 10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
> 11 0x000f684e mca_coll_base_comm_select()  ??:0
> 12 0x00073fc4 ompi_mpi_init()  ??:0
> 13 0x00092ea0 PMPI_Init()  ??:0
> 14 0x004009b6 main()  ??:0
> 15 0x0001ed5d __libc_start_main()  ??:0
> 16 0x004008c9 _start()  ??:0
> ===
>  backtrace 
> 2 0x00056cdc mxm_handle_error()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
> pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>
> 3 0x00056e4c mxm_error_signal_handler()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
> ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>
> 4 0x000326a0 killpg()  ??:0
> 5 0x000b82cb base_bcol_basesmuma_setup_library_buffers()  ??:0
> 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
> 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()
>  coll_ml_module.c:0
> 8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
> 9 0x0006ace9 hcoll_create_context()  ??:0
> 10 0x000f9706 mca_coll_hcoll_comm_query()  ??:0
> 11 0x000f684e mca_coll_base_comm_select()  ??:0
> 12 0x00073fc4 ompi_mpi_init()  ??:0
> 13 0x00092ea0 PMPI_Init()  ??:0
> 14 0x004009b6 main()  ??:0
> 15 0x0001ed5d __libc_start_main()  ??:0
> 16 0x004008c9 _start()  ??:0
> ===
> --
> mpirun noticed that process rank 1 with PID 14678 on node zo-fe1 exited on
> signal 11 (Segmentation fault).
> --
>
> Thanks,
> David
>
> On 08/12/2015 10:42 AM, Deva wrote:
>
> Hi David,
>
> This issue is from the hcoll library. It could be because of a symbol conflict
> with the ml module.  This was fixed recently in HCOLL.  Can you try with "-mca
> coll ^ml" and see if this workaround works in your setup?
>
> -Devendar
>
> On Wed, Aug 12, 2015 at 9:30 AM, David Shrader <dshra...@lanl.gov> wrote:
>
>> Hello Gilles,
>>
>> Thank you very much for the patch! It is much more complete than mine.
>> Using that patch and re-running autogen.pl, I am able to build 1.8.8
>> with './configure --with-hcoll' without errors.
>>
>> I do have issues when it comes to running 1.8.8 with hcoll built in,
>> however. In my quick sanity test of running a basic parallel hello world C
>> program, I get the following:
>>
>> [dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
>> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
>> [1439390789.039197] [zo-fe1:31354:0] shm.c:65   MXM  WARN  Could
>> not open the KNEM device file at /dev/knem : No such file or direc
>> tory. Won't use knem.
>> [1439390789.040265] [zo-fe1:31353:0] shm.c:6

Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-12 Thread Deva
Hi David,

This issue is from the hcoll library. It could be because of a symbol conflict
with the ml module. This was fixed recently in HCOLL. Can you try with "-mca
coll ^ml" and see if this workaround works in your setup?

-Devendar

On Wed, Aug 12, 2015 at 9:30 AM, David Shrader  wrote:

> Hello Gilles,
>
> Thank you very much for the patch! It is much more complete than mine.
> Using that patch and re-running autogen.pl, I am able to build 1.8.8 with
> './configure --with-hcoll' without errors.
>
> I do have issues when it comes to running 1.8.8 with hcoll built in,
> however. In my quick sanity test of running a basic parallel hello world C
> program, I get the following:
>
> [dshrader@zo-fe1 tests]$ mpirun -n 2 ./a.out
> App launch reported: 1 (out of 1) daemons - 2 (out of 2) procs
> [1439390789.039197] [zo-fe1:31354:0] shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> [1439390789.040265] [zo-fe1:31353:0] shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or direc
> tory. Won't use knem.
> [zo-fe1:31353:0] Caught signal 11 (Segmentation fault)
> [zo-fe1:31354:0] Caught signal 11 (Segmentation fault)
>  backtrace 
> 2 0x00056cdc mxm_handle_error()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
> pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>
> 3 0x00056e4c mxm_error_signal_handler()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
> ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>
> 4 0x000326a0 killpg()  ??:0
> 5 0x000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
> 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
> 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()
>  coll_ml_module.c:0
> 8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
> 9 0x0006ace9 hcoll_create_context()  ??:0
> 10 0x000fa626 mca_coll_hcoll_comm_query()  ??:0
> 11 0x000f776e mca_coll_base_comm_select()  ??:0
> 12 0x00074ee4 ompi_mpi_init()  ??:0
> 13 0x00093dc0 PMPI_Init()  ??:0
> 14 0x004009b6 main()  ??:0
> 15 0x0001ed5d __libc_start_main()  ??:0
> 16 0x004008c9 _start()  ??:0
> ===
>  backtrace 
> 2 0x00056cdc mxm_handle_error()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_root/src/h
> pcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:641
>
> 3 0x00056e4c mxm_error_signal_handler()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u6-x86-64-MOFED-CHECKER/hpcx_ro
> ot/src/hpcx-v1.3.336-gcc-OFED-1.5.4.1-redhat6.6-x86_64/mxm-v3.3/src/mxm/util/debug/debug.c:616
>
> 4 0x000326a0 killpg()  ??:0
> 5 0x000b91eb base_bcol_basesmuma_setup_library_buffers()  ??:0
> 6 0x000969e3 hmca_bcol_basesmuma_comm_query()  ??:0
> 7 0x00032ee3 hmca_coll_ml_tree_hierarchy_discovery()
>  coll_ml_module.c:0
> 8 0x0002fda2 hmca_coll_ml_comm_query()  ??:0
> 9 0x0006ace9 hcoll_create_context()  ??:0
> 10 0x000fa626 mca_coll_hcoll_comm_query()  ??:0
> 11 0x000f776e mca_coll_base_comm_select()  ??:0
> 12 0x00074ee4 ompi_mpi_init()  ??:0
> 13 0x00093dc0 PMPI_Init()  ??:0
> 14 0x004009b6 main()  ??:0
> 15 0x0001ed5d __libc_start_main()  ??:0
> 16 0x004008c9 _start()  ??:0
> ===
> --
> mpirun noticed that process rank 0 with PID 31353 on node zo-fe1 exited on
> signal 11 (Segmentation fault).
> --
>
> I do not get this message with only 1 process.
>
> I am using hcoll 3.2.748. Could this be an issue with hcoll itself or
> something with my ompi build?
>
> Thanks,
> David
>
> On 08/12/2015 12:26 AM, Gilles Gouaillardet wrote:
>
> Thanks David,
>
> i made a PR for the v1.8 branch at
> https://github.com/open-mpi/ompi-release/pull/492
>
> the patch is attached (it required some back-porting)
>
> Cheers,
>
> Gilles
>
> On 8/12/2015 4:01 AM, David Shrader wrote:
>
> I have cloned Gilles' topic/hcoll_config branch and, after running
> autogen.pl, have found that './configure --with-hcoll' does indeed work
> now. I used Gilles' branch as I wasn't sure how best to get the pull
> request changes in to my own clone of master. It looks like the proper
> checks are happening, too:
>
> --- MCA component coll:hcoll (m4 configuration macro)
> checking for MCA component coll:hcoll compile mode... dso
> checking --with-hcoll value... simple ok (unspecified)
> checking hcoll/api/hcoll_api.h usability... yes
> checking 

Re: [OMPI users] Max Registerable Memory Warning

2015-02-08 Thread Deva
What OFED version are you running? If not the latest, is it possible to upgrade
to the latest OFED? Otherwise, can you try the latest OMPI release (>= v1.8.4),
where this warning is ignored on older OFEDs?
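
For example, to check what is currently installed (a sketch; ofed_info comes with
the OFED installation and ompi_info with Open MPI):

  ofed_info -s         # prints the OFED release string
  ompi_info --version  # prints the Open MPI version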

-Devendar

On Sun, Feb 8, 2015 at 12:37 PM, Saliya Ekanayake  wrote:

> Hi,
>
> OpenMPI reports that the OpebFabrics is allowing it to register only part
> of the memory (error attached). I've looked at the suggested FAQ entry and
> see that *ulimit *settings are all unlimited (pasted below). Could you
> please give some suggestion to correct this?
>
> ulimit -a
> core file size  (blocks, -c) 0
> data seg size   (kbytes, -d) unlimited
> scheduling priority (-e) 0
> file size   (blocks, -f) unlimited
> pending signals (-i) 386359
> max locked memory   (kbytes, -l) unlimited
> max memory size (kbytes, -m) unlimited
> open files  (-n) 1024
> pipe size(512 bytes, -p) 8
> POSIX message queues (bytes, -q) 819200
> real-time priority  (-r) 0
> stack size  (kbytes, -s) 10240
> cpu time   (seconds, -t) unlimited
> max user processes  (-u) 386359
> virtual memory  (kbytes, -v) unlimited
> file locks  (-x) unlimited
>
> Thank you,
> Saliya
>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/02/26307.php
>



-- 


-Devendar


Re: [OMPI users] Icreasing OFED registerable memory

2015-01-06 Thread Deva
Can you read this thread and see if the following Grid Engine parameter setting
is missing?

http://marc.info/?l=npaci-rocks-discussion&m=135844781420954&w=2

---
Check that your GridEngine configuration has the following:

execd_params H_MEMORYLOCKED=infinity

The command qconf -sconf will display the current configuration.
--
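
To verify (a sketch; these are standard Grid Engine commands, but your site setup
may differ):

  qconf -sconf | grep execd_params      # should include H_MEMORYLOCKED=infinity
  qrsh bash -c 'ulimit -l'              # run via the scheduler; should print "unlimited"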

-Devendar

On Tue, Jan 6, 2015 at 1:37 PM, Deva <devendar.bure...@gmail.com> wrote:

> Hi Waleed,
>
> --
>Memlock limit: 65536
> --
>
> Such a low limit is likely due to the per-user locked memory limit. Can you
> make sure it is set to "unlimited" on all nodes ("ulimit -l unlimited")?
>
> -Devendar
>
> On Tue, Jan 6, 2015 at 3:42 AM, Waleed Lotfy <waleed.lo...@bibalex.org>
> wrote:
>
>> Hi guys,
>>
>> Sorry for getting back so late, but we ran into some problems during the
>> installation process and as soon as the system came up I tested the new
>> versions for the problem but it showed another memory related warning.
>>
>> --
>> The OpenFabrics (openib) BTL failed to initialize while trying to
>> allocate some locked memory.  This typically can indicate that the
>> memlock limits are set too low.  For most HPC installations, the
>> memlock limits should be set to "unlimited".  The failure occured
>> here:
>>
>>   Local host:comp003.local
>>   OMPI source:   btl_openib_component.c:1200
>>   Function:  ompi_free_list_init_ex_new()
>>   Device:mlx4_0
>>   Memlock limit: 65536
>>
>> You may need to consult with your system administrator to get this
>> problem fixed.  This FAQ entry on the Open MPI web site may also be
>> helpful:
>>
>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>> --
>> --
>> WARNING: There was an error initializing an OpenFabrics device.
>>
>>   Local host:   comp003.local
>>   Local device: mlx4_0
>> --
>>
>> <<>>
>>
>> My current running versions:
>>
>> OpenMPI: 1.6.4
>> OFED-internal-2.3-2
>>
>> I checked /etc/security/limits.d/, the scheduler's configurations (grid
>> engine) and tried adding the following line to /etc/modprobe.d/mlx4_core:
>> 'options mlx4_core log_num_mtt=22 log_mtts_per_seg=1' as suggested by Gus.
>>
>> I am running out of ideas here, so please any help is appreciated.
>>
>> P.S. I am not sure if I should open a new thread with this issue or
>> continue with the current one, so please advise.
>>
>> Waleed Lotfy
>> Bibliotheca Alexandrina
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/01/26107.php
>>
>
>
>
> --
>
>
> -Devendar
>



-- 


-Devendar


Re: [OMPI users] Icreasing OFED registerable memory

2015-01-06 Thread Deva
Hi Waleed,

--
   Memlock limit: 65536
--

Such a low limit is likely due to the per-user locked memory limit. Can you make
sure it is set to "unlimited" on all nodes ("ulimit -l unlimited")?
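
If it is not, the usual place to set it is /etc/security/limits.conf (or a file
under /etc/security/limits.d/) on every compute node, for example (a sketch;
restrict "*" to the relevant users or groups if you prefer):

  *  soft  memlock  unlimited
  *  hard  memlock  unlimited

Daemons started at boot (e.g. the resource manager) may need to be restarted to
pick up the new limit.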

-Devendar

On Tue, Jan 6, 2015 at 3:42 AM, Waleed Lotfy 
wrote:

> Hi guys,
>
> Sorry for getting back so late, but we ran into some problems during the
> installation process and as soon as the system came up I tested the new
> versions for the problem but it showed another memory related warning.
>
> --
> The OpenFabrics (openib) BTL failed to initialize while trying to
> allocate some locked memory.  This typically can indicate that the
> memlock limits are set too low.  For most HPC installations, the
> memlock limits should be set to "unlimited".  The failure occured
> here:
>
>   Local host:comp003.local
>   OMPI source:   btl_openib_component.c:1200
>   Function:  ompi_free_list_init_ex_new()
>   Device:mlx4_0
>   Memlock limit: 65536
>
> You may need to consult with your system administrator to get this
> problem fixed.  This FAQ entry on the Open MPI web site may also be
> helpful:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> --
> --
> WARNING: There was an error initializing an OpenFabrics device.
>
>   Local host:   comp003.local
>   Local device: mlx4_0
> --
>
> <<>>
>
> My current running versions:
>
> OpenMPI: 1.6.4
> OFED-internal-2.3-2
>
> I checked /etc/security/limits.d/, the scheduler's configurations (grid
> engine) and tried adding the following line to /etc/modprobe.d/mlx4_core:
> 'options mlx4_core log_num_mtt=22 log_mtts_per_seg=1' as suggested by Gus.
>
> I am running out of ideas here, so please any help is appreciated.
>
> P.S. I am not sure if I should open a new thread with this issue or
> continue with the current one, so please advise.
>
> Waleed Lotfy
> Bibliotheca Alexandrina
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/01/26107.php
>



-- 


-Devendar


Re: [OMPI users] Icreasing OFED registerable memory

2014-12-29 Thread Deva
Hi Waleed,

It is highly recommended to upgrade to the latest OFED. Meanwhile, can you try
the latest OMPI release (v1.8.4), where this warning is ignored on older OFEDs?
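
To see what the driver is currently using, you can read the module parameters (a
sketch; these sysfs paths exist for mlx4_core on most OFED/kernel combinations,
though very old OFEDs may not expose log_num_mtt):

  cat /sys/module/mlx4_core/parameters/log_num_mtt
  cat /sys/module/mlx4_core/parameters/log_mtts_per_seg

Registerable memory is roughly (2^log_num_mtt) * (2^log_mtts_per_seg) * PAGE_SIZE,
so these two values together explain the 8192 MiB limit in the warning.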

-Devendar

On Sun, Dec 28, 2014 at 6:03 AM, Waleed Lotfy 
wrote:

> I have a bunch of 8 GB memory nodes in a cluster that were recently
> upgraded to 16 GB. When I run any jobs I get the following warning:
> --
> WARNING: It appears that your OpenFabrics subsystem is configured to
> only
> allow registering part of your physical memory.  This can cause MPI jobs
> to
> run with erratic performance, hang, and/or crash.
>
> This may be caused by your OpenFabrics vendor limiting the amount of
> physical memory that can be registered.  You should investigate the
> relevant Linux kernel module parameters that control how much physical
> memory can be registered, and increase them to allow registering all
> physical memory on your machine.
>
> See this Open MPI FAQ item for more information on these Linux kernel
> module
> parameters:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
>   Local host:  comp022.local
>   Registerable memory: 8192 MiB
>   Total memory:16036 MiB
>
> Your MPI job will continue, but may be behave poorly and/or hang.
> --
>
> Searching for a fix to this issue, I found that I have to set
> log_num_mtt within the kernel module, so I added this line to
> modprobe.conf:
>
> options mlx4_core log_num_mtt=21
>
> But then ib0 interface fails to start showing this error:
> ib_ipoib device ib0 does not seem to be present, delaying
> initialization.
>
> Reducing the value of log_num_mtt to 20, allows ib0 to start but shows
> the registerable memory of 8 GB warning.
>
> I am using OFED 1.3.1, I know it is pretty old and we are planning to
> upgrade soon.
>
> Output on all nodes for 'ompi_info  -v ompi full --parsable':
>
> ompi:version:full:1.2.7
> ompi:version:svn:r19401
> orte:version:full:1.2.7
> orte:version:svn:r19401
> opal:version:full:1.2.7
> opal:version:svn:r19401
>
> Any help would be appreciated.
>
> Waleed Lotfy
> Bibliotheca Alexandrina
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/12/26076.php
>



-- 


-Devendar