See this FAQ entry:
http://www.open-mpi.org/faq/?category=openfabrics#ib-xrc

-- YK

On 02-Aug-11 12:27 AM, Shamis, Pavel wrote:
> You may find some initial XRC tuning documentation here :
> 
> https://svn.open-mpi.org/trac/ompi/ticket/1260
> 
> Pavel (Pasha) Shamis
> ---
> Application Performance Tools Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
> 
> On Aug 1, 2011, at 11:41 AM, Yevgeny Kliteynik wrote:
> 
>> Hi,
>>
>> Please try running OMPI with XRC:
>>
>>   mpirun --mca btl openib... --mca btl_openib_receive_queues 
>> X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 ...
>>
>> XRC (eXtended Reliable Connection) decreases Open MPI's memory
>> consumption by reducing the number of QPs per machine.
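>>
>> If that long command line is a pain to type every time, the same
>> btl_openib_receive_queues parameter can also be set through the
>> environment or a per-user MCA parameter file (a minimal sketch, reusing
>> the same value as in the command above):
>>
>>   # export before calling mpirun:
>>   export OMPI_MCA_btl_openib_receive_queues=X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32
>>
>>   # or put it in $HOME/.openmpi/mca-params.conf:
>>   btl_openib_receive_queues = X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32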
>>
>> I'm not entirely sure that XRC is supported in OMPI 1.4, but I'm
>> sure it is in later versions of the 1.4 series (1.4.3).
>>
>> BTW, I do know that the command line is extremely user friendly
>> and completely intuitive... :-)
>> I'll have an XRC entry on the OMPI FAQ web page in a day or two,
>> where you can find more details about this issue.
>>
>> OMPI FAQ: http://www.open-mpi.org/faq/?category=openfabrics
>>
>> -- YK
>>
>> On 28-Jul-11 7:53 AM, 吕慧伟 wrote:
>>> Dear all,
>>>
>>> I have encountered a problem with running large jobs on an SMP cluster with 
>>> Open MPI 1.4.
>>> The application needs all-to-all communication: each process sends messages 
>>> to all other processes via MPI_Isend. It works fine when running 256 
>>> processes; the problem occurs when the number of processes is >= 512.
>>>
>>> The error message is:
>>>          mpirun -np 512 -machinefile machinefile.512 ./my_app
>>>          
>>> [gh30][[23246,1],311][connect/btl_openib_connect_oob.c:463:qp_create_one] 
>>> error creating qp errno says Cannot allocate memory
>>>          ...
>>>          
>>> [gh26][[23246,1],106][connect/btl_openib_connect_oob.c:809:rml_recv_cb] 
>>> error in endpoint reply start connect
>>>          
>>> [gh26][[23246,1],117][connect/btl_openib_connect_oob.c:463:qp_create_one] 
>>> error creating qp errno says Cannot allocate memory
>>>          ...
>>>          mpirun has exited due to process rank 424 with PID 26841 on
>>>          node gh31 exiting without calling "finalize".
>>>
>>> A related post 
>>> (http://www.open-mpi.org/community/lists/users/2009/07/9786.php) suggests 
>>> it may be running out of HCA QP resources. So I checked my system 
>>> configuration with 'ibv_devinfo -v' and got: 'max_qp: 261056' (the quick 
>>> check is shown below). In my case, running with 256 processes would be 
>>> under the limit: 256^2 = 65536 < 261056, but 512^2 = 262144 > 261056.
>>> My question is: how can I increase the max_qp number of InfiniBand, or 
>>> how can I get around this problem in MPI?
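>>>
>>> For reference, this is the quick check I did (counting one QP per pair 
>>> of processes is only a rough estimate; Open MPI may open more than one 
>>> QP per connection, so the real consumption can be higher):
>>>
>>>   $ ibv_devinfo -v | grep -w max_qp
>>>           max_qp:                         261056
>>>   $ echo $((256 * 256)) $((512 * 512))
>>>   65536 262144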
>>>
>>> Thanks in advance for any help you may give!
>>>
>>> Huiwei Lv
>>> PhD Student at Institute of Computing Technology
>>>
>>> -------------------------
>>> p.s. The system information is provided below:
>>> $ ompi_info -v ompi full --parsable
>>> ompi:version:full:1.4
>>> ompi:version:svn:r22285
>>> ompi:version:release_date:Dec 08, 2009
>>> $ uname -a
>>> Linux gh26 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 
>>> x86_64 x86_64 GNU/Linux
>>> $ ulimit -l
>>> unlimited
>>> $ ibv_devinfo -v
>>> hca_id: mlx4_0
>>>          transport:                      InfiniBand (0)
>>>          fw_ver:                         2.7.000
>>>          node_guid:                      00d2:c910:0001:b6c0
>>>          sys_image_guid:                 00d2:c910:0001:b6c3
>>>          vendor_id:                      0x02c9
>>>          vendor_part_id:                 26428
>>>          hw_ver:                         0xB0
>>>          board_id:                       MT_0D20110009
>>>          phys_port_cnt:                  1
>>>          max_mr_size:                    0xffffffffffffffff
>>>          page_size_cap:                  0xfffffe00
>>>          max_qp:                         261056
>>>          max_qp_wr:                      16351
>>>          device_cap_flags:               0x00fc9c76
>>>          max_sge:                        32
>>>          max_sge_rd:                     0
>>>          max_cq:                         65408
>>>          max_cqe:                        4194303
>>>          max_mr:                         524272
>>>          max_pd:                         32764
>>>          max_qp_rd_atom:                 16
>>>          max_ee_rd_atom:                 0
>>>          max_res_rd_atom:                4176896
>>>          max_qp_init_rd_atom:            128
>>>          max_ee_init_rd_atom:            0
>>>          atomic_cap:                     ATOMIC_HCA (1)
>>>          max_ee:                         0
>>>          max_rdd:                        0
>>>          max_mw:                         0
>>>          max_raw_ipv6_qp:                0
>>>          max_raw_ethy_qp:                1
>>>          max_mcast_grp:                  8192
>>>          max_mcast_qp_attach:            56
>>>          max_total_mcast_qp_attach:      458752
>>>          max_ah:                         0
>>>          max_fmr:                        0
>>>          max_srq:                        65472
>>>          max_srq_wr:                     16383
>>>          max_srq_sge:                    31
>>>          max_pkeys:                      128
>>>          local_ca_ack_delay:             15
>>>                  port:   1
>>>                          state:                  PORT_ACTIVE (4)
>>>                          max_mtu:                2048 (4)
>>>                          active_mtu:             2048 (4)
>>>                          sm_lid:                 86
>>>                          port_lid:               73
>>>                          port_lmc:               0x00
>>>                          link_layer:             IB
>>>                          max_msg_sz:             0x40000000
>>>                          port_cap_flags:         0x02510868
>>>                          max_vl_num:             8 (4)
>>>                          bad_pkey_cntr:          0x0
>>>                          qkey_viol_cntr:         0x0
>>>                          sm_sl:                  0
>>>                          pkey_tbl_len:           128
>>>                          gid_tbl_len:            128
>>>                          subnet_timeout:         18
>>>                          init_type_reply:        0
>>>                          active_width:           4X (2)
>>>                          active_speed:           10.0 Gbps (4)
>>>                          phys_state:             LINK_UP (5)
>>>                          GID[  0]:               
>>> fe80:0000:0000:0000:00d2:c910:0001:b6c1
>>>
>>> Related threads in the list:
>>> http://www.open-mpi.org/community/lists/users/2009/07/9786.php
>>> http://www.open-mpi.org/community/lists/users/2009/08/10456.php
>>>
>>>
>>>
> 
