See this FAQ entry: http://www.open-mpi.org/faq/?category=openfabrics#ib-xrc
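
As a rough illustration of the QP arithmetic discussed in the thread below, the following sketch compares the poster's plain-RC estimate of roughly np*np QPs against a per-node XRC estimate. This is only a back-of-the-envelope sketch: the 32-processes-per-node figure and the assumption that XRC shares connections per destination node rather than per destination process are illustrative assumptions, not numbers taken from this thread.

    /* Back-of-the-envelope QP estimates.  max_qp comes from the
     * ibv_devinfo output quoted below; everything else is an
     * assumption made for illustration only. */
    #include <stdio.h>

    int main(void)
    {
        const long max_qp = 261056;      /* ibv_devinfo: max_qp (quoted below)  */
        const long ppn    = 32;          /* ASSUMED processes per node          */
        const long nps[]  = { 256, 512 };

        for (int i = 0; i < 2; i++) {
            long np  = nps[i];
            long rc  = np * np;          /* poster's estimate with plain RC QPs */
            long xrc = np * (np / ppn);  /* rough estimate if connections are   */
                                         /* shared per destination node (XRC)   */
            printf("np=%3ld  RC ~ %6ld (%s max_qp=%ld)  XRC ~ %5ld\n",
                   np, rc, rc > max_qp ? "over" : "under", max_qp, xrc);
        }
        return 0;
    }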
-- YK

On 02-Aug-11 12:27 AM, Shamis, Pavel wrote:
> You may find some initial XRC tuning documentation here:
>
> https://svn.open-mpi.org/trac/ompi/ticket/1260
>
> Pavel (Pasha) Shamis
> ---
> Application Performance Tools Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
>
>
> On Aug 1, 2011, at 11:41 AM, Yevgeny Kliteynik wrote:
>
>> Hi,
>>
>> Please try running OMPI with XRC:
>>
>> mpirun --mca btl openib... --mca btl_openib_receive_queues
>> X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32 ...
>>
>> XRC (eXtended Reliable Connection) decreases the memory consumption
>> of Open MPI by decreasing the number of QPs per machine.
>>
>> I'm not entirely sure that XRC is supported in OMPI 1.4, but I'm
>> sure it is in later versions of the 1.4 series (1.4.3).
>>
>> BTW, I do know that the command line is extremely user friendly
>> and completely intuitive... :-)
>> I'll have an XRC entry on the OMPI FAQ web page in a day or two,
>> where you can find more details about this issue.
>>
>> OMPI FAQ: http://www.open-mpi.org/faq/?category=openfabrics
>>
>> -- YK
>>
>> On 28-Jul-11 7:53 AM, 吕慧伟 wrote:
>>> Dear all,
>>>
>>> I have encountered a problem concerning running large jobs on an SMP
>>> cluster with Open MPI 1.4.
>>> The application needs all-to-all communication: each process sends
>>> messages to all other processes via MPI_Isend (a minimal sketch of this
>>> pattern appears after this thread). It runs fine with 256 processes;
>>> the problem occurs when the process count is >= 512.
>>>
>>> The error message is:
>>> mpirun -np 512 -machinefile machinefile.512 ./my_app
>>>
>>> [gh30][[23246,1],311][connect/btl_openib_connect_oob.c:463:qp_create_one]
>>> error creating qp errno says Cannot allocate memory
>>> ...
>>>
>>> [gh26][[23246,1],106][connect/btl_openib_connect_oob.c:809:rml_recv_cb]
>>> error in endpoint reply start connect
>>>
>>> [gh26][[23246,1],117][connect/btl_openib_connect_oob.c:463:qp_create_one]
>>> error creating qp errno says Cannot allocate memory
>>> ...
>>> mpirun has exited due to process rank 424 with PID 26841 on
>>> node gh31 exiting without calling "finalize".
>>>
>>> A related post
>>> (http://www.open-mpi.org/community/lists/users/2009/07/9786.php) suggests
>>> it may be running out of HCA QP resources. So I checked my system
>>> configuration with 'ibv_devinfo -v' and got: 'max_qp: 261056'. In my case,
>>> running with 256 processes would be under the limit: 256^2 = 65536 < 261056,
>>> but 512^2 = 262144 > 261056.
>>> My question is: how can I increase the max_qp limit of InfiniBand, or how
>>> can I get around this problem in MPI?
>>>
>>> Thanks in advance for any help you may give!
>>>
>>> Huiwei Lv
>>> PhD Student at Institute of Computing Technology
>>>
>>> -------------------------
>>> p.s. The system information is provided below:
>>> $ ompi_info -v ompi full --parsable
>>> ompi:version:full:1.4
>>> ompi:version:svn:r22285
>>> ompi:version:release_date:Dec 08, 2009
>>> $ uname -a
>>> Linux gh26 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64
>>> x86_64 x86_64 GNU/Linux
>>> $ ulimit -l
>>> unlimited
>>> $ ibv_devinfo -v
>>> hca_id: mlx4_0
>>> transport: InfiniBand (0)
>>> fw_ver: 2.7.000
>>> node_guid: 00d2:c910:0001:b6c0
>>> sys_image_guid: 00d2:c910:0001:b6c3
>>> vendor_id: 0x02c9
>>> vendor_part_id: 26428
>>> hw_ver: 0xB0
>>> board_id: MT_0D20110009
>>> phys_port_cnt: 1
>>> max_mr_size: 0xffffffffffffffff
>>> page_size_cap: 0xfffffe00
>>> max_qp: 261056
>>> max_qp_wr: 16351
>>> device_cap_flags: 0x00fc9c76
>>> max_sge: 32
>>> max_sge_rd: 0
>>> max_cq: 65408
>>> max_cqe: 4194303
>>> max_mr: 524272
>>> max_pd: 32764
>>> max_qp_rd_atom: 16
>>> max_ee_rd_atom: 0
>>> max_res_rd_atom: 4176896
>>> max_qp_init_rd_atom: 128
>>> max_ee_init_rd_atom: 0
>>> atomic_cap: ATOMIC_HCA (1)
>>> max_ee: 0
>>> max_rdd: 0
>>> max_mw: 0
>>> max_raw_ipv6_qp: 0
>>> max_raw_ethy_qp: 1
>>> max_mcast_grp: 8192
>>> max_mcast_qp_attach: 56
>>> max_total_mcast_qp_attach: 458752
>>> max_ah: 0
>>> max_fmr: 0
>>> max_srq: 65472
>>> max_srq_wr: 16383
>>> max_srq_sge: 31
>>> max_pkeys: 128
>>> local_ca_ack_delay: 15
>>> port: 1
>>>     state: PORT_ACTIVE (4)
>>>     max_mtu: 2048 (4)
>>>     active_mtu: 2048 (4)
>>>     sm_lid: 86
>>>     port_lid: 73
>>>     port_lmc: 0x00
>>>     link_layer: IB
>>>     max_msg_sz: 0x40000000
>>>     port_cap_flags: 0x02510868
>>>     max_vl_num: 8 (4)
>>>     bad_pkey_cntr: 0x0
>>>     qkey_viol_cntr: 0x0
>>>     sm_sl: 0
>>>     pkey_tbl_len: 128
>>>     gid_tbl_len: 128
>>>     subnet_timeout: 18
>>>     init_type_reply: 0
>>>     active_width: 4X (2)
>>>     active_speed: 10.0 Gbps (4)
>>>     phys_state: LINK_UP (5)
>>>     GID[ 0]: fe80:0000:0000:0000:00d2:c910:0001:b6c1
>>>
>>> Related threads in the list:
>>> http://www.open-mpi.org/community/lists/users/2009/07/9786.php
>>> http://www.open-mpi.org/community/lists/users/2009/08/10456.php
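
For anyone who wants a small reproducer of the pattern described in the original report, here is a minimal sketch of the all-to-all MPI_Isend exchange. The buffer size, tag, and variable names are illustrative and not taken from the original application; the first exchange with each peer is what typically triggers connection (and hence QP) setup in the openib BTL.

    /* Minimal all-to-all via non-blocking point-to-point, as described in
     * the original report.  Compile with mpicc; the message size is arbitrary. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int msg_len = 1024;                      /* illustrative size */
        char *sendbuf = malloc((size_t)size * msg_len);
        char *recvbuf = malloc((size_t)size * msg_len);
        MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(MPI_Request));
        int nreq = 0;

        memset(sendbuf, rank & 0xff, (size_t)size * msg_len);

        /* Post one receive and one send for every other rank. */
        for (int peer = 0; peer < size; peer++) {
            if (peer == rank)
                continue;
            MPI_Irecv(recvbuf + (size_t)peer * msg_len, msg_len, MPI_CHAR,
                      peer, 0, MPI_COMM_WORLD, &reqs[nreq++]);
            MPI_Isend(sendbuf + (size_t)peer * msg_len, msg_len, MPI_CHAR,
                      peer, 0, MPI_COMM_WORLD, &reqs[nreq++]);
        }
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);

        free(sendbuf);
        free(recvbuf);
        free(reqs);
        MPI_Finalize();
        return 0;
    }

Launched with the mpirun command lines quoted above (for example, mpirun -np 512 -machinefile machinefile.512 ./a.out), this posts one pairwise exchange per peer, which exercises the per-connection QP creation discussed in the thread.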