Hi,

Try the following QP parameters, which only use shared receive queues:

  -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
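For example (just a sketch, assuming a 48-rank run of an "hpcc" binary; substitute your own launch line):

  mpirun -np 48 -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 ./hpcc

or, if you want it as the default for every job, put

  btl_openib_receive_queues = S,12288,128,64,32:S,65536,128,64,32

in $HOME/.openmpi/mca-params.conf. I've also put some rough numbers on why the per-peer queue is what bites at 48 ranks per node below your quoted message.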
Samuel K. Gutierrez
Los Alamos National Laboratory

On May 19, 2011, at 5:28 AM, Robert Horton wrote:

> Hi,
>
> I'm having problems getting the MPIRandomAccess part of the HPCC
> benchmark to run with more than 32 processes on each node (each node has
> 4 x AMD 6172 so 48 cores total). Once I go past 32 processes I get an
> error like:
>
> [compute-1-13.local][[5637,1],18][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
> error creating qp errno says Cannot allocate memory
> [compute-1-13.local][[5637,1],18][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:815:rml_recv_cb]
> error in endpoint reply start connect
> [compute-1-13.local:06117] [[5637,0],0]-[[5637,1],18] mca_oob_tcp_msg_recv:
> readv failed: Connection reset by peer (104)
> [compute-1-13.local:6137] *** An error occurred in MPI_Isend
> [compute-1-13.local:6137] *** on communicator MPI_COMM_WORLD
> [compute-1-13.local:6137] *** MPI_ERR_OTHER: known error not in list
> [compute-1-13.local:6137] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [compute-1-13.local][[5637,1],26][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one]
> error creating qp errno says Cannot allocate memory
> [[5637,1],66][../../../../../ompi/mca/btl/openib/btl_openib_component.c:3227:handle_wc]
> from compute-1-13.local to: compute-1-13 error polling LP CQ with status
> RETRY EXCEEDED ERROR status number 12 for wr_id 278870912 opcode
>
> I've tried changing btl_openib_receive_queues from
> P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32
> to
> P,128,512,256,512:S,2048,512,256,32:S,12288,512,256,32:S,65536,512,256,32
>
> doing this lets the code run without the error, but it does so extremely
> slowly - I'm also seeing errors in dmesg such as:
>
> CPU 12:
> Modules linked in: nfs fscache nfs_acl blcr(U) blcr_imports(U) autofs4
> ipmi_devintf ipmi_si ipmi_msghandler lockd sunrpc ip_conntrack_netbios_ns
> ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables
> ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables cpufreq_ondemand
> powernow_k8 freq_table rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U)
> ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6 xfrm_nalgo crypto_api
> ib_uverbs(U) ib_umad(U) iw_nes(U) iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_en(U)
> mlx4_core(U) ib_mthca(U) dm_mirror dm_multipath scsi_dh video hwmon
> backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc
> lp parport joydev shpchp sg i2c_piix4 i2c_core ib_qib(U) dca ib_mad(U)
> ib_core(U) igb 8021q serio_raw pcspkr dm_raid45 dm_message dm_region_hash
> dm_log dm_mod dm_mem_cache ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd
> ohci_hcd ehci_hcd
> Pid: 3980, comm: qib/12 Tainted: G 2.6.18-164.6.1.el5 #1
> RIP: 0010:[<ffffffff80094409>] [<ffffffff80094409>] tasklet_action+0x90/0xfd
> RSP: 0018:ffff810c2f1bff40 EFLAGS: 00000246
> RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff810c2f1bff30
> RDX: 0000000000000000 RSI: ffff81042f063400 RDI: ffffffff8030d180
> RBP: ffff810c2f1bfec0 R08: 0000000000000001 R09: ffff8104aec2d000
> R10: ffff810c2f1bff00 R11: ffff810c2f1bff00 R12: ffffffff8005dc8e
> R13: ffff81042f063480 R14: ffffffff80077874 R15: ffff810c2f1bfec0
> FS: 00002b20829592e0(0000) GS:ffff81042f186bc0(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00002b2080b70720 CR3: 0000000000201000 CR4: 00000000000006e0
>
> Call Trace:
> <IRQ> [<ffffffff8001235a>] __do_softirq+0x89/0x133
> [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
> [<ffffffff8006cb20>] do_softirq+0x2c/0x85
> [<ffffffff8005dc8e>] apic_timer_interrupt+0x66/0x6c
> <EOI> [<ffffffff800da30c>] __kmalloc+0x97/0x9f
> [<ffffffff88220d8b>] :ib_qib:qib_verbs_send+0xdb3/0x104a
> [<ffffffff80064b20>] _spin_unlock_irqrestore+0x8/0x9
> [<ffffffff881f66ca>] :ib_qib:qib_make_rc_req+0xbb1/0xbbf
> [<ffffffff881f5b19>] :ib_qib:qib_make_rc_req+0x0/0xbbf
> [<ffffffff881f8187>] :ib_qib:qib_do_send+0x0/0x950
> [<ffffffff881f8aa1>] :ib_qib:qib_do_send+0x91a/0x950
> [<ffffffff8002e2e3>] __wake_up+0x38/0x4f
> [<ffffffff881f8187>] :ib_qib:qib_do_send+0x0/0x950
> [<ffffffff8004d7fb>] run_workqueue+0x94/0xe4
> [<ffffffff8004a043>] worker_thread+0x0/0x122
> [<ffffffff8009f9f0>] keventd_create_kthread+0x0/0xc4
> [<ffffffff8004a133>] worker_thread+0xf0/0x122
> [<ffffffff8008c3bd>] default_wake_function+0x0/0xe
> [<ffffffff8009f9f0>] keventd_create_kthread+0x0/0xc4
> [<ffffffff8003297c>] kthread+0xfe/0x132
> [<ffffffff8005dfb1>] child_rip+0xa/0x11
> [<ffffffff8009f9f0>] keventd_create_kthread+0x0/0xc4
> [<ffffffff8003287e>] kthread+0x0/0x132
> [<ffffffff8005dfa7>] child_rip+0x0/0x11
>
> Any thoughts on how to proceed?
>
> I'm running OpenMPI 1.4.3 compiled with gcc 4.1.2 and OFED 1.5.3.1
>
> Thanks,
> Rob
> --
> Robert Horton
> System Administrator (Research Support) - School of Mathematical Sciences
> Queen Mary, University of London
> r.hor...@qmul.ac.uk - +44 (0) 20 7882 7345
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
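For what it's worth, here is my rough reading of why the per-peer (P) queue is the thing that hurts at 48 ranks per node; treat the field meanings as recalled from the openib BTL documentation rather than verified, and the numbers as a back-of-the-envelope sketch. Each colon-separated entry in btl_openib_receive_queues gets its own QP on every connection, and a P entry additionally pre-posts its receive buffers separately for every connection, whereas S (SRQ) entries share one pool of receive buffers across all peers. With the default four-entry spec, every rank therefore needs 4 QPs per peer it talks to over IB, so a node ends up with roughly

  48 ranks/node x N peers x 4 QPs/connection = 192 x N QPs

and your enlarged P,128,512,... spec adds roughly 512 x 128 bytes = 64 KB of pre-posted, registered receive buffers per connection on top of that. The SRQ-only spec above halves the QP count and removes the per-connection buffer growth, which is usually enough to get past the "Cannot allocate memory" from qp_create_one.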