Re: [OMPI users] mlx4 error - looking for guidance

Jeff Layton Thu, 5 Mar 2009 17:15:25 -0500

Oops. I ran it on the head node and not the compute node. Here is the
output from a compute node:


hca_id: mlx4_0
       fw_ver:                         2.3.000
       node_guid:                      0018:8b90:97fe:1b6d
       sys_image_guid:                 0018:8b90:97fe:1b70
       vendor_id:                      0x02c9
       vendor_part_id:                 25418
       hw_ver:                         0xA0
       board_id:                       DEL08C0000001
       phys_port_cnt:                  2
       max_mr_size:                    0xffffffffffffffff
       page_size_cap:                  0xfffff000
       max_qp:                         131008
       max_qp_wr:                      16351
       device_cap_flags:               0x001c1c66
       max_sge:                        32
       max_sge_rd:                     0
       max_cq:                         65408
       max_cqe:                        4194303
       max_mr:                         131056
       max_pd:                         32764
       max_qp_rd_atom:                 16
       max_ee_rd_atom:                 0
       max_res_rd_atom:                2096128
       max_qp_init_rd_atom:            128
       max_ee_init_rd_atom:            0
       atomic_cap:                     ATOMIC_HCA (1)
       max_ee:                         0
       max_rdd:                        0
       max_mw:                         0
       max_raw_ipv6_qp:                0
       max_raw_ethy_qp:                0
       max_mcast_grp:                  8192
       max_mcast_qp_attach:            56
       max_total_mcast_qp_attach:      458752
       max_ah:                         0
       max_fmr:                        0
       max_srq:                        65472
       max_srq_wr:                     16383
       max_srq_sge:                    31
       max_pkeys:                      128
       local_ca_ack_delay:             15
               port:   1
                       state:                  PORT_ACTIVE (4)
                       max_mtu:                2048 (4)
                       active_mtu:             2048 (4)
                       sm_lid:                 41
                       port_lid:               70
                       port_lmc:               0x00
                       max_msg_sz:             0x40000000
                       port_cap_flags:         0x02510868
                       max_vl_num:             8 (4)
                       bad_pkey_cntr:          0x0
                       qkey_viol_cntr:         0x0
                       sm_sl:                  0
                       pkey_tbl_len:           128
                       gid_tbl_len:            128
                       subnet_timeout:         18
                       init_type_reply:        0
                       active_width:           4X (2)
                       active_speed:           5.0 Gbps (2)
                       phys_state:             LINK_UP (5)

                       GID[ 0]:                fe80:0000:0000:0000:0018:8b90:97fe:1b6e


               port:   2
                       state:                  PORT_DOWN (1)
                       max_mtu:                2048 (4)
                       active_mtu:             2048 (4)
                       sm_lid:                 0
                       port_lid:               0
                       port_lmc:               0x00
                       max_msg_sz:             0x40000000
                       port_cap_flags:         0x02510868
                       max_vl_num:             8 (4)
                       bad_pkey_cntr:          0x0
                       qkey_viol_cntr:         0x0
                       sm_sl:                  0
                       pkey_tbl_len:           128
                       gid_tbl_len:            128
                       subnet_timeout:         0
                       init_type_reply:        0
                       active_width:           4X (2)
                       active_speed:           2.5 Gbps (1)
                       phys_state:             POLLING (2)

                       GID[ 0]:                fe80:0000:0000:0000:0018:8b90:97fe:1b6f

Do you have the same HCA adapter type on all of your machines?
In the error log I see an mlx4 error message, and mlx4 is the ConnectX driver,
but ibv_devinfo shows an older HCA.
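
One way to answer the "same HCA on all machines" question is to compare a few identifying fields from each node's ibv_devinfo output. The following is only a minimal sketch in Python, assuming the output of each node has already been saved as text; the helper names (parse_devinfo, mismatched_fields) and the second node's values are made up for illustration and are not from this thread.

```python
import re

# Fields that identify the adapter model and firmware, named as ibv_devinfo prints them.
ID_FIELDS = ("hca_id", "fw_ver", "vendor_id", "vendor_part_id", "hw_ver", "board_id")


def parse_devinfo(text):
    """Return {field: value} for the identifying fields of the first HCA in the text."""
    info = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\w+):\s+(\S.*)$", line)
        # Keep only the first occurrence so a second HCA does not overwrite the first.
        if m and m.group(1) in ID_FIELDS and m.group(1) not in info:
            info[m.group(1)] = m.group(2).strip()
    return info


def mismatched_fields(per_node):
    """Given {node: parsed info}, return the fields whose values differ between nodes."""
    diffs = {}
    for field in ID_FIELDS:
        values = {node: info.get(field) for node, info in per_node.items()}
        if len(set(values.values())) > 1:
            diffs[field] = values
    return diffs


if __name__ == "__main__":
    # Values copied from the compute-node output in this post.
    compute = parse_devinfo(
        "hca_id: mlx4_0\n"
        "       fw_ver:                         2.3.000\n"
        "       vendor_part_id:                 25418\n"
    )
    # Hypothetical second node with a different firmware, for illustration only.
    other = dict(compute, fw_ver="0.0.000")
    for field, values in mismatched_fields({"compute": compute, "other": other}).items():
        print(field, values)
```

An empty result from mismatched_fields means the compared fields agree on every node; it says nothing about fields that are not in ID_FIELDS.
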
Jeff,
Can you please provide more information about your HCA type (ibv_devinfo -v)? Do you see this error immediately during startup, or do you get it during your run?
Thanks,
Pasha

Jeff Layton wrote:
Evening everyone,
I'm running a CFD code on IB and I've encountered an error I'm not sure about and I'm looking for some guidance on where to start looking. Here's the error:
mlx4: local QP operation err (QPN 260092, WQE index 9a9e0000, vendor syndrome 6f, opcode = 5e)
[0,1,6][btl_openib_component.c:1392:btl_openib_component_progress] from compute-2-0.local to: compute-2-0.local error polling HP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 37742320 opcode 0
mpirun noticed that job rank 0 with PID 21220 on node compute-2-0.local exited on signal 15 (Terminated).
78 additional processes aborted (not shown)
This is openmpi-1.2.9rc2 (sorry - need to upgrade to 1.3.0). The code works correctly for smaller cases, but when I run larger cases I get this error.
I'm heading to bed but I'll check email tomorrow (so to sleep and run but it's been a long day).
TIA!

Jeff
------------------------------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
