I've verified that ulimit -l is unlimited everywhere.
After further testing I think the errors are related to OFED, not Open MPI.
I've uninstalled the OFED that comes with SLES (1.4.0) and installed
OFED 1.4.2 and 1.5-beta and I don't get the errors.
I got the idea to swap out OFED after reading this:
http://kerneltrap.org/mailarchive/openfabrics-general/2008/11/3/3903184
Under OFED 1.4.0 (from SLES 11) I had to set options mlx4_core msi_x=0
in /etc/modprobe.conf.local to even get the mlx4 module to load.
I found that advice here:
http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1254161827534+28353475&threadId=1361415
(Under 1.4.2 and 1.5-Beta the modules load fine without mlx4_core
msi_x=0 being set)
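
For reference, the workaround described above amounts to a single line in the modprobe configuration. This is only a sketch of what that entry would look like; the file path and option are the ones named in this message, and the comment text is illustrative.

```
# /etc/modprobe.conf.local
# Workaround used with OFED 1.4.0 on SLES 11: disable MSI-X for mlx4_core
options mlx4_core msi_x=0
```

The option is read when the module is loaded, so the mlx4 modules would need to be reloaded (or the node rebooted) for a change to take effect.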
Now my problem is that with OFED 1.4.2 and 1.5-beta the systems hang,
the GigE network stops working, and I have to power cycle nodes to log in.
I'm going to try to get some help from the OFED mailing list now.
Pavel Shamis (Pasha) wrote:
Very strange. MPI tries to access the CQ context and it gets an immediate error.
Please make sure that your limits configuration is OK; take a look at
this FAQ -
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
Pasha.
Charles Wright wrote:
Hello,
I just got some new cluster hardware :) :(
I can't seem to overcome an openib problem.
I get this at run time:
error polling HP CQ with -2 errno says Success
I've tried 2 different IB switches and multiple sets of nodes all on
one switch or the other to try to eliminate the hardware. (IPoIB
pings work and IB switches ree
I've tried both v1.3.3 and v1.2.9 and get the same errors. I'm not
really sure what these errors mean or how to get rid of them.
My MPI application works if all the CPUs are on the same node (self
btl only, probably).
Any advice would be appreciated. Thanks.
asnrcw@dmc:~> qsub -I -l nodes=32,partition=dmc,feature=qc226 -q sysadm
qsub: waiting for job 232035.mds1.asc.edu to start
qsub: job 232035.mds1.asc.edu ready
####################################################################
# Alabama Supercomputer Center - PBS Prologue
# Your job id is : 232035
# Your job name is : STDIN
# Your job's queue is : sysadm
# Your username for this job is : asnrcw
# Your group for this job is : analyst
# Your job used :
#   8 CPUs on dmc101
#   8 CPUs on dmc102
#   8 CPUs on dmc103
#   8 CPUs on dmc104
# Your job started at : Fri Sep 25 10:20:05 CDT 2009
####################################################################
asnrcw@dmc101:~>
asnrcw@dmc101:~> cd mpiprintrank
asnrcw@dmc101:~/mpiprintrank> which mpirun
/apps/openmpi-1.3.3-intel/bin/mpirun
asnrcw@dmc101:~/mpiprintrank> mpirun ./mpiprintrank-dmc-1.3.3-intel
[dmc103][[46071,1],19][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],16][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],17][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],18][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],20][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],21][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],23][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc101][[46071,1],6][btl_openib_component.c:3047:poll_device]
[dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],7][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc103][[46071,1],22][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],11][btl_openib_component.c:3047:poll_device]
[dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],12][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
error polling HP CQ with -2 errno says Success
[dmc101][[46071,1],3][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc101][[46071,1],4][btl_openib_component.c:3047:poll_device]
[dmc102][[46071,1],8][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc101][[46071,1],0][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],15][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc101][[46071,1],1][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device]
[dmc102][[46071,1],14][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],9][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc101][[46071,1],5][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],13][btl_openib_component.c:3047:poll_device]
[dmc101][[46071,1],2][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
error polling HP CQ with -2 errno says
Success[dmc102][[46071,1],10][btl_openib_component.c:3047:poll_device]
error polling HP CQ with -2 errno says Success
[dmc102][[46071,1],10][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
[dmc102][[46071,1],8][btl_openib_component.c:3047:poll_device] error
polling HP CQ with -2 errno says Success
System info:
Compute nodes:
http://www.supermicro.com/products/system/2U/6026/SYS-6026TT-IBXF.cfm
Which has an integrated Mellanox Technologies MT26418 [ConnectX IB
DDR, PCIe 2.0 5GT/s] (rev a0)
asnrcw@dmc129:~> uname -a
Linux dmc129 2.6.27.29-0.1-default #1 SMP 2009-08-15 17:53:59 +0200
x86_64 x86_64 x86_64 GNU/Linux
asnrcw@dmc129:~> rpm -qa | grep ofed
ofed-doc-1.4.0-11.12
ofed-1.4.0-11.12
asnrcw@dmc129:~> cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 0
asnrcw@dmc129:~>
Subnet manager is running on a Voltaire 9024 DM Switch (firmware
version 5.1.0)
asnrcw@dmc129:~> ibv_devinfo
hca_id: mlx4_0
fw_ver: 2.6.000
node_guid: 0030:48c8:b919:0000
sys_image_guid: 0030:48c8:b919:0003
vendor_id: 0x02c9
vendor_part_id: 26418
hw_ver: 0xA0
board_id: SM_2081000001000
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 139
port_lmc: 0x00
asnrcw@dmc129:~> ulimit -l
unlimited
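
The value above is what an interactive login shell reports. Processes started by a batch system or by mpirun do not necessarily inherit the same limit, so it can be worth running the same check from inside a job. A minimal sketch, assuming a POSIX-style shell such as bash; the helper name memlock_status is hypothetical and not part of any tool mentioned in this thread:

```shell
# Hypothetical helper (not from the original post): classify a value
# as printed by `ulimit -l`.
memlock_status() {
    if [ "$1" = "unlimited" ]; then
        echo "ok"
    else
        echo "limited:$1"
    fi
}

# Report the locked-memory limit seen by the current shell on this host.
echo "$(uname -n): $(memlock_status "$(ulimit -l)")"
```

Saved as a script and launched through the job launcher (for example, mpirun followed by the script path), this would print one line per process, making it easy to spot a node whose limit differs from the interactive value.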
------------------------------------------------------------------------
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Charles Wright, HPC Systems Specialist
Computer Sciences Corporation
High Performance Computing Center of Excellence
http://www.cschpc.com
(256)971-7429
cwrig...@csc.com