Random thought: would it be easy for the output of cat /dev/knem to indicate whether IOAT hardware is present?

On Jul 1, 2009, at 5:23 AM, Jose Gracia wrote:

Dear all,

I have problems running large jobs on a PC cluster with Open MPI v1.3.
Typically the error appears only for processor counts >= 2048 (cores,
actually), though sometimes also below.

The nodes (Intel Nehalem, 2 processors with 4 cores each) run (Scientific?) Linux.
$> uname -a
Linux cl3fr1 2.6.18-128.1.10.el5 #1 SMP Thu May 7 12:48:13 EDT 2009
x86_64 x86_64 x86_64 GNU/Linux

The code starts normally, reads its input data sets (~4 GB), does some
initialization, and then continues with the actual calculations. Some time
after that, it fails with the following error message:

[n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory

Memory usage by the application should not be the problem: at this process
count the code uses only ~100 MB per process, and it also runs fine at lower
process counts where it consumes more memory.
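A back-of-the-envelope count suggests the "Cannot allocate memory" may be a queue-pair (QP) limit on the HCA rather than host memory. The sketch below assumes, hypothetically, four QPs per peer connection (Open MPI 1.3's default btl_openib_receive_queues defines four queue specifications); the real number depends on the build and MCA settings:

```shell
# Rough per-node QP demand for an all-to-all connected openib job.
# qps_per_connection=4 is an ASSUMPTION based on Open MPI 1.3's default
# btl_openib_receive_queues having four queue entries; verify locally.
ranks_per_node=8
total_ranks=2048
qps_per_connection=4
peers=$((total_ranks - 1))                                # each rank connects to every other rank
qps_per_node=$((ranks_per_node * peers * qps_per_connection))
echo "QPs needed per node: $qps_per_node"
```

Under those assumptions this comes out to 65504 QPs per node, suspiciously close to 2^16 = 65536, a plausible device/firmware QP limit — which would explain why the failure only shows up around 2048 ranks.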


I also get what appear to be secondary error messages:

[n100501:14587] [[40339,0],0]-[[40339,1],4] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)


The cluster uses an InfiniBand interconnect. I am aware only of the
following (system-wide) parameter changes:
btl_openib_ib_min_rnr_timer = 25
btl_openib_ib_timeout = 20

$> ulimit -l
unlimited
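Since locked memory is unlimited, the HCA's QP capacity is a likelier suspect at this scale. A hedged diagnostic sketch follows — the commands are standard OFED/Open MPI tools, but output field names vary by driver, and the receive_queues value shown is purely illustrative, not a tested recommendation:

```shell
# Inspect the HCA's queue-pair capacity (field name may differ per driver).
ibv_devinfo -v | grep -i max_qp

# See which per-peer queue configuration this Open MPI build uses.
ompi_info --param btl openib | grep -i receive_queues

# One possible mitigation is fewer QPs per connection, e.g. shared receive
# queues only. ILLUSTRATIVE value -- check the Open MPI FAQ before using:
#   mpiexec --mca btl_openib_receive_queues S,65536,256,128,32 ...
```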


I attached:
1) $> ompi_info --all > ompi_info.log
2) stderr from the PBS: stderr.log


Thanks for any help you may give!

Cheers,
Jose


<ompi_info.log.gz>

+ export OMP_NUM_THREADS=1
+ OMP_NUM_THREADS=1
+ module load compiler/intel mpi/openmpi/1.3-intel-11.0
++ /opt/system/modules/3.2.6/Modules/3.2.6/bin/modulecmd bash load compiler/intel mpi/openmpi/1.3-intel-11.0
+ eval LD_LIBRARY_PATH=/opt/mpi/openmpi/1.3-intel-11.0/lib:/usr/local/lib:/opt/compiler/intel//cc/11.0.074/idb/lib/intel64:/opt/compiler/intel//fc/11.0.074/lib/intel64:/opt/compiler/intel//cc/11.0.074/lib/intel64 ';export' 'LD_LIBRARY_PATH;LOADEDMODULES=system/maui/3.2.6p21:compiler/intel/11.0:mpi/openmpi/1.3-intel-11.0' ';export' 'LOADEDMODULES;MANPATH=/usr/local/man::/opt/system/modules/default/man:/opt/compiler/intel//cc/11.0.074/man:/opt/compiler/intel//fc/11.0.074/man:/opt/mpi/openmpi/1.3-intel-11.0/man' ';export' 'MANPATH;MPIDIR=/opt/mpi/openmpi/1.3-intel-11.0' ';export' 'MPIDIR;MPI_BIN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/bin' ';export' 'MPI_BIN_DIR;MPI_INC_DIR=/opt/mpi/openmpi/1.3-intel-11.0/include' ';export' 'MPI_INC_DIR;MPI_LIB_DIR=/opt/mpi/openmpi/1.3-intel-11.0/lib' ';export' 'MPI_LIB_DIR;MPI_MAN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/man' ';export' 'MPI_MAN_DIR;MPI_VERSION=1.3-intel-11.0' ';export' 'MPI_VERSION;NLSPATH=/opt/compiler/intel//cc/11.0.074/idb/intel64/locale/%l_%t/%N' ';export' 'NLSPATH;PATH=/opt/mpi/openmpi/1.3-intel-11.0/bin:/opt/compiler/intel//fc/11.0.074/bin/intel64:/opt/compiler/intel//java/jre1.6.0_14/bin:/opt/compiler/intel//cc/11.0.074/bin/intel64:/nfs/home4/HLRS/hlrs/hpcjgrac/bin:/usr/local/bin:/usr/lib64/qt-3.3/bin:/opt/system/maui/3.2.6p21/bin:/usr/kerberos/bin:/bin:/usr/bin' ';export' 'PATH;_LMFILES_=/opt/system/modulefiles/system/maui/3.2.6p21:/opt/modulefiles/compiler/intel/11.0:/opt/modulefiles/mpi/openmpi/1.3-intel-11.0' ';export' '_LMFILES_;'
++ LD_LIBRARY_PATH=/opt/mpi/openmpi/1.3-intel-11.0/lib:/usr/local/lib:/opt/compiler/intel//cc/11.0.074/idb/lib/intel64:/opt/compiler/intel//fc/11.0.074/lib/intel64:/opt/compiler/intel//cc/11.0.074/lib/intel64
++ export LD_LIBRARY_PATH
++ LOADEDMODULES=system/maui/3.2.6p21:compiler/intel/11.0:mpi/openmpi/1.3-intel-11.0
++ export LOADEDMODULES
++ MANPATH=/usr/local/man::/opt/system/modules/default/man:/opt/compiler/intel//cc/11.0.074/man:/opt/compiler/intel//fc/11.0.074/man:/opt/mpi/openmpi/1.3-intel-11.0/man
++ export MANPATH
++ MPIDIR=/opt/mpi/openmpi/1.3-intel-11.0
++ export MPIDIR
++ MPI_BIN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/bin
++ export MPI_BIN_DIR
++ MPI_INC_DIR=/opt/mpi/openmpi/1.3-intel-11.0/include
++ export MPI_INC_DIR
++ MPI_LIB_DIR=/opt/mpi/openmpi/1.3-intel-11.0/lib
++ export MPI_LIB_DIR
++ MPI_MAN_DIR=/opt/mpi/openmpi/1.3-intel-11.0/man
++ export MPI_MAN_DIR
++ MPI_VERSION=1.3-intel-11.0
++ export MPI_VERSION
++ NLSPATH=/opt/compiler/intel//cc/11.0.074/idb/intel64/locale/%l_%t/%N
++ export NLSPATH
++ PATH=/opt/mpi/openmpi/1.3-intel-11.0/bin:/opt/compiler/intel//fc/11.0.074/bin/intel64:/opt/compiler/intel//java/jre1.6.0_14/bin:/opt/compiler/intel//cc/11.0.074/bin/intel64:/nfs/home4/HLRS/hlrs/hpcjgrac/bin:/usr/local/bin:/usr/lib64/qt-3.3/bin:/opt/system/maui/3.2.6p21/bin:/usr/kerberos/bin:/bin:/usr/bin
++ export PATH
++ _LMFILES_=/opt/system/modulefiles/system/maui/3.2.6p21:/opt/modulefiles/compiler/intel/11.0:/opt/modulefiles/mpi/openmpi/1.3-intel-11.0
++ export _LMFILES_
+ module list
++ /opt/system/modules/3.2.6/Modules/3.2.6/bin/modulecmd bash list
Currently Loaded Modulefiles:
 1) system/maui/3.2.6p21         3) mpi/openmpi/1.3-intel-11.0
 2) compiler/intel/11.0
+ eval
+ cd /nfs/nas/homeB/home4/HLRS/hlrs/hpcjgrac/prace/benchmark/applications/gadget/tmp/GADGET_NEHALEM-HLRS_StrongScaling_2048_i000083/n256p8t1_t001_i01
++ date
+ echo '<jobstart at="Fri Jun 19 09:50:05 CEST 2009" />'
+ mpiexec time /nfs/nas/homeB/home4/HLRS/hlrs/hpcjgrac/prace/benchmark/applications/gadget/tmp/GADGET_NEHALEM-HLRS_StrongScaling_2048_i000083/n256p8t1_t001_i01/GADGET_NEHALEM-HLRS_cname_NEHALEM-HLRS.exe param.txt
[n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n100501][[40339,1],5][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n100501][[40339,1],5][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n100501][[40339,1],1][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n100501][[40339,1],1][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n100501][[40339,1],2][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n100501][[40339,1],2][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n100501][[40339,1],2][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n100501][[40339,1],3][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n100501][[40339,1],3][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n100501][[40339,1],4][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n100501][[40339,1],4][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n100501][[40339,1],7][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n100501][[40339,1],7][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n100501:14587] [[40339,0],0]-[[40339,1],4] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n100501:14587] [[40339,0],0]-[[40339,1],7] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n100501:14587] [[40339,0],0]-[[40339,1],6] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n100501:14587] [[40339,0],0]-[[40339,1],5] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n100501:14587] [[40339,0],0]-[[40339,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n100501:14587] [[40339,0],0]-[[40339,1],2] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n100501:14587] [[40339,0],0]-[[40339,1],3] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n033201][[40339,1],1551][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1551][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n033201][[40339,1],1547][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033201:3588] *** An error occurred in MPI_Sendrecv
[n033201:3588] *** on communicator MPI_COMM_WORLD
[n033201:3588] *** MPI_ERR_OTHER: known error not in list
[n033201:3588] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[n033102][[40339,1],1538][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033102][[40339,1],1543][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1549][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1545][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1545][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n033102][[40339,1],1540][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033102][[40339,1],1540][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n033102][[40339,1],1541][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033102][[40339,1],1536][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1544][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1550][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1550][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n033201][[40339,1],1548][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1548][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n033201][[40339,1],1546][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033201][[40339,1],1546][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n033202][[40339,1],1553][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033202][[40339,1],1555][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033202][[40339,1],1555][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n033202][[40339,1],1556][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033202][[40339,1],1552][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033202][[40339,1],1552][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:804:rml_recv_cb] error in endpoint reply start connect
[n033202][[40339,1],1558][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033202][[40339,1],1559][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033202][[40339,1],1557][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:459:qp_create_one] error creating qp errno says Cannot allocate memory
[n033201:03576] [[40339,0],193]-[[40339,1],1544] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n033102:03498] [[40339,0],192]-[[40339,1],1538] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n033102:03498] [[40339,0],192]-[[40339,1],1543] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n033201:03576] [[40339,0],193]-[[40339,1],1551] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n033102:03498] [[40339,0],192]-[[40339,1],1540] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n033201:03576] [[40339,0],193]-[[40339,1],1549] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n033202:03719] [[40339,0],194]-[[40339,1],1555] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n033202:03719] [[40339,0],194]-[[40339,1],1552] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
Command exited with non-zero status 16
64.36user 3.48system 1:20.39elapsed 84%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (7major+125286minor)pagefaults 0swaps
--------------------------------------------------------------------------
mpiexec has exited due to process rank 1538 with PID 3501 on
node n033102 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
[n100501:14587] 11 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[n100501:14587] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
++ date
+ echo '<jobend at="Fri Jun 19 09:51:27 CEST 2009" />'
<ATT3807088.txt>


--
Jeff Squyres
Cisco Systems
