We are having quite a bit of trouble reliably launching larger jobs (1920 nodes, 1 ppn) with OMPI (1.1.2rc4 with gcc) at the moment. The launches usually either just hang or fail with output like:
Cbench numprocs: 1920
Cbench numnodes: 1921
Cbench ppn: 1
Cbench jobname: xhpl-1ppn-1920
Cbench joblaunchmethod: openmpi
tcp_puts: error! out of space in buffer and cannot commit message (bufsize=262144, buflen=261801, ct=450)
[cn1023:02832] pls:tm: start_procs returned error -1
[cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at line 186
[cn1023:02832] [0,0,0] ORTE_ERROR_LOG: Error in file rmgr_urm.c at line 490
[cn1023:02832] orterun: spawn failed with errno=-1
[dn622:00631] [0,0,43]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn583:00606] [0,0,7]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn584:00606] [0,0,8]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn585:00604] [0,0,9]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn591:00606] [0,0,15]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn592:00604] [0,0,16]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn582:00607] [0,0,6]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn588:00605] [0,0,12]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104
[dn590:00606] [0,0,14]-[0,0,0] mca_oob_tcp_msg_recv: readv failed with errno=104

The OMPI environment parameters we are using are:

%env | grep OMPI
OMPI_MCA_oob_tcp_include=eth0
OMPI_MCA_oob_tcp_listen_mode=listen_thread
OMPI_MCA_btl_openib_ib_timeout=18
OMPI_MCA_oob_tcp_listen_thread_max_time=100
OMPI_MCA_oob_tcp_listen_thread_max_queue=100
OMPI_MCA_btl_tcp_if_include=eth0
OMPI_MCA_btl_openib_ib_retry_count=15
OMPI_MCA_btl_openib_ib_cq_size=65536
OMPI_MCA_rmaps_base_schedule_policy=node

I have attached the full output generated with the following OMPI params:

export OMPI_MCA_pls_tm_debug=1
export OMPI_MCA_pls_tm_verbose=1

We are running Torque 2.1.2. I'm mostly suspicious of the tcp_puts error and the 262144 bufsize limit... Any ideas?

Thanks.
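As a rough sanity check on the numbers in the tcp_puts message, here is a minimal sketch that pulls out the three fields and computes how much room was left in the buffer. It assumes that ct is the number of bytes tcp_puts was trying to append, which is only a guess from the wording of the message and has not been checked against the Torque source. The helper names parse_tcp_puts and headroom are made up for illustration.

```python
import re

# The error line exactly as it appears in the job output.
LINE = ("tcp_puts: error! out of space in buffer and cannot commit message "
        "(bufsize=262144, buflen=261801, ct=450)")


def parse_tcp_puts(line):
    """Return the name=value integer fields from a tcp_puts error line."""
    return {name: int(value) for name, value in re.findall(r"(\w+)=(\d+)", line)}


def headroom(fields):
    """Return (bytes still free in the buffer, bytes by which ct exceeds that).

    Assumption: ct is the size of the chunk being appended.
    """
    free = fields["bufsize"] - fields["buflen"]
    return free, fields["ct"] - free


fields = parse_tcp_puts(LINE)
free, shortfall = headroom(fields)
print(f"free={free} bytes, request={fields['ct']} bytes, short by {shortfall} bytes")
```

Under that assumption, the buffer had 343 bytes free against a 450-byte request, so it was 107 bytes short. That would be consistent with a fixed 256 KiB buffer simply filling up at this node count, but it is not proof of it.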
xhpl-1ppn-1920..o127407
xhpl-1ppn-1920..e127407