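errno 24 means "Too many open files" (EMFILE), so it looks like the process calling accept() on rhea32 is hitting its limit on open file descriptors. EMFILE is the per-process limit, so check "ulimit -n" first in the shell that launches mpirun and raise it if it is low. Also check /proc/sys/fs/file-max, the system-wide limit, and see if you need to bump it up too, although running out there normally shows up as errno 23 (ENFILE) rather than 24. The errno=104 (connection reset by peer) messages from the other nodes are probably a side effect of the failed accepts.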
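
If it helps, here is a minimal Python sketch for checking both limits on the node that reports errno 24. It is only an illustration, not anything from Open MPI: the function names are made up for this example, it assumes Linux (for the /proc path and the errno numbering), and it reports the limits of the shell it is run from, which may differ from what the mpirun process actually inherits.

```python
import errno
import os
import resource


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def nofile_limits():
    """Return the (soft, hard) per-process open-file limits (what 'ulimit -n' governs)."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def system_file_max(path="/proc/sys/fs/file-max"):
    """Return the system-wide file handle limit, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # The two errno values that appear in the log output.
    for code in (24, 104):
        name, msg = describe_errno(code)
        print(f"errno {code}: {name} ({msg})")

    soft, hard = nofile_limits()
    print(f"per-process open files: soft={soft} hard={hard}")
    print(f"system-wide file-max: {system_file_max()}")
```

If the soft limit comes out well below the number of connections one process has to hold open for a 500 node job, that would be consistent with accept() failing at this scale while the 50 node runs are fine.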

-Aleph

On 10/14/06, Adam Moody <mood...@llnl.gov> wrote:
Hello,

I'm trying to run a 500 node job using mpirun / slurm with OpenMPI-1.1.1 and see the following errors at startup:

[rhea342:09444] [0,1,318]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104
[rhea32:13463] mca_oob_tcp_accept: accept() failed with errno 24.
[rhea32:13463] mca_oob_tcp_accept: accept() failed with errno 24.
[rhea326:09641] [0,1,302]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104
...

I'm starting the job with the following commands:

srun -N 500 -A
mpirun -np 500 -bynode hello_mpi

Smaller jobs around 50 nodes run just fine. Any ideas?

Thanks,
-Adam Moody
LLNL