Hi all - I've been running a parallel application with Open MPI under SLURM and I'm getting the error messages below. The same application runs fine on another cluster that uses Torque, so I suspect I'm missing some kind of SLURM configuration setting.
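For reference, the failing run is launched with something along these lines (exact path and application arguments omitted; this is only an approximation reconstructed from the log below, i.e. 16 nodes with one task each running our nmf binary):

    srun -N 16 -n 16 ./nmf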
P.S. Simple commands like srun -N 128 hostname work fine. Any help is greatly appreciated. Thanks!

=======================================================
srun: auth plugin for Munge (http://code.google.com/p/munge/) loaded
srun: jobid 8232: nodes(16):`n[0242-0257]', cpu counts: 16(x16)
srun: launching 8232.0 on host n0242, 1 tasks: 0
srun: launching 8232.0 on host n0243, 1 tasks: 1
srun: launching 8232.0 on host n0244, 1 tasks: 2
srun: launching 8232.0 on host n0245, 1 tasks: 3
srun: launching 8232.0 on host n0246, 1 tasks: 4
srun: launching 8232.0 on host n0247, 1 tasks: 5
srun: launching 8232.0 on host n0248, 1 tasks: 6
srun: launching 8232.0 on host n0249, 1 tasks: 7
srun: launching 8232.0 on host n0250, 1 tasks: 8
srun: launching 8232.0 on host n0251, 1 tasks: 9
srun: launching 8232.0 on host n0252, 1 tasks: 10
srun: launching 8232.0 on host n0253, 1 tasks: 11
srun: launching 8232.0 on host n0254, 1 tasks: 12
srun: launching 8232.0 on host n0255, 1 tasks: 13
srun: launching 8232.0 on host n0256, 1 tasks: 14
srun: launching 8232.0 on host n0257, 1 tasks: 15
srun: Node n0244, 1 tasks started
srun: Node n0243, 1 tasks started
srun: Node n0245, 1 tasks started
srun: Node n0242, 1 tasks started
srun: Node n0247, 1 tasks started
srun: Node n0246, 1 tasks started
srun: Node n0249, 1 tasks started
srun: Node n0248, 1 tasks started
srun: Node n0251, 1 tasks started
srun: Node n0250, 1 tasks started
srun: Node n0254, 1 tasks started
srun: Node n0252, 1 tasks started
srun: Node n0256, 1 tasks started
srun: Node n0253, 1 tasks started
srun: Node n0257, 1 tasks started
srun: Node n0255, 1 tasks started
nmf: error: slurm_accept_msg_conn: Interrupted system call
[n0244:26366] [[8232,1],2][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit failed: Operation failed
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_modex failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0244:26366] *** An error occurred in MPI_Init_thread
[n0244:26366] *** on a NULL communicator
[n0244:26366] *** Unknown error
[n0244:26366] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0244
  PID:        26366
--------------------------------------------------------------------------
nmf: error: slurm_accept_msg_conn: Interrupted system call
[n0243:22413] [[8232,1],1][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit failed: Operation failed
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_modex failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0243:22413] *** An error occurred in MPI_Init_thread
[n0243:22413] *** on a NULL communicator
[n0243:22413] *** Unknown error
[n0243:22413] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0243
  PID:        22413
--------------------------------------------------------------------------
nmf: error: slurm_accept_msg_conn: Interrupted system call
[n0245:17194] [[8232,1],3][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit failed: Operation failed
nmf: error: slurm_accept_msg_conn: Interrupted system call
[n0242:32029] [[8232,1],0][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit failed: Operation failed
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_modex failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0245:17194] *** An error occurred in MPI_Init_thread
[n0245:17194] *** on a NULL communicator
[n0245:17194] *** Unknown error
[n0245:17194] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_modex failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0245
  PID:        17194
--------------------------------------------------------------------------
[n0242:32029] *** An error occurred in MPI_Init_thread
[n0242:32029] *** on a NULL communicator
[n0242:32029] *** Unknown error
[n0242:32029] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0242
  PID:        32029
--------------------------------------------------------------------------
nmf: error: slurm_accept_msg_conn: Interrupted system call
[n0247:08603] [[8232,1],5][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit failed: Operation failed
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_modex failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0247:8603] *** An error occurred in MPI_Init_thread
[n0247:8603] *** on a NULL communicator
[n0247:8603] *** Unknown error
[n0247:8603] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0247
  PID:        8603
--------------------------------------------------------------------------
nmf: error: slurm_accept_msg_conn: Interrupted system call
[n0249:01391] [[8232,1],7][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit failed: Operation failed
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_modex failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0249:1391] *** An error occurred in MPI_Init_thread
[n0249:1391] *** on a NULL communicator
[n0249:1391] *** Unknown error
[n0249:1391] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
nmf: error: slurm_accept_msg_conn: Interrupted system call
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0249
  PID:        1391
--------------------------------------------------------------------------
[n0246:19634] [[8232,1],4][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit failed: Operation failed
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_modex failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0246:19634] *** An error occurred in MPI_Init_thread
[n0246:19634] *** on a NULL communicator
[n0246:19634] *** Unknown error
[n0246:19634] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0246
  PID:        19634
--------------------------------------------------------------------------
srun: Received task exit notification for 1 task (status=0x0100).
nmf: error: slurm_accept_msg_conn: Interrupted system call
[n0251:20401] [[8232,1],9][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit failed: Operation failed
nmf: error: slurm_accept_msg_conn: Interrupted system call
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_modex failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0251:20401] *** An error occurred in MPI_Init_thread
[n0251:20401] *** on a NULL communicator
[n0251:20401] *** Unknown error
[n0251:20401] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0251
  PID:        20401
--------------------------------------------------------------------------
[n0250:08714] [[8232,1],8][grpcomm_pmi_module.c:398:modex] PMI_KVS_Commit failed: Operation failed
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_modex failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0250:8714] *** An error occurred in MPI_Init_thread
[n0250:8714] *** on a NULL communicator
[n0250:8714] *** Unknown error
[n0250:8714] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0250
  PID:        8714
--------------------------------------------------------------------------
srun: error: n0244: task 2: Exited with exit code 1
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0242: task 0: Exited with exit code 1
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0245: task 3: Exited with exit code 1
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0243: task 1: Exited with exit code 1
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0247: task 5: Exited with exit code 1
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0249: task 7: Exited with exit code 1
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0246: task 4: Exited with exit code 1
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0251: task 9: Exited with exit code 1
srun: Received task exit notification for 1 task (status=0x0100).
srun: Sent KVS info to 16 nodes, up to 1 tasks per node
srun: error: slurm_send_recv_rc_msg_only_one to n0242:46095 : Connection refused
srun: error: slurm_send_recv_rc_msg_only_one to n0243:56307 : Connection refused
srun: error: slurm_send_recv_rc_msg_only_one to n0244:42705 : Connection refused
srun: error: n0250: task 8: Exited with exit code 1
srun: error: slurm_send_recv_rc_msg_only_one to n0245:57746 : Connection refused
srun: error: slurm_send_recv_rc_msg_only_one to n0246:57496 : Connection refused
srun: error: slurm_send_recv_rc_msg_only_one to n0250:50673 : Connection refused
srun: error: slurm_send_recv_rc_msg_only_one to n0249:36371 : Connection refused
srun: error: slurm_send_recv_rc_msg_only_one to n0251:47692 : Connection refused
srun: error: slurm_send_recv_rc_msg_only_one to n0247:57347 : Connection refused
nmf: error: slurm_accept_msg_conn: Interrupted system call
[n0254:18359] [[8232,1],12][grpcomm_pmi_module.c:195:pmi_barrier] PMI_Barrier: Operation failed
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_barrier failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0254:18359] *** An error occurred in MPI_Init_thread
[n0254:18359] *** on a NULL communicator
[n0254:18359] *** Unknown error
[n0254:18359] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0254
  PID:        18359
--------------------------------------------------------------------------
nmf: error: slurm_accept_msg_conn: Interrupted system call
nmf: error: slurm_accept_msg_conn: Interrupted system call
[n0252:05124] [[8232,1],10][grpcomm_pmi_module.c:195:pmi_barrier] PMI_Barrier: Operation failed
[n0256:27363] [[8232,1],14][grpcomm_pmi_module.c:195:pmi_barrier] PMI_Barrier: Operation failed
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_barrier failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0256:27363] *** An error occurred in MPI_Init_thread
[n0256:27363] *** on a NULL communicator
[n0256:27363] *** Unknown error
[n0256:27363] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0256
  PID:        27363
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_barrier failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0252:5124] *** An error occurred in MPI_Init_thread
[n0252:5124] *** on a NULL communicator
[n0252:5124] *** Unknown error
[n0252:5124] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0252
  PID:        5124
--------------------------------------------------------------------------
nmf: error: slurm_accept_msg_conn: Interrupted system call
[n0253:00668] [[8232,1],11][grpcomm_pmi_module.c:195:pmi_barrier] PMI_Barrier: Operation failed
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_barrier failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0253:668] *** An error occurred in MPI_Init_thread
[n0253:668] *** on a NULL communicator
[n0253:668] *** Unknown error
[n0253:668] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0253
  PID:        668
--------------------------------------------------------------------------
nmf: error: slurm_accept_msg_conn: Interrupted system call
[n0257:26458] [[8232,1],15][grpcomm_pmi_module.c:195:pmi_barrier] PMI_Barrier: Operation failed
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_barrier failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0257:26458] *** An error occurred in MPI_Init_thread
[n0257:26458] *** on a NULL communicator
[n0257:26458] *** Unknown error
[n0257:26458] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0257
  PID:        26458
--------------------------------------------------------------------------
nmf: error: slurm_accept_msg_conn: Interrupted system call
[n0255:32396] [[8232,1],13][grpcomm_pmi_module.c:195:pmi_barrier] PMI_Barrier: Operation failed
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_barrier failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0255:32396] *** An error occurred in MPI_Init_thread
[n0255:32396] *** on a NULL communicator
[n0255:32396] *** Unknown error
[n0255:32396] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0255
  PID:        32396
--------------------------------------------------------------------------
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0252: task 10: Exited with exit code 1
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0256: task 14: Exited with exit code 1
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0253: task 11: Exited with exit code 1
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0254: task 12: Exited with exit code 1
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0257: task 15: Exited with exit code 1
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0255: task 13: Exited with exit code 1
nmf: error: slurm_accept_msg_conn: Interrupted system call
[n0248:07754] [[8232,1],6][grpcomm_pmi_module.c:195:pmi_barrier] PMI_Barrier: Operation failed
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort.
There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems.
This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):
  orte_grpcomm_barrier failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n0248:7754] *** An error occurred in MPI_Init_thread
[n0248:7754] *** on a NULL communicator
[n0248:7754] *** Unknown error
[n0248:7754] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly.
You should double check that everything has shut down cleanly.
  Reason:     Before MPI_INIT completed
  Local host: n0248
  PID:        7754
--------------------------------------------------------------------------
srun: Received task exit notification for 1 task (status=0x0100).
srun: error: n0248: task 6: Exited with exit code 1

--
Abraços \ Regards \ Saludos
-----------------------------
Fabricio Silva Kyt
