To help figure out what is going on, please send the following (to
the list, not to me!):
* Your Slurm configuration file (with private data like IP
  addresses and node names removed)
* Your ./configure command lines for
  * Slurm
  * Mpich
  * OpenMPI
* The command(s) that you use to submit the job
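
For stripping private data from the configuration file before posting, something like the following may help. It is only a minimal sketch in Python: the function names are made up for illustration, and the list of slurm.conf keys treated as host or node names is an assumption that is almost certainly incomplete for any given site. Please review the output by hand before sending it.

```python
import re
import sys

# Rough IPv4 dotted-quad pattern; it does not validate octet ranges.
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

# slurm.conf keys assumed (not exhaustively) to carry host or node names.
HOST_KEYS = ("ControlMachine", "ControlAddr", "BackupController",
             "BackupAddr", "NodeName", "NodeAddr", "NodeHostname", "Nodes")
HOST_VALUE = re.compile(r"\b(" + "|".join(HOST_KEYS) + r")=(\S+)",
                        re.IGNORECASE)


def redact_line(line):
    """Mask IPv4 addresses and host-name values in one config line."""
    line = IPV4.sub("x.x.x.x", line)
    # Keep the key so the structure of the config stays readable.
    return HOST_VALUE.sub(lambda m: m.group(1) + "=REDACTED", line)


def redact_config(text):
    """Apply redact_line to every line of a slurm.conf-style text."""
    return "\n".join(redact_line(line) for line in text.splitlines())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python redact_conf.py /path/to/slurm.conf
    with open(sys.argv[1]) as fh:
        print(redact_config(fh.read()))
```

Every other setting (plugin types, timeouts, and so on) is left untouched, since those are the values that matter for diagnosing the problem.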
Andy
On 08/11/2015 01:57 AM, Igor Chebotar wrote:
Hi all,
We are experiencing a problem with a few applications that use MPI.
When we run a job, the processes appear to run well without any
errors, but when the job ends we always get errors in the Slurm output
like:
Slurm output file:
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 25147 RUNNING AT bee002
= EXIT CODE: 2
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@bee001] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885):
assert (!closed) failed
[proxy:0:0@bee001] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76):
callback returned error status
[proxy:0:0@bee001] main (pm/pmiserv/pmip.c:206): demux engine error waiting for
event
srun: error: bee001: task 0: Exited with exit code 7
srun: error: _server_read: fd 19 got error or unexpected eof reading header
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd
on node 2
[mpiexec@bee001] HYDT_bscu_wait_for_completion
(tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly;
aborting
[mpiexec@bee001] HYDT_bsci_wait_for_completion
(tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
completion
[mpiexec@bee001] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218):
launcher returned error waiting for completion
[mpiexec@bee001] main (ui/mpich/mpiexec.c:344): process manager error waiting
for completion
Slurm's email subject upon failure:
SLURM Job_id=243524 Name=cloudy Failed, Run time 1-14:46:11, FAILED, ExitCode
255
The output of the job itself looks fine, which suggests that everything
runs correctly and the problem occurs only while the job is ending.
Information about our system:
Applications that have this problem: FLASH, CLOUDY
MPI versions we ran those applications with: openmpi 1.8.4, openmpi
1.8.5, mpich 3.1.3
Slurm versions: we tried running the applications on both slurm 14.11.8 and 14.11.3
OS: CentOS 6.5
Is anyone familiar with this kind of problem? How can we solve it?
Any information regarding the issue could help us a lot.
Thanks!