To help figure out what is going on, please send the following (to
 the list, not to me!):
 
   - Your Slurm configuration file (with private data like IP
     addresses and node names removed)
   - Your ./configure command lines for
       - Slurm
       - Mpich
       - OpenMPI
   - The command(s) that you use to submit the job
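
One way to gather most of this is a small shell script along the lines of the sketch below. It is only an illustration: the helper names `configure_line` and `run_if_present` are made up here, and it assumes the build directories (passed as arguments) still contain the `config.log` that autoconf-generated configure scripts write.

```shell
#!/bin/sh
# Hypothetical helper: collect build and configuration details for a report.
# Usage (paths are placeholders): sh collect.sh /path/to/slurm-build /path/to/mpich-build

# Print the ./configure invocation that autoconf records near the top of
# config.log (a line of the form "  $ ./configure ...").
configure_line() {
    if [ -f "$1/config.log" ]; then
        grep -m 1 '^  \$ ' "$1/config.log"
    else
        echo "no config.log found in $1"
    fi
}

# Run a command only if it is installed, so the script works on any node.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "$1: not found in PATH"
    fi
}

# One configure line per build directory given on the command line.
for dir in "$@"; do
    configure_line "$dir"
done

run_if_present scontrol show config   # running Slurm configuration
run_if_present srun --mpi=list        # MPI plugin types this Slurm offers
run_if_present mpichversion           # MPICH version and build options
run_if_present ompi_info              # Open MPI build details
```

The `scontrol show config` output includes host names and addresses, so scrub it before posting.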
 Andy
 
 On 08/11/2015 01:57 AM, Igor Chebotar wrote:
Hi all,

We are experiencing a problem with a few applications that use MPI.

When we run a job, the processes appear to run without any errors, but 
when the job ends we always get errors like these in the Slurm output:

Slurm output file:
===================================================================================

=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

=   PID 25147 RUNNING AT bee002

=   EXIT CODE: 2

=   CLEANING UP REMAINING PROCESSES

=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================

[proxy:0:0@bee001] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): 
assert (!closed) failed

[proxy:0:0@bee001] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): 
callback returned error status

[proxy:0:0@bee001] main (pm/pmiserv/pmip.c:206): demux engine error waiting for 
event

srun: error: bee001: task 0: Exited with exit code 7

srun: error: _server_read: fd 19 got error or unexpected eof reading header

srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd 
on node 2

[mpiexec@bee001] HYDT_bscu_wait_for_completion 
(tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; 
aborting

[mpiexec@bee001] HYDT_bsci_wait_for_completion 
(tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for 
completion

[mpiexec@bee001] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): 
launcher returned error waiting for completion

[mpiexec@bee001] main (ui/mpich/mpiexec.c:344): process manager error waiting 
for completion

Subject of Slurm's email upon failure:
SLURM Job_id=243524 Name=cloudy Failed, Run time 1-14:46:11, FAILED, ExitCode 255

The output of the job itself looks fine, so everything seems to run 
correctly and the problem appears only when the job is ending.

Information about our system:

Applications that have this problem: FLASH, CLOUDY
MPI versions we ran those applications with: openmpi 1.8.4, openmpi 1.8.5, 
mpich 3.1.3
Slurm version: we tried running the applications on both slurm 14.11.8 and 
14.11.3
OS: CentOS 6.5

Is anyone familiar with this kind of problem? How can we solve it?

Any information regarding the issue would help us a lot.
Thanks!
