Hi Igor,

I came across something that looks very similar to this. You can check out
what I found here:
https://groups.google.com/d/topic/slurm-devel/4q1-1GYE28U/discussion
I posted a fix to the MPICH folks not too long ago for the Hydra side of
things. What resolved it for us was setting the following environment
variable:

bash:
export HYDRA_LAUNCHER_EXTRA_ARGS="--input none"

tcsh/csh:
setenv HYDRA_LAUNCHER_EXTRA_ARGS "--input none"

The only problem is that you have to be careful if you set this globally,
because it can throw off interactive runs. There's also a SPANK plugin I
wrote that should help with this. I'll post it to GitHub shortly and send
the link.
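
One way to keep the setting away from interactive runs is to export it only
when stdin is not a terminal. This is a minimal sketch, not a tested
recipe: the function name set_hydra_noinput is hypothetical, and it assumes
that batch jobs at your site run without a terminal on stdin while
interactive sessions have one. Check that assumption before relying on it.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm or MPICH): export the workaround
# only when stdin is not a terminal, which is assumed here to mean a
# non-interactive (batch) run.
set_hydra_noinput() {
    if [ ! -t 0 ]; then
        export HYDRA_LAUNCHER_EXTRA_ARGS="--input none"
    fi
}

set_hydra_noinput
```

Calling this near the top of a batch script, before mpiexec, keeps the
variable scoped to that job instead of every login shell.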

-Aaron

On Tue, Aug 11, 2015 at 1:57 AM, Igor Chebotar <[email protected]>
wrote:

>
> Hi all,
>
> We are experiencing a problem with a few applications that use MPI.
>
> When we execute the job, the processes appear to run well without any
> errors, but when the job ends we always get errors like these in the
> Slurm output:
>
> Slurm output file:
>
> ===================================================================================
>
> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>
> =   PID 25147 RUNNING AT bee002
>
> =   EXIT CODE: 2
>
> =   CLEANING UP REMAINING PROCESSES
>
> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>
>
> ===================================================================================
>
> [proxy:0:0@bee001] HYD_pmcd_pmip_control_cmd_cb
> (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
>
> [proxy:0:0@bee001] HYDT_dmxu_poll_wait_for_event
> (tools/demux/demux_poll.c:76): callback returned error status
>
> [proxy:0:0@bee001] main (pm/pmiserv/pmip.c:206): demux engine error
> waiting for event
>
> srun: error: bee001: task 0: Exited with exit code 7
>
> srun: error: _server_read: fd 19 got error or unexpected eof reading header
>
> srun: error: step_launch_notify_io_failure: aborting, io error with
> slurmstepd on node 2
>
> [mpiexec@bee001] HYDT_bscu_wait_for_completion
> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
> badly; aborting
>
> [mpiexec@bee001] HYDT_bsci_wait_for_completion
> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
> completion
>
> [mpiexec@bee001] HYD_pmci_wait_for_completion
> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for
> completion
>
> [mpiexec@bee001] main (ui/mpich/mpiexec.c:344): process manager error
> waiting for completion
>
>
> Slurm's Email Title upon failure:
> SLURM Job_id=243524 Name=cloudy Failed, Run time 1-14:46:11, FAILED,
> ExitCode 255
>
>
> The output of the job itself looks fine, which suggests that everything
> runs correctly and the problem is only in the job's termination phase.
>
> Information about our system:
>
> Applications that have this problem: FLASH, CLOUDY
> MPI versions we ran those applications with: openmpi 1.8.4,
> openmpi 1.8.5, mpich 3.1.3
> Slurm version: we tried running the applications on both slurm 14.11.8
> and 14.11.3
> OS: CentOS 6.5
>
> Is anyone familiar with this kind of problem? How can we solve it?
>
> Any information regarding the issue could help us a lot.
>
>
> Thanks!
