Hi Igor,

I stumbled across something that looks really similar to this. You can check out what I found here: https://groups.google.com/d/topic/slurm-devel/4q1-1GYE28U/discussion. I posted a fix to the MPICH folks not too long ago for the Hydra side of things. A resolution for us was to set the following environment variable:

bash:
    export HYDRA_LAUNCHER_EXTRA_ARGS="--input none"

tcsh/csh:
    setenv HYDRA_LAUNCHER_EXTRA_ARGS "--input none"

The only problem is that you have to be careful if you set this globally: it can throw off interactive runs.

There's also a SPANK plugin I wrote that should help with this. I'll post it to GitHub in a few and post the link.

-Aaron

On Tue, Aug 11, 2015 at 1:57 AM, Igor Chebotar <[email protected]> wrote:
>
> Hi all,
>
> We are issuing a problem with a few applications that using MPI.
>
> When we executing the job, the processes looks like running well without any errors, but when the job is ending in the Slurm output we always get errors like:
>
> Slurm output file:
>
> ===================================================================================
> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
> = PID 25147 RUNNING AT bee002
> = EXIT CODE: 2
> = CLEANING UP REMAINING PROCESSES
> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
> ===================================================================================
> [proxy:0:0@bee001] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
> [proxy:0:0@bee001] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0@bee001] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
> srun: error: bee001: task 0: Exited with exit code 7
> srun: error: _server_read: fd 19 got error or unexpected eof reading header
> srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 2
> [mpiexec@bee001] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> [mpiexec@bee001] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec@bee001] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
> [mpiexec@bee001] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
>
> Slurm's Email Title upon failure:
> SLURM Job_id=243524 Name=cloudy Failed, Run time 1-14:46:11, FAILED, ExitCode 255
>
> The output of the job itself looks fine, so it means that everything running fine and the problem is only in the job ending process.
>
> Information about our system:
>
> Applications that has that problem: FLASH, CLOUDY
> MPI versions that we was running those applications: openmpi 1.8.4, openmpi 1.8.5, mpich 3.1.3
> Slurm version: tried to run the applications both on slurm 14.11.8 and 14.11.3
> OS: CentOS 6.5
>
> Does anyone is familiar with that kind of problem? How can we solve it?
>
> Any information regarding the issue could help us a lot.
>
> Thanks!
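
A note on the caveat in the reply about setting HYDRA_LAUNCHER_EXTRA_ARGS globally: one possible way to limit its reach is to set it for a single command only, so that interactive shells and other runs are left alone. The sketch below is only an illustration under that assumption. The function name with_hydra_input_none and the application name in the usage comment are hypothetical and do not come from the thread.

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): run one command with
# HYDRA_LAUNCHER_EXTRA_ARGS set for that command only. The calling
# shell's own environment is left untouched.
with_hydra_input_none() {
    env HYDRA_LAUNCHER_EXTRA_ARGS="--input none" "$@"
}

# Possible use inside a batch script (application name is a placeholder):
#   with_hydra_input_none mpiexec ./my_app
```

Because the variable is passed through env, it is visible to the launched command and its children but is not exported into the shell that calls the helper.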
