Here's the SPANK plugin: https://github.com/aaronknister/slurm-spank-stdin_fd
To use it, just set the environment variable SLURM_SPANK_ENSURE_STDIN to 1. You shouldn't then need the HYDRA_LAUNCHER_EXTRA_ARGS variable.

Disclaimer: I haven't done extensive testing, but it worked very well in the testing I did do.

-Aaron

On Thu, Aug 20, 2015 at 2:36 PM, Aaron Knister <[email protected]> wrote:

> Hi Igor,
>
> I stumbled across something that looks reallllllly similar to this. You
> can check out what I found here:
> https://groups.google.com/d/topic/slurm-devel/4q1-1GYE28U/discussion. I
> posted a fix to the mpich folks not too long ago for the hydra side of
> things. A resolution for us was to set the following environment variable:
>
> bash:
> export HYDRA_LAUNCHER_EXTRA_ARGS="--input none"
>
> tcsh/csh:
> setenv HYDRA_LAUNCHER_EXTRA_ARGS "--input none"
>
> The only problem is you have to be careful if you set this globally. It
> can throw off interactive runs. There's also a SPANK plugin I wrote that
> should help with this. I'll post that to github in a few and post the link.
>
> -Aaron
>
> On Tue, Aug 11, 2015 at 1:57 AM, Igor Chebotar <[email protected]>
> wrote:
>
>>
>> Hi all,
>>
>> We are experiencing a problem with a few applications that use MPI.
>>
>> When we execute the job, the processes appear to run well without
>> any errors, but when the job ends we always get errors like the
>> following in the Slurm output:
>>
>> Slurm output file:
>>
>> ===================================================================================
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = PID 25147 RUNNING AT bee002
>> = EXIT CODE: 2
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> ===================================================================================
>> [proxy:0:0@bee001] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
>> [proxy:0:0@bee001] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
>> [proxy:0:0@bee001] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
>> srun: error: bee001: task 0: Exited with exit code 7
>> srun: error: _server_read: fd 19 got error or unexpected eof reading header
>> srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 2
>> [mpiexec@bee001] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
>> [mpiexec@bee001] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
>> [mpiexec@bee001] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
>> [mpiexec@bee001] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
>>
>> Slurm's email title upon failure:
>> SLURM Job_id=243524 Name=cloudy Failed, Run time 1-14:46:11, FAILED, ExitCode 255
>>
>> The output of the job itself looks fine, which suggests that everything
>> runs correctly and the problem occurs only when the job is ending.
>>
>> Information about our system:
>>
>> Applications that have the problem: FLASH, CLOUDY
>> MPI versions we ran those applications with: openmpi 1.8.4,
>> openmpi 1.8.5, mpich 3.1.3
>> Slurm version: we tried running the applications on both slurm 14.11.8 and
>> 14.11.3
>> OS: CentOS 6.5
>>
>> Is anyone familiar with this kind of problem? How can we solve it?
>>
>> Any information regarding the issue would help us a lot.
>>
>>
>> Thanks!
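
For anyone wanting to apply one of the two workarounds from this thread per job rather than globally, here is a minimal bash sketch. The function name apply_stdin_workaround is hypothetical, and the check for a terminal on stdin is my own assumption about how to tell an interactive run from a batch run. The thread only warns about setting the Hydra variable globally; the sketch applies the same caution to both variables.

```shell
#!/bin/bash
# Sketch only: apply one workaround per job instead of setting it globally.
# apply_stdin_workaround is a hypothetical helper, not part of Slurm or MPICH.
apply_stdin_workaround() {
    # Assumption: a terminal on stdin means an interactive run, so leave
    # the environment untouched in that case.
    if [ -t 0 ]; then
        return 0
    fi
    case "${1:-hydra}" in
        spank)
            # Only has an effect if the slurm-spank-stdin_fd plugin is installed.
            export SLURM_SPANK_ENSURE_STDIN=1
            ;;
        *)
            # Hydra-side workaround from the earlier message.
            export HYDRA_LAUNCHER_EXTRA_ARGS="--input none"
            ;;
    esac
}
```

In a batch script the function would be called once, as "apply_stdin_workaround spank" or "apply_stdin_workaround hydra", before the line that launches the MPI program.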
