Here's the SPANK plugin:
https://github.com/aaronknister/slurm-spank-stdin_fd

To use it, just set the environment variable SLURM_SPANK_ENSURE_STDIN to 1.
You then shouldn't need the HYDRA_LAUNCHER_EXTRA_ARGS variable. Disclaimer:
I haven't tested it extensively, but it worked very well in the testing I
did.
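
As a rough sketch of how that might look near the top of a job script. This
assumes the plugin is already built and listed in plugstack.conf; the function
name and the commented-out launch line are only illustrative and not part of
the plugin:

```shell
#!/bin/bash
# Hypothetical helper: turn on the SPANK plugin's stdin handling for this job.
enable_stdin_plugin() {
    # The variable the plugin looks for, as described above.
    export SLURM_SPANK_ENSURE_STDIN=1
    # The earlier Hydra workaround should no longer be needed, so drop it
    # if it was inherited from the login environment.
    unset HYDRA_LAUNCHER_EXTRA_ARGS
}

enable_stdin_plugin

# Launch the MPI application as usual (placeholder name, adjust to your job):
# mpiexec ./my_app
```

Setting the variable inside the job script rather than globally keeps the
change limited to the jobs that need it.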

-Aaron

On Thu, Aug 20, 2015 at 2:36 PM, Aaron Knister <[email protected]>
wrote:

> Hi Igor,
>
> I stumbled across something that looks reallllllly similar to this. You
> can check out what I found here:
> https://groups.google.com/d/topic/slurm-devel/4q1-1GYE28U/discussion. I
> posted a fix to the mpich folks not too long ago for the hydra side of
> things. A resolution for us was to set the following environment variable:
>
> bash:
> export HYDRA_LAUNCHER_EXTRA_ARGS="--input none"
>
> tcsh/csh:
> setenv HYDRA_LAUNCHER_EXTRA_ARGS "--input none"
>
> The only problem is that you have to be careful if you set this globally. It
> can throw off interactive runs. There's also a SPANK plugin I wrote that
> should help with this. I'll post that to github in a few and post the link.
>
> -Aaron
>
> On Tue, Aug 11, 2015 at 1:57 AM, Igor Chebotar <[email protected]>
> wrote:
>
>>
>> Hi all,
>>
>> We are experiencing a problem with a few applications that use MPI.
>>
>> When we run a job, the processes appear to run well without any errors,
>> but when the job ends we always get errors like the following in the
>> Slurm output:
>>
>> Slurm output file:
>>
>> ===================================================================================
>>
>> =   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>
>> =   PID 25147 RUNNING AT bee002
>>
>> =   EXIT CODE: 2
>>
>> =   CLEANING UP REMAINING PROCESSES
>>
>> =   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>
>>
>> ===================================================================================
>>
>> [proxy:0:0@bee001] HYD_pmcd_pmip_control_cmd_cb
>> (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
>>
>> [proxy:0:0@bee001] HYDT_dmxu_poll_wait_for_event
>> (tools/demux/demux_poll.c:76): callback returned error status
>>
>> [proxy:0:0@bee001] main (pm/pmiserv/pmip.c:206): demux engine error
>> waiting for event
>>
>> srun: error: bee001: task 0: Exited with exit code 7
>>
>> srun: error: _server_read: fd 19 got error or unexpected eof reading
>> header
>>
>> srun: error: step_launch_notify_io_failure: aborting, io error with
>> slurmstepd on node 2
>>
>> [mpiexec@bee001] HYDT_bscu_wait_for_completion
>> (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated
>> badly; aborting
>>
>> [mpiexec@bee001] HYDT_bsci_wait_for_completion
>> (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for
>> completion
>>
>> [mpiexec@bee001] HYD_pmci_wait_for_completion
>> (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for
>> completion
>>
>> [mpiexec@bee001] main (ui/mpich/mpiexec.c:344): process manager error
>> waiting for completion
>>
>>
>> Slurm's Email Title upon failure:
>> SLURM Job_id=243524 Name=cloudy Failed, Run time 1-14:46:11, FAILED,
>> ExitCode 255
>>
>>
>> The output of the job itself looks fine, which suggests that everything
>> runs correctly and the problem occurs only when the job is ending.
>>
>> Information about our system:
>>
>> Applications that have this problem: FLASH, CLOUDY
>> MPI versions we ran those applications with: openmpi 1.8.4,
>> openmpi 1.8.5, mpich 3.1.3
>> Slurm version: we tried running the applications on both slurm 14.11.8 and
>> 14.11.3
>> OS: CentOS 6.5
>>
>> Is anyone familiar with this kind of problem? How can we solve it?
>>
>> Any information regarding the issue would help us a lot.
>>
>>
>> Thanks!
>
>
>
