Yes, it is a change in behaviour. There was a fix in the I/O module that
unfortunately introduced this side effect.
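(For anyone finding this thread later: a minimal sketch of the salloc-based workflow discussed below. The node count and any partition or resource options are illustrative and will depend on your site; these commands require a working Slurm cluster.)

```
# Request a two-node allocation; once granted, salloc runs the
# configured SallocDefaultCommand (or a local shell by default).
salloc -N 2

# Inside the allocation, launch the interactive pseudo-terminal step:
srun -N 2 --pty /bin/bash
```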
On 05/20/2014 12:38 PM, Martins Innus wrote:
OK thanks, that works with salloc, but does that mean that what used to
work in 2.6, namely running:
srun -N 2 --pty /bin/bash
on its own and not within an salloc, is no longer supported and expected
to fail?
Thanks
Martins
On 5/20/14 1:23 PM, David Bigagli wrote:
In 14.03 you should use the SallocDefaultCommand option, as documented in
http://slurm.schedmd.com/slurm.conf.html,
to run srun with the --pty option.
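(A sketch of the relevant slurm.conf entry, adapted from the example in the slurm.conf documentation linked above; the exact srun options shown are illustrative and should be adjusted for your site before use.)

```
# slurm.conf: have salloc launch an interactive shell step via srun
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"
```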
On 05/19/2014 10:59 AM, Martins Innus wrote:
Has there been any news on this issue? I am seeing the same thing with
14.03.3.
Thanks
Martins
On 5/6/14 2:43 PM, Michael Robbert wrote:
We just upgraded Slurm from 2.6.6 to 14.03.2-1 on a Linux cluster, and
now we are having problems with interactive jobs using srun with
--pty. If we get more than one node, the job exits after two keystrokes
with these errors:
srun: error: _server_read: fd 18 got error or unexpected eof reading
header
srun: debug: IO error on node 1
srun: error: step_launch_notify_io_failure: aborting, io error with
slurmstepd on node 1
[mrobbert@node001 mpi]$ srun: Job step aborted: Waiting up to 2
seconds for job step to finish.
srun: Complete job step 422.0 received
srun: debug: task 0 done
srun: Received task exit notification for 1 task (status=0x0009).
srun: error: node001: task 0: Killed
srun: debug: IO thread exiting
srun: debug: Leaving _msg_thr_internal
This does not happen if the job is on one node or if we don't use
--pty. I have run with some debugging enabled, and we receive task
exit notifications from the tasks on the secondary node right after
startup. Let me know what other debugging output would be useful here.
Thanks,
Mike Robbert
--
Thanks,
/David/Bigagli
www.schedmd.com