Yes, it is a change in behaviour. There was a fix in the I/O module that
unfortunately introduced this side effect.
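(For anyone finding this thread later: a minimal sketch of the salloc-based workflow discussed below. The node count and any partition or resource options are illustrative and will depend on your site; these commands require a working Slurm cluster.)

```
# Request a two-node allocation; once granted, salloc runs the
# configured SallocDefaultCommand (or a local shell by default).
salloc -N 2

# Inside the allocation, launch the interactive pseudo-terminal step:
srun -N 2 --pty /bin/bash
```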
On 05/20/2014 12:38 PM, Martins Innus wrote:
OK thanks, that works with salloc, but does that mean that what used to
work in 2.6, namely running:
srun -N 2 --pty /bin/bash
on its own and not within an salloc, is no longer supported and expected
to fail?
Thanks
Martins
On 5/20/14 1:23 PM, David Bigagli wrote:
In 14.03 you should use the SallocDefaultCommand option, as documented in
http://slurm.schedmd.com/slurm.conf.html,
to run srun with the --pty option.
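(A sketch of the relevant slurm.conf entry, adapted from the example in the slurm.conf documentation linked above; the exact srun options shown are illustrative and should be adjusted for your site before use.)

```
# slurm.conf: have salloc launch an interactive shell step via srun
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env --mpi=none $SHELL"
```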
On 05/19/2014 10:59 AM, Martins Innus wrote:
Has there been any news on this issue? I am seeing the same thing with
14.03.3.
Thanks
Martins
On 5/6/14 2:43 PM, Michael Robbert wrote:
We just upgraded Slurm from 2.6.6 to 14.03.2-1 on a Linux cluster, and
now we are having problems with interactive jobs using srun with
--pty. If we get more than one node, the job exits after two keystrokes
with these errors:
srun: error: _server_read: fd 18 got error or unexpected eof reading
header
srun: debug: IO error on node 1
srun: error: step_launch_notify_io_failure: aborting, io error with
slurmstepd on node 1
[mrobbert@node001 mpi]$ srun: Job step aborted: Waiting up to 2
seconds for job step to finish.
srun: Complete job step 422.0 received
srun: debug: task 0 done
srun: Received task exit notification for 1 task (status=0x0009).
srun: error: node001: task 0: Killed
srun: debug: IO thread exiting
srun: debug: Leaving _msg_thr_internal
This does not happen if the job is on one node or if we don't use
--pty. I have run with some debugging enabled, and we receive task
exit notifications from the tasks on the secondary node right after
startup. Let me know what other debugging output would be useful here.
Thanks,
Mike Robbert
--
Thanks,
/David/Bigagli
www.schedmd.com