Dear SLURM,
I am running into this error (srun: debug3: got eof-stdout msg on _server_read
header), which I believe is the root cause of an "interactive" job being
terminated immediately after it is allocated resources. As far as SLURM and
the job are concerned, the job's status shows as completed. I have run out of
options: I debugged the PAM stack and everything else I could think of, but
couldn't get the remote bash shell to launch. The next obvious step is to add
debug output to the slurmstepd source files and see if I can interpret what is
going on, but I wanted to see if I can get help here first. I should also
mention that I have another set of nodes with exactly the same setup, and
interactive jobs work perfectly fine there. The only difference between the
two node types is that one is RHEL 6.5 and the other is RHEL 6.7.
Any help is greatly appreciated.
Thank you,
Amit
Here is what I tried and what I see:
#srun -d9 -p gpu --exclusive --pty $SHELL
Or
#srun -d9 -p gpu -n1 --x11=first --pty $SHELL
srun: job 12908567 queued and waiting for resources
srun: job 12908567 has been allocated resources
srun: error: x11: job has no allocated nodes defined
login01:/users/ahkumar
$
With additional debug flags, I see that something is causing the task's stdout
to return an EOF message. Below is a snippet from the SlurmdLogFile. I do have
the slurm-spank-x11 plugin installed correctly.
[2016-10-02T21:43:39.825] [12894576.0] Handling REQUEST_SIGNAL_CONTAINER
[2016-10-02T21:43:39.825] [12894576.0] _handle_signal_container for step=12894576.0 uid=0 signal=995
[2016-10-02T21:43:39.825] [12894576.0] Leaving _handle_request: SLURM_SUCCESS
[2016-10-02T21:43:39.825] [12894576.0] Entering _handle_request
[2016-10-02T21:43:39.825] [12894576.0] Leaving _handle_accept
[2016-10-02T21:43:39.841] [12894576.0] Entering _task_read for obj 23a7240
[2016-10-02T21:43:39.841] [12894576.0] error in _task_read: Input/output error
[2016-10-02T21:43:39.841] [12894576.0] got eof on task
[2016-10-02T21:43:39.841] [12894576.0] ************************ -1 bytes read from task STDOUT
[2016-10-02T21:43:39.841] [12894576.0] Entering _send_eof_msg
[2016-10-02T21:43:39.841] [12894576.0] Myname in build_hashtbl: (slurmstepd)
[2016-10-02T21:43:39.841] [12894576.0] ======================== Enqueued eof message
[2016-10-02T21:43:39.841] [12894576.0] Leaving _send_eof_msg
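
In case it helps anyone who wants to compare the same step's log on a working
and a failing node, here is a rough Python sketch for pulling out the lines
above. It only assumes the "[timestamp] [jobid.stepid] message" layout seen in
this snippet. The names step_messages and io_problems are my own helpers, not
part of SLURM, and the "error"/"eof" keyword filter is just a guess at what is
worth looking at.

```python
import re

# Matches lines shaped like "[timestamp] [jobid.stepid] message", as in the
# snippet above. This layout is an assumption based on that snippet only.
LOG_LINE = re.compile(r"^\[([^\]]+)\] \[(\d+\.\d+)\] (.*)$")


def step_messages(lines, step):
    """Return (timestamp, message) pairs for one step, rejoining wrapped lines."""
    messages = []
    in_step = False
    for raw in lines:
        line = raw.rstrip("\n")
        match = LOG_LINE.match(line)
        if match:
            timestamp, step_id, text = match.groups()
            in_step = step_id == step
            if in_step:
                messages.append((timestamp, text))
        elif in_step and line.strip():
            # A line without a "[timestamp] [step]" prefix is treated as the
            # wrapped tail of the previous message for this step.
            timestamp, text = messages[-1]
            messages[-1] = (timestamp, text + " " + line.strip())
    return messages


def io_problems(messages):
    """Keep only messages that mention an error or an EOF (hypothetical filter)."""
    return [
        (timestamp, text)
        for timestamp, text in messages
        if "error" in text.lower() or "eof" in text.lower()
    ]


if __name__ == "__main__":
    sample = [
        "[2016-10-02T21:43:39.841] [12894576.0] Entering _task_read for obj 23a7240",
        "[2016-10-02T21:43:39.841] [12894576.0] error in _task_read: Input/output error",
        "[2016-10-02T21:43:39.841] [12894576.0] got eof on task",
    ]
    for timestamp, text in io_problems(step_messages(sample, "12894576.0")):
        print(timestamp, text)
```

Running the same filter over the slurmd log from one of the working RHEL nodes
should show whether the Input/output error in _task_read appears there too.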