Hi all,

I'm facing an issue with the SPANK X11 plugin: X11-forwarded interactive jobs fail, whereas plain interactive jobs work just fine:
OK non-X11 interactive job:
[ade.fewings@cstl001 ~]$ srun -n1 hostname
cst001
Failing X11 interactive job:
[ade.fewings@cstl001 ~]$ srun --x11 xterm
srun: error: x11: job has no allocated nodes defined
srun: error: spank: required plugin x11.so: local_user_init() failed with rc=-5
srun: error: Failure in local plugin stack
Looking at the plugin source, the failure (the rc=-5 above) comes from this block:
/* get job infos */
status = slurm_load_job(&job_buffer_ptr, jobid, SHOW_ALL);
if ( status != 0 ) {
        ERROR("x11: unable to get job infos");
        status = -3;
        goto exit;
}

/* check infos validity */
if ( job_buffer_ptr->record_count != 1 ) {
        ERROR("x11: job infos are invalid");
        status = -4;
        goto clean_exit;
}
job_ptr = job_buffer_ptr->job_array;

/* check allocated nodes var */
if ( job_ptr->nodes == NULL ) {
        ERROR("x11: job has no allocated nodes defined");
        status = -5;
        goto clean_exit;
}
This seems a bit strange. Does anybody know why the job's 'nodes' field would come back NULL here and trigger this error?
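To see whether slurmctld itself is reporting an empty node list, or whether only the plugin ends up seeing NULL, something along these lines should print exactly what slurm_load_job() hands back. This is just a minimal standalone sketch against the public client API in slurm/slurm.h (the file name and variable names are mine, not the plugin's):

/*
 * jobnodes.c - print what slurm_load_job() returns for a given job id.
 * Diagnostic sketch only; build with:  gcc -o jobnodes jobnodes.c -lslurm
 */
#include <stdio.h>
#include <stdlib.h>
#include <slurm/slurm.h>

int main(int argc, char **argv)
{
        job_info_msg_t *msg = NULL;
        uint32_t jobid;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <jobid>\n", argv[0]);
                return 1;
        }
        jobid = (uint32_t) strtoul(argv[1], NULL, 10);

        /* same call the plugin makes just before the rc=-5 check */
        if (slurm_load_job(&msg, jobid, SHOW_ALL) != 0) {
                slurm_perror("slurm_load_job");
                return 2;
        }

        printf("record_count = %u\n", (unsigned) msg->record_count);
        if (msg->record_count > 0) {
                job_info_t *job = msg->job_array;
                printf("nodes        = %s\n",
                       job->nodes ? job->nodes : "(NULL)");
        }

        slurm_free_job_info_msg(msg);
        return 0;
}

Run against a normal job it should report nodes = cst001 (or similar); if it ever prints (NULL) for a job that has already been allocated, that would point at the controller side rather than the plugin.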
I believe this 'used to work', so I suspect it is configuration related, but we don't do anything special. I've verified that X forwarding itself works by running 'ssh -X' to the relevant compute node, and I've tried several minor versions of 15.08 (the only major release we have ever run) to no avail.
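For reference, the plugin is wired in through plugstack.conf with the standard one-line entry and no extra arguments; the path below is illustrative of our layout rather than copied verbatim, but the 'required' matches the srun error above:

    required        /usr/lib64/slurm/x11.so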
Possibly related: an error is logged for the failing jobs that never appears for the successful non-X11 ones, but so far I've drawn a blank trying to track down what causes it (e.g. one mailing list posting pointed at http://slurm.schedmd.com/big_sys.html, which didn't lead anywhere):
OK non-X11 interactive job:
Jun 3 11:51:00 cstl002 slurmctld[3189]: sched: _slurm_rpc_allocate_resources JobId=760 NodeList=cst001 usec=1988
Jun 3 11:51:00 cst001 slurmd[3574]: launch task 760.0 request from [email protected] (port 4777)
Jun 3 11:51:00 cst001 slurmd[3574]: lllp_distribution jobid [760] implicit auto binding: cores,one_thread, dist 1
Jun 3 11:51:00 cst001 slurmd[3574]: _task_layout_lllp_cyclic
Jun 3 11:51:00 cst001 slurmd[3574]: _lllp_generate_cpu_bind jobid [760]: mask_cpu,one_thread, 0x0001
Jun 3 11:51:00 cst001 slurmd[3574]: _run_prolog: run job script took usec=30553
Jun 3 11:51:00 cst001 slurmd[3574]: _run_prolog: prolog with lock for job 760 ran for 0 seconds
Jun 3 11:51:00 cstl002 slurmctld[3189]: job_complete: JobID=760 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
Jun 3 11:51:00 cstl002 slurmctld[3189]: job_complete: JobID=760 State=0x8003 NodeCnt=1 done
Failing X11 interactive job:
Jun 3 11:51:09 cstl002 slurmctld[3189]: sched: _slurm_rpc_allocate_resources JobId=761 NodeList=cst001 usec=1039
Jun 3 11:51:09 cstl002 slurmctld[3189]: job_complete: JobID=761 State=0x1 NodeCnt=1 WIFEXITED 0 WEXITSTATUS 255
Jun 3 11:51:09 cstl002 slurmctld[3189]: job_complete: JobID=761 State=0x1 NodeCnt=1 cancelled by interactive user
Jun 3 11:51:09 cst001 slurmd[3574]: error: _step_connect: connect() failed dir /var/spool/slurmd node cst001 job 761 step 0 No such file or directory
Jun 3 11:51:09 cstl002 slurmctld[3189]: job_complete: JobID=761 State=0x8004 NodeCnt=1 done
The directory mentioned exists and is the default value for SlurmdSpoolDir.
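Concretely, all I've checked about that directory so far amounts to the equivalent of the following, both of which come back as expected:

    # on cst001
    ls -ld /var/spool/slurmd
    scontrol show config | grep SlurmdSpoolDir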
Any advice on a way forward would be appreciated.
Thanks
Ade
HPC Wales - www.hpcwales.co.uk