I'm using SLURM to offload firmware builds from the systems our developers 
use for general work (editing, etc.) to a separate set of systems provisioned 
for builds.  This is done with a make wrapper that uses srun to transparently 
dispatch the make to the build partition.
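
A minimal sketch of such a wrapper, in Python, might look like the following.
The partition name "build" and the function names are placeholders chosen for
illustration, not the actual wrapper.

```python
import subprocess

# Hypothetical partition name; substitute the real build partition.
BUILD_PARTITION = "build"


def build_srun_command(make_args, partition=BUILD_PARTITION):
    """Return the argv that runs `make` with make_args under srun."""
    return ["srun", f"--partition={partition}", "make", *make_args]


def run_make(make_args, partition=BUILD_PARTITION):
    """Dispatch the build and return the exit status reported by srun."""
    # srun blocks until the remote task finishes, so to the developer this
    # behaves like a local make invocation.
    return subprocess.call(build_srun_command(make_args, partition))
```

Installed on the developer systems under a name that shadows make, something
like run_make(sys.argv[1:]) would forward the arguments unchanged.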

For the most part, this works well.  However, on occasion we find that the 
build has completed but SLURM still has an outstanding job allocation.  
Eventually the partition is full of such allocations, which causes subsequent 
jobs to queue until I manually clear them with scancel.
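
As a rough aid for spotting such allocations before the partition fills, the
sketch below lists running jobs whose elapsed time exceeds a threshold.  It
assumes that a build running far longer than usual is a leftover allocation,
which is only a heuristic; the function names and the threshold are
hypothetical.  It relies on squeue's %i (job id) and %M (time used) format
fields.

```python
import subprocess


def parse_elapsed(text):
    """Convert squeue's %M value ([days-][hours:]minutes:seconds) to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    while len(parts) < 3:  # hours (and days) are printed only when needed
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def long_running_jobs(squeue_output, max_secs):
    """Return job ids from 'JOBID ELAPSED' lines that exceed max_secs."""
    jobs = []
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        job_id, elapsed = fields
        if parse_elapsed(elapsed) > max_secs:
            jobs.append(job_id)
    return jobs


def list_partition_jobs(partition):
    """Ask squeue for the running jobs in one partition, without a header."""
    return subprocess.check_output(
        ["squeue", "--noheader", "--states=RUNNING",
         f"--partition={partition}", "--format=%i %M"],
        text=True,
    )
```

The resulting ids are candidates to inspect by hand before passing any of them
to scancel.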

I believe the job is exiting normally, but something is going awry after it 
exits.  As far as my developers/users are concerned, nothing appears out of 
place.  The one difference I see between a job that goes wrong and one that 
cleans up after itself is the "error: Abandoning IO 180 secs after job shutdown 
initiated" message in the slurmd/slurmstepd logs.  I don't know if that message 
is relevant.  FWIW, I did notice that the shutdown time comparison is using 
wall clock time rather than a monotonically increasing clock source, so that 
code may be susceptible to time sync adjustments, but that shouldn't be in play 
in our environment.
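
To illustrate the clock concern: a timeout measured against a monotonic clock
cannot be shortened or stretched by a time sync step, whereas one measured
against wall clock time can.  The following is a Python sketch of the idea
only, not the slurmstepd code; the class name ShutdownTimer and its injectable
clock parameter are hypothetical.

```python
import time


class ShutdownTimer:
    """Track whether timeout_secs have elapsed since construction."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        # time.monotonic() never goes backwards and is not affected by
        # adjustments to the system's wall clock time.
        self._clock = clock
        self._timeout = timeout_secs
        self._start = clock()

    def expired(self):
        return self._clock() - self._start >= self._timeout
```

Passing clock=time.time instead reproduces the wall clock behavior, where a
step adjustment during the wait changes when the timer fires.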

I recently upgraded SLURM from 2.6.9 to 14.11.3 hoping that it would help, but 
the problem persists.

Can anyone offer any pointers or guidance that would help me find out what's 
going wrong?  Many thanks in advance.

    --jtc

Logs for a job from the controller:
Feb 17 18:28:30 slurm-maa-01 slurmctld[31130]: sched: _slurm_rpc_allocate_resources JobId=126300 NodeList=emake-maa-02 usec=2337

Logs for a job from the node:
Feb 17 18:28:30 emake-maa-02 slurmd[21360]: launch task 126300.0 request from [email protected] (port 50820)
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: switch NONE plugin loaded
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: AcctGatherProfile NONE plugin loaded
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: AcctGatherEnergy NONE plugin loaded
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: AcctGatherInfiniband NONE plugin loaded
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: AcctGatherFilesystem NONE plugin loaded
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: Job accounting gather NOT_INVOKED plugin loaded
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: Message thread started pid = 15414
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: task NONE plugin loaded
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: Checkpoint plugin loaded: checkpoint/none
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: mpi type = none
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: spank: opening plugin stack /etc/slurm/plugstack.conf
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: mpi type = (null)
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: mpi/none: slurmstepd prefork
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: /proc/self/oom_score_adj not found. Falling back to oom_adj
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: debug level = 2
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: IO handler started pid=15414
Feb 17 18:28:30 emake-maa-02 slurmstepd[15419]: task_p_pre_launch_priv: 126300.0
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: task 0 (15419) started 2015-02-17T18:28:30
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: job_container none plugin loaded
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: No cgroup.conf file (/etc/slurm/cgroup.conf)
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: Sending launch resp rc=0
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: auth plugin for Munge (http://code.google.com/p/munge/) loaded
Feb 17 18:28:30 emake-maa-02 slurmstepd[15419]: mpi type = (null)
Feb 17 18:28:30 emake-maa-02 slurmstepd[15419]: Using mpi/none
Feb 17 18:28:30 emake-maa-02 slurmstepd[15419]: task_p_pre_launch: 126300.0, task 0
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: Handling REQUEST_STEP_UID
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: Handling REQUEST_SIGNAL_CONTAINER
Feb 17 18:28:30 emake-maa-02 slurmstepd[15414]: _handle_signal_container for step=126300.0 uid=0 signal=995
Feb 17 18:32:23 emake-maa-02 slurmstepd[15414]: task 0 (15419) exited with exit code 0.
Feb 17 18:32:23 emake-maa-02 slurmstepd[15414]: task_p_post_term: 126300.0, task 0
Feb 17 18:32:23 emake-maa-02 slurmstepd[15414]: Waiting for IO
Feb 17 18:32:23 emake-maa-02 slurmstepd[15414]: Closing debug channel
Feb 17 18:35:23 emake-maa-02 slurmstepd[15414]: error: Abandoning IO 180 secs after job shutdown initiated
Feb 17 18:35:23 emake-maa-02 slurmstepd[15414]: IO handler exited, rc=-1
Feb 17 18:35:23 emake-maa-02 slurmstepd[15414]: Message thread exited
Feb 17 18:35:23 emake-maa-02 slurmstepd[15414]: done with job
