Hello There,

Some of the jobs crash without any apparent valid reason.
The logs are the following:
Controller:
[2017-04-11T08:22:03+02:00] debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE
uid=0
[2017-04-11T08:22:03+02:00] debug2: _slurm_rpc_epilog_complete JobId=830468
Node=cnode001 usec=60
[2017-04-11T08:22:03+02:00] debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE
uid=0
[2017-04-11T08:22:03+02:00] debug2: _slurm_rpc_epilog_complete JobId=830468
Node=cnode007 usec=25
[2017-04-11T08:22:03+02:00] debug:  sched: Running job scheduler
[2017-04-11T08:22:03+02:00] debug2: found 92 usable nodes from config
containing cnode[001-100]
[2017-04-11T08:22:03+02:00] debug2: select_p_job_test for job 830332
[2017-04-11T08:22:03+02:00] sched: Allocate JobId=830332
NodeList=cnode[001,007,022,030-033,041-044,047-048,052-054,058-061]
#CPUs=320
[2017-04-11T08:22:03+02:00] debug2: prolog_slurmctld job 830332 prolog
completed
[2017-04-11T08:22:03+02:00] error: Error opening file
/cm/shared/apps/slurm/var/cm/statesave/job.830332/script, No such file or
directory
[2017-04-11T08:22:03+02:00] error: Error opening file
/cm/shared/apps/slurm/var/cm/statesave/job.830332/environment, No such file
or directory
[2017-04-11T08:22:03+02:00] debug2: Spawning RPC agent for msg_type 4005
[2017-04-11T08:22:03+02:00] debug2: got 1 threads to send out
[2017-04-11T08:22:03+02:00] debug2: Tree head got back 0 looking for 1
[2017-04-11T08:22:03+02:00] debug2: Tree head got back 1
[2017-04-11T08:22:03+02:00] debug2: Tree head got them all
[2017-04-11T08:22:03+02:00] debug2: node_did_resp cnode001
[2017-04-11T08:22:03+02:00] debug2: Processing RPC:
REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=830332
[2017-04-11T08:22:03+02:00] error: slurmd error running JobId=830332 on
node(s)=cnode001: Slurmd could not create a batch directory or file
[2017-04-11T08:22:03+02:00] update_node: node cnode001 reason set to: batch
job complete failure
[2017-04-11T08:22:03+02:00] update_node: node cnode001 state set to DRAINING
[2017-04-11T08:22:03+02:00] completing job 830332
[2017-04-11T08:22:03+02:00] Batch job launch failure, JobId=830332
[2017-04-11T08:22:03+02:00] debug2: Spawning RPC agent for msg_type 6011
[2017-04-11T08:22:03+02:00] sched: job_complete for JobId=830332 successful

Node:
[2017-04-11T08:22:03+02:00] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2017-04-11T08:22:03+02:00] debug:  task_slurmd_batch_request: 830332
[2017-04-11T08:22:03+02:00] debug:  Calling
/cm/shared/apps/slurm/2.5.7/sbin/slurmstepd spank prolog
[2017-04-11T08:22:03+02:00] Reading slurm.conf file: /etc/slurm/slurm.conf
[2017-04-11T08:22:03+02:00] Running spank/prolog for jobid [830332] uid
[40281]
[2017-04-11T08:22:03+02:00] spank: opening plugin stack
/etc/slurm/plugstack.conf
[2017-04-11T08:22:03+02:00] debug:  [job 830332] attempting to run prolog
[/cm/local/apps/cmd/scripts/prolog]
[2017-04-11T08:22:03+02:00] Launching batch job 830332 for UID 40281
[2017-04-11T08:22:03+02:00] debug level is 6.
[2017-04-11T08:22:03+02:00] Job accounting gather LINUX plugin loaded
[2017-04-11T08:22:03+02:00] WARNING: We will use a much slower algorithm
with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other
proctrack when using jobacct_gather/linux
[2017-04-11T08:22:03+02:00] switch NONE plugin loaded
[2017-04-11T08:22:03+02:00] Received cpu frequency information for 16 cpus
[2017-04-11T08:22:03+02:00] setup for a batch_job
[2017-04-11T08:22:03+02:00] [830332] _make_batch_script: called with NULL
script
[2017-04-11T08:22:03+02:00] [830332] batch script setup failed for job
830332.4294967294
[2017-04-11T08:22:03+02:00] [830332] sending REQUEST_COMPLETE_BATCH_SCRIPT,
error:4010
[2017-04-11T08:22:03+02:00] [830332] auth plugin for Munge (
http://code.google.com/p/munge/) loaded
[2017-04-11T08:22:03+02:00] [830332] _step_setup: no job returned
[2017-04-11T08:22:03+02:00] [830332] done with job
[2017-04-11T08:22:03+02:00] debug2: got this type of message 6011
[2017-04-11T08:22:03+02:00] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2017-04-11T08:22:03+02:00] debug:  _rpc_terminate_job, uid = 450
[2017-04-11T08:22:03+02:00] debug:  task_slurmd_release_resources: 830332
[2017-04-11T08:22:03+02:00] debug:  credential for job 830332 revoked
[2017-04-11T08:22:03+02:00] debug2: No steps in jobid 830332 to send signal
18
[2017-04-11T08:22:03+02:00] debug2: No steps in jobid 830332 to send signal
15
[2017-04-11T08:22:03+02:00] debug2: set revoke expiration for jobid 830332
to 1491892923 UTS
[2017-04-11T08:22:03+02:00] debug:  Waiting for job 830332's prolog to
complete
[2017-04-11T08:22:03+02:00] debug:  Finished wait for job 830332's prolog
to complete


I have already checked whether /cm/shared/apps/slurm/var/cm/statesave is
accessible, and it is, from both the compute node and the master node.
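
For the next occurrence I can also run a quick probe like the one below from
both the master and the compute node. It is only a sketch (the path is the one
from the error above, nothing here is Slurm-specific), but it should rule out
a plain permission or mount problem:

#!/usr/bin/env python
# Quick probe (sketch only): can this host list the shared statesave
# directory and create/read back a small file in it?
import os
import socket
import tempfile

STATESAVE = "/cm/shared/apps/slurm/var/cm/statesave"  # path from the slurmctld error

host = socket.gethostname()
entries = os.listdir(STATESAVE)                  # directory must be listable
print("%s: %s contains %d entries" % (host, STATESAVE, len(entries)))

fd, tmp = tempfile.mkstemp(prefix="probe_", dir=STATESAVE)
try:
    os.write(fd, b"probe")                       # write a few bytes...
    os.close(fd)
    with open(tmp, "rb") as f:                   # ...and read them back
        assert f.read() == b"probe"
    print("%s: write/read-back OK" % host)
finally:
    os.remove(tmp)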

What I wonder is: what triggers this behavior? Is it that the master is not
able to create the files, so that the Slurm daemon on the compute node fails,
or is it the other way around?

The issue happens randomly and it is not possible to reproduce it: the same
kind of job can fail or can work, there is no pattern.
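
To try to catch it the next time it happens, I am thinking of leaving a small
watcher running on the master and on one of the nodes, something along these
lines. Again just a sketch with the path taken from the error above, but it
should at least show whether the script/environment files ever become visible
for a failing job:

#!/usr/bin/env python
# Sketch of a watcher: every second, look for new job.* directories under
# the statesave path and record whether their script/environment files are
# visible from this host (only checked once, when the directory is first seen).
import os
import time

STATESAVE = "/cm/shared/apps/slurm/var/cm/statesave"  # path from the error above
seen = set()

while True:
    for entry in sorted(os.listdir(STATESAVE)):
        if not entry.startswith("job.") or entry in seen:
            continue
        seen.add(entry)
        jobdir = os.path.join(STATESAVE, entry)
        present = [f for f in ("script", "environment")
                   if os.path.exists(os.path.join(jobdir, f))]
        print("%s %s: present=%s" % (time.strftime("%Y-%m-%dT%H:%M:%S"),
                                     entry, ",".join(present) or "none"))
    time.sleep(1)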

I have now increased the verbosity from 6 to 9, but I am not sure it will
actually help.

I have also checked the logs on the compute node, to see whether the NFS
client had issues reaching the server, but the logs are clean.

Note: for this job, for example, multiple nodes were allocated, but only
cnode001 failed. All the nodes run the same configuration. I have now
undrained cnode001 and it works without problems; it is always like this
when this happens.

Any idea?

Kind regards,

-- 

Andrea Del Monaco
Internal Engineer



Skype: delmonaco.andrea
andrea.delmon...@clustervision.com

ClusterVision BV
Gyroscoopweg 56
1042 AC Amsterdam
The Netherlands
Tel: +31 20 407 7550
Fax: +31 84 759 8389
www.clustervision.com
