Hello there,

Some of the jobs crash without any apparent valid reason. The logs are the following:

Controller:

[2017-04-11T08:22:03+02:00] debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
[2017-04-11T08:22:03+02:00] debug2: _slurm_rpc_epilog_complete JobId=830468 Node=cnode001 usec=60
[2017-04-11T08:22:03+02:00] debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
[2017-04-11T08:22:03+02:00] debug2: _slurm_rpc_epilog_complete JobId=830468 Node=cnode007 usec=25
[2017-04-11T08:22:03+02:00] debug: sched: Running job scheduler
[2017-04-11T08:22:03+02:00] debug2: found 92 usable nodes from config containing cnode[001-100]
[2017-04-11T08:22:03+02:00] debug2: select_p_job_test for job 830332
[2017-04-11T08:22:03+02:00] sched: Allocate JobId=830332 NodeList=cnode[001,007,022,030-033,041-044,047-048,052-054,058-061] #CPUs=320
[2017-04-11T08:22:03+02:00] debug2: prolog_slurmctld job 830332 prolog completed
[2017-04-11T08:22:03+02:00] error: Error opening file /cm/shared/apps/slurm/var/cm/statesave/job.830332/script, No such file or directory
[2017-04-11T08:22:03+02:00] error: Error opening file /cm/shared/apps/slurm/var/cm/statesave/job.830332/environment, No such file or directory
[2017-04-11T08:22:03+02:00] debug2: Spawning RPC agent for msg_type 4005
[2017-04-11T08:22:03+02:00] debug2: got 1 threads to send out
[2017-04-11T08:22:03+02:00] debug2: Tree head got back 0 looking for 1
[2017-04-11T08:22:03+02:00] debug2: Tree head got back 1
[2017-04-11T08:22:03+02:00] debug2: Tree head got them all
[2017-04-11T08:22:03+02:00] debug2: node_did_resp cnode001
[2017-04-11T08:22:03+02:00] debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=830332
[2017-04-11T08:22:03+02:00] error: slurmd error running JobId=830332 on node(s)=cnode001: Slurmd could not create a batch directory or file
[2017-04-11T08:22:03+02:00] update_node: node cnode001 reason set to: batch job complete failure
[2017-04-11T08:22:03+02:00] update_node: node cnode001 state set to DRAINING
[2017-04-11T08:22:03+02:00] completing job 830332
[2017-04-11T08:22:03+02:00] Batch job launch failure, JobId=830332
[2017-04-11T08:22:03+02:00] debug2: Spawning RPC agent for msg_type 6011
[2017-04-11T08:22:03+02:00] sched: job_complete for JobId=830332 successful
Node:

[2017-04-11T08:22:03+02:00] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2017-04-11T08:22:03+02:00] debug: task_slurmd_batch_request: 830332
[2017-04-11T08:22:03+02:00] debug: Calling /cm/shared/apps/slurm/2.5.7/sbin/slurmstepd spank prolog
[2017-04-11T08:22:03+02:00] Reading slurm.conf file: /etc/slurm/slurm.conf
[2017-04-11T08:22:03+02:00] Running spank/prolog for jobid [830332] uid [40281]
[2017-04-11T08:22:03+02:00] spank: opening plugin stack /etc/slurm/plugstack.conf
[2017-04-11T08:22:03+02:00] debug: [job 830332] attempting to run prolog [/cm/local/apps/cmd/scripts/prolog]
[2017-04-11T08:22:03+02:00] Launching batch job 830332 for UID 40281
[2017-04-11T08:22:03+02:00] debug level is 6.
[2017-04-11T08:22:03+02:00] Job accounting gather LINUX plugin loaded
[2017-04-11T08:22:03+02:00] WARNING: We will use a much slower algorithm with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other proctrack when using jobacct_gather/linux
[2017-04-11T08:22:03+02:00] switch NONE plugin loaded
[2017-04-11T08:22:03+02:00] Received cpu frequency information for 16 cpus
[2017-04-11T08:22:03+02:00] setup for a batch_job
[2017-04-11T08:22:03+02:00] [830332] _make_batch_script: called with NULL script
[2017-04-11T08:22:03+02:00] [830332] batch script setup failed for job 830332.4294967294
[2017-04-11T08:22:03+02:00] [830332] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4010
[2017-04-11T08:22:03+02:00] [830332] auth plugin for Munge (http://code.google.com/p/munge/) loaded
[2017-04-11T08:22:03+02:00] [830332] _step_setup: no job returned
[2017-04-11T08:22:03+02:00] [830332] done with job
[2017-04-11T08:22:03+02:00] debug2: got this type of message 6011
[2017-04-11T08:22:03+02:00] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2017-04-11T08:22:03+02:00] debug: _rpc_terminate_job, uid = 450
[2017-04-11T08:22:03+02:00] debug: task_slurmd_release_resources: 830332
[2017-04-11T08:22:03+02:00] debug: credential for job 830332 revoked
[2017-04-11T08:22:03+02:00] debug2: No steps in jobid 830332 to send signal 18
[2017-04-11T08:22:03+02:00] debug2: No steps in jobid 830332 to send signal 15
[2017-04-11T08:22:03+02:00] debug2: set revoke expiration for jobid 830332 to 1491892923 UTS
[2017-04-11T08:22:03+02:00] debug: Waiting for job 830332's prolog to complete
[2017-04-11T08:22:03+02:00] debug: Finished wait for job 830332's prolog to complete

I have already checked that /cm/shared/apps/slurm/var/cm/statesave is accessible, and it is, both from the compute node and from the master node. What I wonder is: what triggers this behavior? Is the master unable to create the files, so that the slurmd on the compute node fails, or is it the other way around? The issue happens randomly and is not reproducible: the same kind of job can fail or succeed, with no pattern. I have now increased the verbosity from 6 to 9, but I am not sure it will actually help. I have also checked the logs on the compute node to see whether the NFS client had issues reaching the server, but those logs are clean.

Note: for this job, for example, multiple nodes were allocated, but only cnode001 failed. All the nodes run the same configuration. I have now undrained cnode001 and it works without problems; it is always like this when this happens.

Any idea?

Kind regards,
--
Andrea Del Monaco
Internal Engineer
Skype: delmonaco.andrea
andrea.delmon...@clustervision.com

ClusterVision BV
Gyroscoopweg 56
1042 AC Amsterdam
The Netherlands
Tel: +31 20 407 7550
Fax: +31 84 759 8389
www.clustervision.com
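P.S. To make the accessibility check a bit more systematic than a one-off ls, the sketch below (my own, not from any Slurm tooling) creates a marker file in the statesave directory and polls until it becomes visible. In the real test the touch would run on the master and the polling loop on a compute node (e.g. over ssh), so that a nonzero poll count would point at NFS attribute-cache latency rather than permissions. The marker filename is invented, and the script falls back to /tmp when the statesave path is not mounted, so it can be tried anywhere.

```shell
#!/bin/sh
# Hypothetical diagnostic sketch: write a marker file into the shared
# statesave directory, then poll until it is visible. Run locally the
# file appears immediately; split the touch (master) and the loop
# (compute node, via ssh) to expose any NFS visibility delay.

STATESAVE=/cm/shared/apps/slurm/var/cm/statesave
[ -d "$STATESAVE" ] || STATESAVE=/tmp   # fallback so the sketch runs anywhere

MARKER="$STATESAVE/statesave_visibility_test.$$"   # invented name
touch "$MARKER"                # on the master in the real test

polls=0
until [ -e "$MARKER" ] || [ "$polls" -ge 50 ]; do
    polls=$((polls + 1))
    sleep 0.1                  # on the compute node in the real test
done

if [ -e "$MARKER" ]; then
    echo "marker visible after $polls polls"
else
    echo "marker still not visible after $polls polls"
fi
rm -f "$MARKER"
```

If the node ever needs several polls while the master sees the file immediately, that would suggest looking at NFS caching options on the client mount; but this is only a way to probe the question above, not a confirmed cause.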