On 16.01.2016 at 21:10, Benjamin Redling wrote:
[...] how is it at all possible that the jobs get lost? What
happened so that the Slurm master thinks all went well? (Does it? Am I
just missing something?)
Where can I start to investigate next?
I could fire several hundred jobs with a dummy shell script against
that node, but as soon as one of my users runs a complex pipeline, jobs
get lost without a slurm-*.out.
What do I fail to understand?
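For reference, the dummy-job stress test mentioned above could look
roughly like this; the partition name and CPU count are taken from the
listing below, everything else (script name, job count) is my assumption:

```shell
# Hypothetical reproduction of the dummy-job test: write a trivial
# batch script and submit it many times. sbatch is guarded so the
# snippet is harmless on a machine without Slurm.
cat > dummy.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=MC20GBplus
#SBATCH --cpus-per-task=4
hostname
EOF
chmod +x dummy.sh

if command -v sbatch >/dev/null 2>&1; then
    for i in $(seq 1 200); do sbatch dummy.sh; done
else
    echo "sbatch not available; script written to dummy.sh"
fi
```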
--- %< ---
3003|runAllPipelinePmc.sh|MC20GBplus||4|00:05:25|FAILED|1:0
3004|runAllPipelinePmc.sh|MC20GBplus||4|00:00:40|CANCELLED|0:0
3005|runAllPipelinePmc.sh|MC20GBplus||8|00:07:25|CANCELLED|0:0
3006|runAllPipelinePmc.sh|MC20GBplus||11|00:00:00|CANCELLED|0:0
3008|runAllPipelinePmc.sh|MC20GBplus||11|00:00:00|CANCELLED|0:0
3007|runAllPipelinePmc.sh|MC20GBplus||11|00:00:00|CANCELLED|0:0
3009|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3010|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3011|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3012|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3013|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3014|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3015|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3016|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3017|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3018|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3019|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3020|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3021|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3022|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3023|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3024|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3025|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3026|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3027|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
--- %< ---
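For quick triage, the per-state counts in such a listing can be tallied
with a small awk one-liner (field 7 is the State column; the sample
lines below are copied from the listing above):

```shell
# Count jobs per state in sacct parsable output (field 7 = State).
awk -F'|' '{count[$7]++} END {for (s in count) print s, count[s]}' <<'EOF' | sort
3003|runAllPipelinePmc.sh|MC20GBplus||4|00:05:25|FAILED|1:0
3004|runAllPipelinePmc.sh|MC20GBplus||4|00:00:40|CANCELLED|0:0
3009|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3027|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
EOF
```

For the four sample lines this prints "CANCELLED 1", "FAILED 2",
"PENDING 1".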
But jobs 3003, 3024, 3025 and 3027 get a "job_complete ... success".
fgrep job_complete /var/log/slurm-llnl/slurmctld.log:
--- %< ---
Jan 25 09:31:01 darwin slurmctld[12198]: sched: job_complete for
JobId=3003 successful
Jan 25 10:25:06 darwin slurmctld[12198]: sched: job_complete for
JobId=3024 successful
Jan 25 10:25:06 darwin slurmctld[12198]: sched: job_complete for
JobId=3025 successful
Jan 25 10:25:25 darwin slurmctld[12198]: sched: job_complete for
JobId=3027 successful
--- %< ---
The slurm-3*.out files for all failed jobs are missing.
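When the output files are missing, the slurmd log on the compute node
itself is worth checking for errors around the failure times. A minimal
sketch, assuming the Debian slurm-llnl default log path (adjust
SlurmdLogFile to your installation):

```shell
# Scan the compute node's slurmd log for recent errors/failures.
# The path is an assumption (Debian slurm-llnl default).
LOG=/var/log/slurm-llnl/slurmd.log
if [ -f "$LOG" ]; then
    grep -iE 'error|fail' "$LOG" | tail -n 20
else
    echo "slurmd log not found at $LOG"
fi
```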
SlurmctldDebug=9 gives me:
--- %< ---
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: cr_job_test:
evaluating job 3027 on 1 nodes
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: _can_job_run_on_node:
16 cpus on s17(1), mem 0/64000
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: eval_nodes:0 consec
c=16 n=1 b=14 e=14 r=-1
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: cr_job_test: test 0
pass - job fits on given resources
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: _can_job_run_on_node:
16 cpus on s17(1), mem 0/64000
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: eval_nodes:0 consec
c=16 n=1 b=14 e=14 r=-1
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: cr_job_test: test 1
pass - idle resources found
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: cr_job_test:
distributing job 3027
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: cr_job_test: job 3027
ncpus 4 cbits 16/16 nbits 1
Jan 25 10:25:25 darwin slurmctld[12198]: DEBUG: job 3027 node s17 vpus 1
cpus 4
Jan 25 10:25:25 darwin slurmctld[12198]:
Jan 25 10:25:25 darwin slurmctld[12198]: job_id:3027 nhosts:1 ncpus:4
node_req:1 nodes=s17
Jan 25 10:25:25 darwin slurmctld[12198]: Node[0]:
Jan 25 10:25:25 darwin slurmctld[12198]: Mem(MB):23000:0 Sockets:2
Cores:8 CPUs:4:0
Jan 25 10:25:25 darwin slurmctld[12198]: Socket[0] Core[0] is allocated
Jan 25 10:25:25 darwin slurmctld[12198]: Socket[0] Core[1] is allocated
Jan 25 10:25:25 darwin slurmctld[12198]: Socket[1] Core[0] is allocated
Jan 25 10:25:25 darwin slurmctld[12198]: Socket[1] Core[1] is allocated
Jan 25 10:25:25 darwin slurmctld[12198]:
Jan 25 10:25:25 darwin slurmctld[12198]: cpu_array_value[0]:4 reps:1
Jan 25 10:25:25 darwin slurmctld[12198]:
Jan 25 10:25:25 darwin slurmctld[12198]: DEBUG: Dump job_resources:
nhosts 1 cb 0-1,8-9
Jan 25 10:25:25 darwin slurmctld[12198]: DEBUG: _add_job_to_res (after):
Jan 25 10:25:25 darwin slurmctld[12198]: part:MC20GBplus rows:1 pri:50
Jan 25 10:25:25 darwin slurmctld[12198]:
--- %< ---
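To cross-check what the accounting database recorded for one of the
suspicious jobs against the scheduler log above, something along these
lines should work (the exact field list is my choice; the command is
guarded so it is harmless on a machine without Slurm):

```shell
# Pull the accounting record for job 3027 in parsable form.
if command -v sacct >/dev/null 2>&1; then
    sacct -j 3027 --parsable2 \
          --format=JobID,State,ExitCode,DerivedExitCode,NodeList,Start,End
else
    echo "sacct not available"
fi
```

Comparing ExitCode and DerivedExitCode per job step may show whether
the "job_complete ... successful" refers to the allocation while an
individual step failed.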