[slurm-dev] Re: jobs vanishing w/o trace(?)

2016-01-25 Thread Benjamin Redling


On 16.01.2016 at 21:10, Benjamin Redling wrote:

[...] how is it at all possible that the jobs get lost? What
happened that the slurm master thinks all went well? (Does it? Am I just
missing something?)
Where can I start to investigate next?


I could fire several hundred jobs with a dummy shell script against that 
node, but as soon as one of my users tries a complex pipeline, jobs get 
lost without a slurm-*.out file.
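
(For context, a sketch of the kind of dummy test meant above -- not the 
exact script; partition and node name are taken from the output and logs 
below:)

--- %< ---
#!/bin/bash
# dummy.sh -- trivial batch job: report the host and sleep briefly
#SBATCH --partition=MC20GBplus
#SBATCH --ntasks=1
hostname
sleep 10

# submitted a few hundred times, pinned to the suspect node, e.g.:
#   for i in $(seq 1 300); do sbatch -w s17 dummy.sh; done
--- %< ---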

What do I fail to understand?

--- %< ---
3003|runAllPipelinePmc.sh|MC20GBplus||4|00:05:25|FAILED|1:0
3004|runAllPipelinePmc.sh|MC20GBplus||4|00:00:40|CANCELLED|0:0
3005|runAllPipelinePmc.sh|MC20GBplus||8|00:07:25|CANCELLED|0:0
3006|runAllPipelinePmc.sh|MC20GBplus||11|00:00:00|CANCELLED|0:0
3008|runAllPipelinePmc.sh|MC20GBplus||11|00:00:00|CANCELLED|0:0
3007|runAllPipelinePmc.sh|MC20GBplus||11|00:00:00|CANCELLED|0:0
3009|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3010|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3011|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3012|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3013|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3014|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3015|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3016|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3017|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|PENDING|0:0
3018|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3019|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3020|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3021|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3022|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3023|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3024|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3025|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3026|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
3027|runAllPipelinePmc.sh|MC20GBplus||4|00:00:00|FAILED|1:0
--- %< ---
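
(The listing above is parsable sacct output; something along the lines of 
the following should reproduce it -- the field order is guessed from the 
columns:)

--- %< ---
sacct -nP -S 2016-01-25 \
  -o JobID,JobName,Partition,Account,AllocCPUS,Elapsed,State,ExitCode
--- %< ---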

But 3003, 3024, 3025 and 3027 get a "job_complete ... success".
fgrep job_complete /var/log/slurm-llnl/slurmctld.log:
-
Jan 25 09:31:01 darwin slurmctld[12198]: sched: job_complete for 
JobId=3003 successful
Jan 25 10:25:06 darwin slurmctld[12198]: sched: job_complete for 
JobId=3024 successful
Jan 25 10:25:06 darwin slurmctld[12198]: sched: job_complete for 
JobId=3025 successful
Jan 25 10:25:25 darwin slurmctld[12198]: sched: job_complete for 
JobId=3027 successful



The slurm-3*.out files for all failed jobs are missing.
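
(For completeness, the corresponding node-side check -- the slurmd log on 
s17; the path is a guess based on the slurm-llnl layout used above:)

--- %< ---
# on s17: does slurmd mention one of the failed jobs at all?
grep 3024 /var/log/slurm-llnl/slurmd.log

# anything going wrong around job launch or writing the output file?
grep -iE 'error|fail' /var/log/slurm-llnl/slurmd.log | tail -n 50
--- %< ---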

SlurmctldDebug=9 gives me:
--- %< ---
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: cr_job_test: 
evaluating job 3027 on 1 nodes
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: _can_job_run_on_node: 
16 cpus on s17(1), mem 0/64000
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: eval_nodes:0 consec 
c=16 n=1 b=14 e=14 r=-1
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: cr_job_test: test 0 
pass - job fits on given resources
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: _can_job_run_on_node: 
16 cpus on s17(1), mem 0/64000
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: eval_nodes:0 consec 
c=16 n=1 b=14 e=14 r=-1
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: cr_job_test: test 1 
pass - idle resources found
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: cr_job_test: 
distributing job 3027
Jan 25 10:25:25 darwin slurmctld[12198]: cons_res: cr_job_test: job 3027 
ncpus 4 cbits 16/16 nbits 1
Jan 25 10:25:25 darwin slurmctld[12198]: DEBUG: job 3027 node s17 vpus 1 
cpus 4

Jan 25 10:25:25 darwin slurmctld[12198]: 
Jan 25 10:25:25 darwin slurmctld[12198]: job_id:3027 nhosts:1 ncpus:4 
node_req:1 nodes=s17

Jan 25 10:25:25 darwin slurmctld[12198]: Node[0]:
Jan 25 10:25:25 darwin slurmctld[12198]:   Mem(MB):23000:0  Sockets:2 
Cores:8  CPUs:4:0

Jan 25 10:25:25 darwin slurmctld[12198]:   Socket[0] Core[0] is allocated
Jan 25 10:25:25 darwin slurmctld[12198]:   Socket[0] Core[1] is allocated
Jan 25 10:25:25 darwin slurmctld[12198]:   Socket[1] Core[0] is allocated
Jan 25 10:25:25 darwin slurmctld[12198]:   Socket[1] Core[1] is allocated
Jan 25 10:25:25 darwin slurmctld[12198]: 
Jan 25 10:25:25 darwin slurmctld[12198]: cpu_array_value[0]:4 reps:1
Jan 25 10:25:25 darwin slurmctld[12198]: 
Jan 25 10:25:25 darwin slurmctld[12198]: DEBUG: Dump job_resources: 
nhosts 1 cb 0-1,8-9

Jan 25 10:25:25 darwin slurmctld[12198]: DEBUG: _add_job_to_res (after):
Jan 25 10:25:25 darwin slurmctld[12198]: part:MC20GBplus rows:1 pri:50
Jan 25 10:25:25 darwin slurmctld[12198]:   
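
(Aside: the controller debug level can also be changed at run time, which 
avoids restarting slurmctld for each experiment -- a sketch:)

--- %< ---
scontrol setdebug debug5   # raise slurmctld logging to the most verbose level
scontrol setdebug info     # drop it back to normal afterwards
--- %< ---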

[slurm-dev] Re: jobs vanishing w/o trace(?)

2016-01-22 Thread Benjamin Redling


On 16.01.2016 at 21:10, Benjamin Redling wrote:

I lose every job that gets allocated on a certain node (KVM instance).

[...]

Now I had to change the default route of the host because of a brittle
non-slurm instance running a web app.


After starting the unchanged instance several days later for another 
investigation, the problem is gone.
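
(In case someone hits the same thing: that points at routing between 
slurmctld and the slurmd inside the KVM instance. A quick sanity check 
from the node might look like this -- controller host name taken from the 
log lines above, default SlurmctldPort 6817 assumed:)

--- %< ---
# which route do packets to the controller actually take?
ip route get "$(getent hosts darwin | awk '{ print $1 }')"

# is slurmctld reachable on its port (default 6817)?
nc -zv darwin 6817

# does the controller answer at the Slurm level?
scontrol ping
--- %< ---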


Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321