Hello everybody,

I lose every job that gets allocated to a certain node (a KVM instance).

Background:
To enable and test the resources of a cluster of new machines, I run
Slurm 2.6 inside a Debian 7 KVM instance, mainly because the hosts run
Debian 8 while the old cluster is Debian 7. I prefer the Debian packages
and do not want to build Slurm 2.6 from source; on top of that I need
easy resource isolation, because I don't have the luxury of using that
cluster exclusively for Slurm: Ganeti is running on the hosts, and I
don't see how to handle Slurm and Ganeti side by side reasonably well
with cgroups.
So far that setup has worked reasonably well; the performance loss is negligible.

Now I had to change the default route of the host because of a brittle
non-Slurm instance running a web app.
Since then, jobs that are allocated to that KVM instance disappear,
and there isn't even an error log.
Every other job I scancel on other hosts leaves me the log file (in
the user's home on an NFS4 file server, which is also where the jobs are
submitted from).
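For reference, this is roughly how I compare the routing before and after
the change; the host names in angle brackets are placeholders, not my real setup:

  # on the Debian 8 host
  ip route show default
  # inside the Debian 7 KVM instance (s17)
  ip route show default
  # the Slurm master and the NFS4 server must still be reachable from s17
  ping -c 1 <slurm-master>
  ping -c 1 <nfs-server>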

Job accounting says:
[...]
2253 MC20GBplus 1451848294 1451848294 10064 50 - - 0 runAllPipeline.sh 1 2 4 s2 (null)
2254 MC20GBplus 1451848298 1451848298 10064 50 - - 0 runAllPipeline.sh 1 2 4 s3 (null)
2255 MC20GBplus 1451848302 1451848302 10064 50 - - 0 runAllPipeline.sh 1 2 4 s4 (null)
2256 MC20GBplus 1451848306 1451848306 10064 50 - - 0 runAllPipeline.sh 1 2 4 s5 (null)
2257 MC20GBplus 1451848310 1451848310 10064 50 - - 0 runAllPipeline.sh 1 2 4 s7 (null)
2258 MC20GBplus 1451848313 1451848313 10064 50 - - 0 runAllPipeline.sh 1 2 4 s9 (null)
2259 MC20GBplus 1451848317 1451848317 10064 50 - - 0 runAllPipeline.sh 1 2 4 s10 (null)
2260 MC20GBplus 1451848320 1451848320 10064 50 - - 0 runAllPipeline.sh 1 2 4 s11 (null)
2261 MC20GBplus 1451848323 1451848323 10064 50 - - 0 runAllPipeline.sh 1 2 4 s12 (null)
2262 MC20GBplus 1451848326 1451848326 10064 50 - - 0 runAllPipeline.sh 1 2 4 s13 (null)
2263 MC20GBplus 1451848329 1451848329 10064 50 - - 0 runAllPipeline.sh 1 2 4 s15 (null)
2265 express 1451848341 1451848341 10064 50 - - 0 runAllPipeline.sh 1 2 4 darwin (null)
2267 MC20GBplus 1451848349 1451848349 10064 50 - - 0 runAllPipeline.sh 1 2 4 stemnet1 (null)
2268 express 1451900653 1451900653 10064 50 - - 0 runAllPipeline.sh 1 2 4 darwin (null)
2270 MC20GBplus 1451983401 1451983401 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2270 MC20GBplus 1451983401 1451983401 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2270 MC20GBplus 1451983401 1451983401 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2270 MC20GBplus 1451983401 1451983401 10064 50 - - 3 0 5 4294967295 256
2271 MC20GBplus 1451983452 1451983452 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2271 MC20GBplus 1451983452 1451983452 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2271 MC20GBplus 1451983452 1451983452 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
[...]
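The dump above is the raw accounting data; assuming sacct can read the
accounting storage backend here, a query along these lines (standard
sacct format fields, job IDs taken from the dump) should show the
recorded state and exit code of the lost jobs more readably:

  sacct -j 2270 --format=JobID,JobName,State,ExitCode,NodeList,Start,End
  sacct -j 2271 --format=JobID,JobName,State,ExitCode,NodeList,Start,End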

But s17 (the KVM instance) _never_ gives results; the jobs disappear
more or less immediately.
Now I wonder: how is it possible at all for the jobs to get lost? What
happened such that the Slurm master thinks everything went well? (Does
it? Am I just missing something?)
Where can I start to investigate next?
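In case it matters, this is roughly what I was planning to check next
(log paths assumed from the Debian slurm-llnl packages; adjust to
whatever SlurmdLogFile/SlurmctldLogFile point to in slurm.conf):

  # on s17: node state as the controller sees it, and what slurmd logs when a job arrives
  scontrol show node s17
  tail -f /var/log/slurm-llnl/slurmd.log
  # on the master: what slurmctld recorded for one of the lost jobs
  grep 2270 /var/log/slurm-llnl/slurmctld.log
  # force a trivial job onto s17 and watch its stdout directly
  srun -w s17 -N1 hostname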

Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321
