[slurm-dev] Re: Share free cpus

2016-01-16 Thread Benjamin Redling

Hello Jordan,

On 2016-01-16 01:21, Jordan Willis wrote:
> If my partition is used up according to the node configuration, but still has
> available CPUs, is there a way to allow a user who only has a task that
> takes 1 CPU to get onto that node?
> 
> For instance here is my partition:
> 
> NODELIST    NODES  PARTITION  STATE  NODES(A/I)  CPUS  CPUS(A/I/O/T)   MEMORY
> loma-node[  38     all*       mix    38/0        16+   981/171/0/1152  64+
> 
> 
> According to the nodes, there is nothing idling, but there are 171 available
> CPUs. Does anyone know what’s going on? When a new user asks for 1 task, why
> can’t they get onto one of those free CPUs? What should I change in my
> configuration?

Without seeing your configuration, that's just guesswork.
Are you using "select/linear" and "Shared=NO"?

Apart from that, you might want to look at the "Resulting Behavior"
column to get an idea of what to check in your config:
http://slurm.schedmd.com/cons_res_share.html
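
For comparison, here is a minimal sketch of a consumable-resources setup
that lets single-CPU jobs land on partially used nodes. The node list and
partition line below are placeholders, not taken from your config:

  SelectType=select/cons_res
  SelectTypeParameters=CR_CPU
  PartitionName=all Nodes=loma-node[...] Default=YES Shared=NO State=UP

With select/cons_res, individual CPUs are the schedulable unit, so a
1-CPU job can run next to other jobs on the same node even with
Shared=NO; with select/linear and Shared=NO, whole nodes are allocated
exclusively and the "free" CPUs you see stay unused.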

Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] jobs vanishing w/o trace(?)

2016-01-16 Thread Benjamin Redling

Hello everybody,

I lose every job that gets allocated on a certain node (a KVM instance).

Background:
To enable and test the resources of a cluster of new machines, I run
Slurm 2.6 inside a Debian 7 KVM instance, mainly because the hosts run
Debian 8 and the old cluster is Debian 7. I prefer the Debian packages
and do not want to build Slurm 2.6 from source. On top of that, I need
easy resource isolation, because I don't have the luxury of using that
cluster exclusively for Slurm: Ganeti is running on the hosts, and I
don't see how to handle Slurm and Ganeti side by side reasonably well
with cgroups.
So far that setup has worked reasonably well; the performance loss is
negligible.

Now I had to change the default route of the host because of a brittle
non-Slurm instance with a web app.
Since then, jobs that are assigned to that KVM instance disappear,
and there isn't even an error log.
Every other job I scancel on other hosts gives me its log file (in the
user's home on an NFS4 file server, from where the job is submitted).
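
As a first sanity check I will probably submit a trivial test job pinned
to that instance with an explicit, absolute output path, roughly

  sbatch -w s17 -o /tmp/slurmtest_%j.out --wrap="hostname; date; sleep 30"

to see whether output is never produced on s17 at all or simply never
reaches the NFS home. (The output path and the wrapped commands are only
an example, of course.)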

Job accounting says:
[...]
2253 MC20GBplus 1451848294 1451848294 10064 50 - - 0 runAllPipeline.sh 1 2 4 s2 (null)
2254 MC20GBplus 1451848298 1451848298 10064 50 - - 0 runAllPipeline.sh 1 2 4 s3 (null)
2255 MC20GBplus 1451848302 1451848302 10064 50 - - 0 runAllPipeline.sh 1 2 4 s4 (null)
2256 MC20GBplus 1451848306 1451848306 10064 50 - - 0 runAllPipeline.sh 1 2 4 s5 (null)
2257 MC20GBplus 1451848310 1451848310 10064 50 - - 0 runAllPipeline.sh 1 2 4 s7 (null)
2258 MC20GBplus 1451848313 1451848313 10064 50 - - 0 runAllPipeline.sh 1 2 4 s9 (null)
2259 MC20GBplus 1451848317 1451848317 10064 50 - - 0 runAllPipeline.sh 1 2 4 s10 (null)
2260 MC20GBplus 1451848320 1451848320 10064 50 - - 0 runAllPipeline.sh 1 2 4 s11 (null)
2261 MC20GBplus 1451848323 1451848323 10064 50 - - 0 runAllPipeline.sh 1 2 4 s12 (null)
2262 MC20GBplus 1451848326 1451848326 10064 50 - - 0 runAllPipeline.sh 1 2 4 s13 (null)
2263 MC20GBplus 1451848329 1451848329 10064 50 - - 0 runAllPipeline.sh 1 2 4 s15 (null)
2265 express 1451848341 1451848341 10064 50 - - 0 runAllPipeline.sh 1 2 4 darwin (null)
2267 MC20GBplus 1451848349 1451848349 10064 50 - - 0 runAllPipeline.sh 1 2 4 stemnet1 (null)
2268 express 1451900653 1451900653 10064 50 - - 0 runAllPipeline.sh 1 2 4 darwin (null)
2270 MC20GBplus 1451983401 1451983401 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2270 MC20GBplus 1451983401 1451983401 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2270 MC20GBplus 1451983401 1451983401 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2270 MC20GBplus 1451983401 1451983401 10064 50 - - 3 0 5 4294967295 256
2271 MC20GBplus 1451983452 1451983452 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2271 MC20GBplus 1451983452 1451983452 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2271 MC20GBplus 1451983452 1451983452 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
[...]

But s17 (the KVM instance) _never_ gives results; the jobs disappear
more or less immediately.
Now I wonder: how is it possible at all that the jobs get lost? What
happened so that the Slurm master thinks all went well? (Does it? Am I
just missing something?)
Where can I start to investigate next?
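
For reference, these are the checks I know of. Job id 2270 is taken from
the accounting listing above, and the log paths are the Debian
slurm-llnl defaults; SlurmctldLogFile/SlurmdLogFile may point elsewhere:

  sacct -j 2270 --long                         # accounting view of one of the vanished jobs
  scontrol show job 2270                       # only while the job is still known to slurmctld
  grep 2270 /var/log/slurm-llnl/slurmctld.log  # controller's view
  grep 2270 /var/log/slurm-llnl/slurmd.log     # run on s17 itself

If there is something beyond these that records what slurmd on s17 did
with the jobs, I'd be glad to know.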

Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321