[slurm-dev] Re: Share free cpus
Hello Jordan,

On 2016-01-16 01:21, Jordan Willis wrote:
> If my partition is used up according to the node configuration, but still has
> available CPUs, is there a way to allow a user who only has a task that
> takes 1 CPU onto that node?
>
> For instance, here is my partition:
>
> NODELIST    NODES  PARTITION  STATE  NODES(A/I)  CPUS  CPUS(A/I/O/T)   MEMORY
> loma-node[  38     all*       mix    38/0        16+   981/171/0/1152  64+
>
> According to the nodes, nothing is idling, but there are 171 available CPUs.
> Does anyone know what's going on? When a new user asks for 1 task, why can't
> they get one of those free CPUs? What should I change in my configuration?

Without seeing your configuration that's just guesswork. Are you using
"select/linear" and "Shared=NO"? With "select/linear" whole nodes are allocated
to jobs, so free CPUs on a busy node cannot be handed to another job.

Apart from that, you might want to look at the column "Resulting Behavior" to
get an idea of what you have to check in your config:
http://slurm.schedmd.com/cons_res_share.html
(A rough slurm.conf sketch follows at the end of this mail.)

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321
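The sketch mentioned above, purely illustrative: I don't know your actual
slurm.conf, and the node and partition names below are invented. The point is
that with the consumable-resource plugin Slurm allocates individual cores
rather than whole nodes, so a 1-CPU job can still land on a partially busy
node even with Shared=NO.

# slurm.conf fragment (sketch only, hypothetical names and sizes)
SelectType=select/cons_res
SelectTypeParameters=CR_Core            # hand out individual cores, not whole nodes
NodeName=node[01-38] CPUs=16 RealMemory=64000 State=UNKNOWN
PartitionName=all Nodes=node[01-38] Default=YES Shared=NO MaxTime=INFINITE State=UP

With "select/linear" instead, every job is given whole nodes, which would
match the picture you are seeing.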
[slurm-dev] jobs vanishing w/o trace(?)
Hello everybody,

I lose every job that gets allocated on a certain node (a KVM instance).

Background: to enable and test the resources of a cluster of new machines, I
run Slurm 2.6 inside a Debian 7 KVM instance, mainly because the hosts run
Debian 8 and the old cluster is Debian 7. I prefer the Debian packages and do
not want to build Slurm 2.6 from source. On top of that I need easy resource
isolation, because I don't have the luxury of using that cluster exclusively
for Slurm: Ganeti is running on the hosts, and I don't see how to handle Slurm
and Ganeti side by side reasonably well with cgroups.

So far that setup has worked reasonably well, and the performance loss is
negligible. Recently I had to change the default route of the host because of
a brittle non-Slurm instance with a web app. Since then, jobs that are
assigned to that KVM instance disappear, and there isn't even an error log.
Every other job I scancel on the other hosts gives me its log file (in the
user's home on an NFS4 file server, from which the job is submitted).

Job accounting says:

[...]
2253 MC20GBplus 1451848294 1451848294 10064 50 - - 0 runAllPipeline.sh 1 2 4 s2 (null)
2254 MC20GBplus 1451848298 1451848298 10064 50 - - 0 runAllPipeline.sh 1 2 4 s3 (null)
2255 MC20GBplus 1451848302 1451848302 10064 50 - - 0 runAllPipeline.sh 1 2 4 s4 (null)
2256 MC20GBplus 1451848306 1451848306 10064 50 - - 0 runAllPipeline.sh 1 2 4 s5 (null)
2257 MC20GBplus 1451848310 1451848310 10064 50 - - 0 runAllPipeline.sh 1 2 4 s7 (null)
2258 MC20GBplus 1451848313 1451848313 10064 50 - - 0 runAllPipeline.sh 1 2 4 s9 (null)
2259 MC20GBplus 1451848317 1451848317 10064 50 - - 0 runAllPipeline.sh 1 2 4 s10 (null)
2260 MC20GBplus 1451848320 1451848320 10064 50 - - 0 runAllPipeline.sh 1 2 4 s11 (null)
2261 MC20GBplus 1451848323 1451848323 10064 50 - - 0 runAllPipeline.sh 1 2 4 s12 (null)
2262 MC20GBplus 1451848326 1451848326 10064 50 - - 0 runAllPipeline.sh 1 2 4 s13 (null)
2263 MC20GBplus 1451848329 1451848329 10064 50 - - 0 runAllPipeline.sh 1 2 4 s15 (null)
2265 express 1451848341 1451848341 10064 50 - - 0 runAllPipeline.sh 1 2 4 darwin (null)
2267 MC20GBplus 1451848349 1451848349 10064 50 - - 0 runAllPipeline.sh 1 2 4 stemnet1 (null)
2268 express 1451900653 1451900653 10064 50 - - 0 runAllPipeline.sh 1 2 4 darwin (null)
2270 MC20GBplus 1451983401 1451983401 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2270 MC20GBplus 1451983401 1451983401 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2270 MC20GBplus 1451983401 1451983401 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2270 MC20GBplus 1451983401 1451983401 10064 50 - - 3 0 5 4294967295 256
2271 MC20GBplus 1451983452 1451983452 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2271 MC20GBplus 1451983452 1451983452 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
2271 MC20GBplus 1451983452 1451983452 10064 50 - - 0 runAllPipeline.sh 1 2 4 s17 (null)
[...]

But s17 (the KVM instance) _never_ gives results; the jobs disappear more or
less immediately.

Now I wonder: how is it at all possible that the jobs get lost? What happened,
such that the Slurm master thinks all went well? (Does it? Am I just missing
something?) Where can I start to investigate next? (My own rough checklist so
far is at the end of this mail.)

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321
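The checklist mentioned above; only a rough sketch, and I am simply assuming
the stock Slurm tools and default log handling here, nothing site-specific:

# On the controller: how does slurmctld see the node, and where do the
# daemons log at which verbosity?
scontrol show node s17
scontrol show config | grep -i -e slurmdlogfile -e slurmddebug

# On s17 itself: stop the slurmd service and run it in the foreground with
# verbose output, so nothing is lost even if the log file path is wrong
slurmd -D -vvv

# From the submit host, in a second shell: force one trivial task onto s17
# and watch the slurmd output above while it runs
srun -w s17 -n1 hostname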