On 25.01.2016 at 16:41, Benjamin Redling wrote:
> I could fire several hundred jobs with a dummy shell script against that
> node but as soon as one of my users tries a complex pipeline jobs get
> lost with a slurm-*.out
       ^^^^ typo: lost _without_ a .out-file

Question:

> What do I fail to understand?

Answer: attentively reading slurmd log output from the affected node.

<--- %< --->
[2016-01-26T09:43:55] [3154] Could not open stdout file /var/[...].out:
No such file or directory
<--- %< --->

After reading that, it was rather obvious that all the successful test
cases run by the user and me had been launched from different,
accessible directories.
Until now we had not even noticed that we were switching test cases
from time to time.
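
For anyone hunting the same symptom, a minimal sketch of how the log
could be scanned for this message. It assumes the line layout seen in
the excerpt above (timestamp in brackets, a second bracketed field that
I take to be the job ID, then the message, all on one line in the actual
log). The names StdoutFailure and find_stdout_failures are made up for
illustration and are not part of Slurm.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class StdoutFailure(NamedTuple):
    """One 'Could not open stdout file' entry (hypothetical helper type)."""
    timestamp: str
    job: str      # assumption: the second bracketed field is the job ID
    path: str
    reason: str


# Layout assumed from the single log line quoted in this mail.
_PATTERN = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] \[(?P<job>[^\]]+)\] "
    r"Could not open stdout file (?P<path>.+): (?P<reason>.+)$"
)

# The line from the excerpt, re-joined; the elided path is kept as posted.
SAMPLE = (
    "[2016-01-26T09:43:55] [3154] Could not open stdout file "
    "/var/[...].out: No such file or directory"
)


def find_stdout_failures(lines: Iterable[str]) -> Iterator[StdoutFailure]:
    """Yield one record per log line reporting an unopenable stdout file."""
    for line in lines:
        match = _PATTERN.match(line.strip())
        if match:
            yield StdoutFailure(**match.groupdict())


if __name__ == "__main__":
    for failure in find_stdout_failures([SAMPLE]):
        print(failure.job, failure.path, failure.reason)
```

To use it on a real node, pass an open file object for the slurmd log
instead of the sample list; where that log lives depends on the
SlurmdLogFile setting in slurm.conf.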

Eventually the KVM instance running Slurm was not properly set up via
FAI (provisioning) or ansible (post-inst).

Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321
