[slurm-dev] Re: Slurm license management

2017-04-12 Thread mercan
Hi; I found my mistake: the "sacctmgr add resource" command does not work without the "cluster=CLUSTERNAME" parameter, but it does not complain either. With the cluster parameter it works as expected. Regards. Ahmet Mercan On 11.04.2017 at 11:26, merca...@itu.edu.tr wrote: Hi; We are using

[slurm-dev] Re: LDAP required?

2017-04-12 Thread Christopher Samuel
On 13/04/17 01:47, Jeff White wrote: > +1 for Active Directory bashing. I wasn't intending to "bash" AD here, just that the AD that we were trying to use (and I suspect that Lachlan might be talking to) has tens of thousands of accounts in it and we just could not get the Slurm->sssd->AD chain

[slurm-dev] How to allow Epilog script to run for job that is cancelled

2017-04-12 Thread Roger Moye
I have an unusual configuration question for the Slurm community. When a user cancels a job he sees this message: srun: Job step aborted: Waiting up to 2 seconds for job step to finish. Is there a way to lengthen this time? One of our users has constructed his jobs such that when a job is

[slurm-dev] Re: LDAP required?

2017-04-12 Thread Jeff White
On 04/11/2017 11:58 PM, Christopher Samuel wrote: On 11/04/17 16:05, Lachlan Musicman wrote: Our auth actually backs onto an Active Directory domain You have my sympathies. That caused us no end of headaches when we tried that on a cluster I help out on and in the end we gave up and fell

[slurm-dev] Re: Re:Best Way to Schedule Jobs based on predetermined Lists

2017-04-12 Thread dani
On 12/04/2017 02:19, maviko.wag...@fau.de wrote: Hello Thomas and others, thanks again for the feedback. I agree, I don't actually need Slurm for my small-scale cluster. However it's part of the

[slurm-dev] Re: Randomly jobs failures

2017-04-12 Thread Andrea del Monaco
Dear all, Today I have been trying to strace the slurmctld without any luck. I am able to see that the number of files in $StateSaveLocation (/cm/shared/apps/slurm/var/cm/statesave) increases (almost) immediately after a job is submitted. The only thing that I am able to see is that the

[slurm-dev] Re: Slurm leaving nodes in COMPLETING state

2017-04-12 Thread Sander Kuusemets
Thank you John. I don't quite understand why restarting the slurmctl daemon helps with cleaning up the cgroups then, though? -- Sander Kuusemets University of Tartu, High Performance Computing, IT Specialist Skype: sander.kuusemets1 +372 737 5694 On 04/12/2017 03:41 PM, Burian, John wrote:

[slurm-dev] Re: Slurm leaving nodes in COMPLETING state

2017-04-12 Thread Burian, John
If you're using proctrack/cgroup or task/cgroup, you may be waiting for cgroups cleanup to finish. On our cluster, a large batch of jobs that die immediately, or canceling a large batch of jobs all at once, leaves those jobs in CG state for some time. If I look at a node, I see that all the

[slurm-dev] Re: Slurm leaving nodes in COMPLETING state

2017-04-12 Thread Sander Kuusemets
Hello, Did you change /etc/slurm.conf? I have, several times. But I do this with configuration management tools, which restart the slurmd and slurmctld daemons afterwards. Do you have prolog scripts running on the compute nodes that might be stuck? No, I do not have any epi/prolog

[slurm-dev] Re: Slurm leaving nodes in COMPLETING state

2017-04-12 Thread Miguel Gila
Hello, Do you have prolog scripts running on the compute nodes that might be stuck (e.g. doing IO)? Is there any Slurm plugin that is issuing any external command (like the nhc that runs on Cray nodes) at job termination? M. -- Miguel Gila CSCS Swiss National Supercomputing Centre HPC

[slurm-dev] AW: Slurm leaving nodes in COMPLETING state

2017-04-12 Thread Benedikt Schäfer
Did you change /etc/slurm.conf? You can try: - on the clients: systemctl restart slurmd (be sure that the slurm service is down and only slurmd is running) - on the master: scontrol reconfigure Best regards, Benedikt ~ Benedikt Schaefer

[slurm-dev] Slurm leaving nodes in COMPLETING state

2017-04-12 Thread Sander Kuusemets
Hello! I have a problem with my Slurm cluster (16.05, 135 nodes, CentOS 7 across the cluster) where, after jobs complete, nodes are left in the COMPLETING state for a very long time. Right now, for example, I cancelled a 34 node MPI job, made sure that I killed every process on the nodes,

[slurm-dev] Re: LDAP required?

2017-04-12 Thread Raymond Wan
Markus, On Tue, Apr 11, 2017 at 4:20 PM, Markus Koeberl wrote: > On Tuesday 11 April 2017 08:17:00 Raymond Wan wrote: >> >> Dear all, >> >> Thank you all of you for the many helpful alternatives! >> >> Unfortunately, system administration isn't my main responsibility

[slurm-dev] Re: LDAP required?

2017-04-12 Thread Raymond Wan
Benjamin and Uwe, Thank you both for your advice about Ansible. I have not used it before but after a look over it, it does seem it'll be useful. I'll look into it further -- thank you! Ray On Tue, Apr 11, 2017 at 3:51 PM, Benjamin Redling wrote: > > Am 11.

[slurm-dev] Re: Randomly jobs failures

2017-04-12 Thread Andrea del Monaco
Dear Chris and Doug, Please find below the submission script: #!/bin/bash ## Job name: #SBATCH --job-name=snSN_ANX_z08 #SBATCH --dependency=singleton ## Project: #SBATCH --partition=cpuall,data,datanew #SBATCH --output=/gpfs/scratch/duverger/correlation/snrmat/zone08/logs/snSN_ANX_z08.log

[slurm-dev] Re: LDAP required?

2017-04-12 Thread Christopher Samuel
On 11/04/17 16:05, Lachlan Musicman wrote: > Our auth actually backs onto an Active Directory domain You have my sympathies. That caused us no end of headaches when we tried that on a cluster I help out on and in the end we gave up and fell back to running our own LDAP to make things reliable

[slurm-dev] Re: LDAP required?

2017-04-12 Thread Janne Blomqvist
On 2017-04-11 09:04, Lachlan Musicman wrote: > On 11 April 2017 at 02:36, Raymond Wan > wrote: > > > For SLURM to work, I understand from web pages such as > https://slurm.schedmd.com/accounting.html >