[slurm-dev] Re: Slurm version 14.03.0 is now available

2014-03-31 Thread Morris Jette
Also see configuration parameter SlurmdTimeout.

On March 31, 2014 7:26:30 PM PDT, Christopher Samuel wrote:
> On 01/04/14 11:44, Morris Jette wrote:
>> See job option --no-kill
>
> Wonderful!
>
> We might tweak the logic of that with a local patch ...
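For reference, SlurmdTimeout is set in slurm.conf; a minimal sketch, with an arbitrary example value rather than a recommendation:

    # slurm.conf -- how long slurmctld waits for a non-responding slurmd
    # before setting the node state to DOWN (value here is illustrative only)
    SlurmdTimeout=300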

[slurm-dev] Re: Slurm version 14.03.0 is now available

2014-03-31 Thread Christopher Samuel
On 01/04/14 11:44, Morris Jette wrote:
> See job option --no-kill

Wonderful!

We might tweak the logic of that with a local patch so that every job has it enabled by default and using those options disables it in case anyone ever wants it.

Thanks M ...

[slurm-dev] Re: Slurm version 14.03.0 is now available

2014-03-31 Thread Morris Jette
See job option --no-kill

On March 31, 2014 5:40:29 PM PDT, Franco Broi wrote:
> On Mon, 2014-03-31 at 15:58 -0700, je...@schedmd.com wrote:
>> Quoting Christopher Samuel :
>> [killing jobs when nodes go down]
>>> I agree. I can see that a parallel job that loses a node could be restarted ...
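For anyone searching the archive later, the option can be given on the command line or inside the batch script; the script name and node count below are made up for illustration:

    # keep the allocation alive even if one of its nodes fails
    sbatch --no-kill -N4 my_job.sh

    # or, inside the batch script itself
    #SBATCH --no-kill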

[slurm-dev] Re: Slurm version 14.03.0 is now available

2014-03-31 Thread Franco Broi
On Mon, 2014-03-31 at 15:58 -0700, je...@schedmd.com wrote:
> Quoting Christopher Samuel :
> [killing jobs when nodes go down]
>> I agree. I can see that a parallel job that loses a node could be
>> restarted but maybe Slurm should ping the node before deciding that
>> it's definitely down ...

[slurm-dev] Re: sacct access for end users

2014-03-31 Thread Christopher Samuel
On 30/03/14 07:15, Jagga Soorma wrote:
> Or is there a different preferred way to achieve this or a
> different command that the end users should be running:

We use slurmdbd to store all our usage information in a central MySQL database, so users can ...
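The kind of query a user can run for themselves once slurmdbd is in place; the start date and field list are only an example:

    # by default sacct reports the invoking user's own jobs
    sacct --starttime 2014-03-01 --format=JobID,JobName,Partition,Elapsed,State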

[slurm-dev] Re: Slurm version 14.03.0 is now available

2014-03-31 Thread Christopher Samuel
On 01/04/14 09:58, je...@schedmd.com wrote:
> If your app doesn't recognize the down node and exit, slurm merely
> draining the node will leave the job's entire allocation in place
> until the end of its time limit.

That's true, but we can refund t...

[slurm-dev] Re: Slurm version 14.03.0 is now available

2014-03-31 Thread jette
Quoting Christopher Samuel :
[killing jobs when nodes go down]
> I agree. I can see that a parallel job that loses a node could be
> restarted but maybe Slurm should ping the node before deciding that
> it's definitely down. For us Open-MPI catches those sorts of things
> for us, so I think we'd rather ...

[slurm-dev] Re: Slurm version 14.03.0 is now available

2014-03-31 Thread Christopher Samuel
On 31/03/14 19:57, Franco Broi wrote:
[nodes going down on migration from 2.6 to 14.03]
> The slurmctld log just shows them as unresponsive and I don't have
> logging on the cluster nodes - never needed it before.

No worries, we're going to do some ...
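If it helps, slurmd logging on the compute nodes is controlled from slurm.conf; a minimal sketch, with the path and level chosen only as examples:

    # slurm.conf on the compute nodes -- illustrative values
    SlurmdLogFile=/var/log/slurm/slurmd.log
    SlurmdDebug=3    # roughly "info"; raise the number for more verbose output while debugging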

[slurm-dev] Re: Gres GPU Problem with new slurm cluster

2014-03-31 Thread David Bigagli
You may want to look at the variable ReturnToService in slurm.conf:
http://slurm.schedmd.com/slurm.conf.html

On 03/29/2014 01:11 PM, Jagga Soorma wrote:
> Okay, so looks like I just had to clear the node manually using scontrol
> after updating the gres.conf on each node. Isn't there a way to have ...
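For the archives, the manual step being described and the slurm.conf knob look roughly like this; the node name is made up:

    # clear a node left DOWN/DRAINED after fixing gres.conf on it
    scontrol update nodename=gpu01 state=resume

    # slurm.conf: let a node marked DOWN for non-responsiveness return to
    # service automatically once it registers with a valid configuration
    ReturnToService=1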

[slurm-dev] Re: heterogeneous cluster -- memory and accounting

2014-03-31 Thread Ulf Markwardt
You are correct that the fair-share calculation is currently based only on CPU allocations. We plan to put in place a framework that will permit charging for additional resources, probably available late in 2014 with the next release. Charges would be based upon configurable weight factors applied ...

[slurm-dev] Re: Slurm version 14.03.0 is now available

2014-03-31 Thread Franco Broi
On Sun, 2014-03-30 at 17:33 -0700, Christopher Samuel wrote:
> On 28/03/14 11:02, Franco Broi wrote:
>> Just an update on this, after running with the new version of slurm
>> on the control node overnight, this morning several nodes running
>> 2.6.5 were showing as down. Restarted the local ...
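A quick way to see which nodes the controller considers down, and the recorded reason, before digging into the logs; the node name is only an example:

    # list down/drained/failing nodes together with the reason slurmctld recorded
    sinfo -R

    # inspect one of the affected nodes in detail
    scontrol show node node01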