Also see configuration parameter SlurmdTimeout.
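For reference, SlurmdTimeout is set in slurm.conf; a minimal fragment (the 300-second value here is only an illustration, not a recommendation):

```
# slurm.conf (fragment) - illustrative value
# If slurmctld gets no response from a node's slurmd for longer than
# SlurmdTimeout seconds, the node is marked DOWN.
SlurmdTimeout=300
```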
On March 31, 2014 7:26:30 PM PDT, Christopher Samuel wrote:

On 01/04/14 11:44, Morris Jette wrote:
> See job option --no-kill

Wonderful!

We might tweak the logic of that with a local patch so that every job
has it enabled by default and using those options disables it in case
anyone ever wants it.

Thanks M
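A sketch of how the option might appear in a batch script (the job name, node count, and application are placeholders, not from the thread):

```
#!/bin/bash
#SBATCH --job-name=resilient-job   # placeholder name
#SBATCH --nodes=4                  # placeholder node count
#SBATCH --no-kill                  # do not kill the job automatically
                                   # if an allocated node fails; the
                                   # application must handle the loss
srun ./my_app                      # my_app is a placeholder
```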
See job option --no-kill

On March 31, 2014 5:40:29 PM PDT, Franco Broi wrote:

On Mon, 2014-03-31 at 15:58 -0700, je...@schedmd.com wrote:
> Quoting Christopher Samuel :
> > [killing jobs when nodes go down]
> >> I agree. I can see that a parallel job that loses a node could be
> >> restarted but maybe Slurm should ping the node before deciding that
> >> it's definitely down.
[…]
On 30/03/14 07:15, Jagga Soorma wrote:
> Or is there a different preferred way to achieve this or a
> different command that the end users should be running:

We use slurmdbd to store all our usage information in a central MySQL
database, so users can […]
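With accounting recorded through slurmdbd, one way end users can pull their own usage back out is with sacct (a sketch; the start date and format fields are illustrative choices):

```
# Per-job usage for the invoking user since 1 March 2014 (illustrative):
sacct --starttime=2014-03-01 \
      --format=JobID,JobName,Partition,Elapsed,AllocCPUS,State

# Aggregated per-user utilization (typically run by administrators):
sreport cluster AccountUtilizationByUser start=2014-03-01
```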
On 01/04/14 09:58, je...@schedmd.com wrote:
> If your app doesn't recognize the down node and exit, slurm merely
> draining the node will leave the job's entire allocation in place
> until the end of its time limit.

That's true, but we can refund t[…]
Quoting Christopher Samuel :
> [killing jobs when nodes go down]
> I agree. I can see that a parallel job that loses a node could be
> restarted but maybe Slurm should ping the node before deciding that
> it's definitely down.

For us Open-MPI catches those sorts of things for us, so I think we'd
rat[…]
On 31/03/14 19:57, Franco Broi wrote:
> [nodes going down on migration from 2.6 to 14.03]
> The slurmctld log just shows them as unresponsive and I don't have
> logging on the cluster nodes - never needed it before.

No worries, we're going to do some […]
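Per-node slurmd logging of the kind mentioned here is enabled in slurm.conf; a fragment (the path and debug level shown are illustrative):

```
# slurm.conf (fragment) - illustrative values
SlurmdDebug=3                            # numeric verbosity level
SlurmdLogFile=/var/log/slurm/slurmd.log  # per-node slurmd log
```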
You may want to look at the variable ReturnToService in slurm.conf
http://slurm.schedmd.com/slurm.conf.html
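A sketch of the relevant fragment:

```
# slurm.conf (fragment)
# 0 = a node set DOWN as unresponsive stays DOWN until an administrator
#     intervenes; 1 = it returns to service when it registers again
#     with a valid configuration.
ReturnToService=1
```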
On 03/29/2014 01:11 PM, Jagga Soorma wrote:
> Okay, so looks like I just had to clear the node manually using
> scontrol after updating the gres.conf on each node. Isn't there a way
> to hav[…]
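Clearing the node manually, as described, might look like this (the node name is a placeholder):

```
# Return a node left DOWN/DRAINED after a gres.conf edit:
scontrol update NodeName=node001 State=RESUME

# Ask slurmctld to re-read its configuration files without a restart:
scontrol reconfigure
```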
You are correct that the fair-share calculation is currently based only
on CPU allocations. We plan to put in place a framework that will permit
charging for additional resources, probably available late in 2014 with
the next release. Charges would be based upon configurable weight
factors applied […]
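The CPU-based fair-share factor referred to here is, in simplified form, F = 2^(-usage/share) as described in Slurm's multifactor priority documentation. The sketch below illustrates that relationship only; it is not Slurm's exact implementation, and the sample numbers are invented:

```shell
# Simplified fair-share factor:  F = 2^(-(normalized_usage / normalized_share))
# A sketch of the relationship, not Slurm's actual code.
fairshare_factor() {
    awk -v u="$1" -v s="$2" 'BEGIN { printf "%.4f\n", 2 ^ (-(u / s)) }'
}

fairshare_factor 0.10 0.10   # usage equal to share  -> 0.5000
fairshare_factor 0.05 0.10   # under-using the share -> 0.7071
fairshare_factor 0.20 0.10   # over-using the share  -> 0.2500
```

Under-served accounts trend toward 1.0 and over-served accounts toward 0.0, which is what lets the scheduler boost neglected users' priority.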
On Sun, 2014-03-30 at 17:33 -0700, Christopher Samuel wrote:
> On 28/03/14 11:02, Franco Broi wrote:
>
> > Just an update on this, after running with the new version of slurm
> > on the control node overnight, this morning several nodes running
> > 2.6.5 were showing as down. Restarted the local […]