After running a bunch of tests, it seems the problem is that, as of
the upgrade, the memory requested with sbatch / srun is being treated as a hard
limit. If a job process exceeds this amount, even for an instant, the job is
(essentially) killed and the node is put into a "drain" or "drng" (draining)
state. To restore the previous behavior I need a set of srun / sbatch /
slurm.conf options that amounts to: use the requested memory for scheduling
purposes, but allow processes to overrun / share memory (though not cores). The
Shared option when defining a partition doesn't seem to do this, based on the
online docs I could find.
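For reference, a minimal slurm.conf sketch of the kind of setup in question
(the actual config file isn't included in this thread, so the node names,
core counts, memory sizes, and partition name below are placeholders):

    # Placeholder example -- actual values depend on the cluster.
    SelectType=select/cons_res
    # CR_Core_Memory makes memory a consumable, scheduled resource;
    # CR_Core schedules cores only and leaves memory untracked.
    SelectTypeParameters=CR_Core_Memory
    NodeName=node[01-04] CPUs=16 RealMemory=64000 State=UNKNOWN
    PartitionName=batch Nodes=node[01-04] Shared=YES Default=YES State=UP
    # Default memory per CPU (MB) applied when a job requests none.
    DefMemPerCPU=2000

If the task/cgroup plugin is in use, the ConstrainRAMSpace setting in
cgroup.conf is another place enforcement behavior can differ between versions;
that is an assumption about this cluster, not something stated in the thread.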
Thanks,
~Mike C.
From: Morris Jette [mailto:[email protected]]
Sent: Friday, September 26, 2014 11:44 AM
To: slurm-dev
Subject: [slurm-dev] Re: change in node sharing with new(er) version?
I can't think of any relevant changes. Your config files would help a lot.
On September 26, 2014 11:32:38 AM PDT, Michael Colonno <[email protected]>
wrote:
Hi All ~
I just upgraded a cluster several versions (from 2.5.2 to 14.03.8); no
changes were made to the config file (slurm.conf). Prior to the upgrade the
cluster was configured to allow more than one job to run on a given node
(specifying cores, memory, etc.). After the upgrade, all jobs seem to be
allocated as if they require exclusive nodes (as if the --exclusive flag
were used) and don't seem to be sharing nodes. I'm guessing there was a change
in the config file syntax for resource allocation, but I can't find anything in
the docs. Any thoughts?
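One quick way to check the sharing behavior (the job sizes below are
hypothetical; any two sub-node test jobs would do) is to submit two jobs that
each request half a node and see whether they land on the same node:

    # Two jobs that each request half a node's cores and memory.
    sbatch -n 8 --mem=16000 --wrap="sleep 300"
    sbatch -n 8 --mem=16000 --wrap="sleep 300"
    squeue -o "%.8i %.9P %.8T %R"     # do both jobs show the same node?
    scontrol show config | grep -i selecttype   # SelectType / SelectTypeParameters
    scontrol show partition                     # check the Shared= field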
Thanks,
~Mike C.
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.