A bit more data: it seems the users are requesting both an
allocation of cores and memory when submitting jobs, but there is no guarantee
(that I'm aware of) that the application is actually limited to the memory
requested. Could this be the root cause of this state?
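If the lack of enforcement turns out to be the cause, one possible fix (a
sketch only, untested on this cluster; the parameters are from the slurm.conf
and cgroup.conf man pages) would be the cgroup task plugin, so each job is
actually capped at the memory it requested:

# slurm.conf
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes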
Thanks,
~Mike C.
From: Michael Colonno [mailto:[email protected]]
Sent: Friday, September 26, 2014 11:49 AM
To: slurm-dev
Subject: [slurm-dev] Re: change in node sharing with new(er) version?
Relevant portion of the config file is below – pretty vanilla, and I
don't think that's the cause after some more time spent debugging. The nodes in
question are in state “drng” (draining), which I have not seen before. sinfo reports “Low
RealMemory” for them, and this must be the reason additional jobs aren't being
scheduled on the offending nodes. So it seems there have been some changes in
resource monitoring. Prior to the upgrade, more than one job would coexist on
these systems without this warning (and may have been fighting for memory
at times – unknown).
Thanks,
~Mike C.
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory   # cores and memory both treated as consumable resources
FastSchedule=1                        # schedule against the node specs in slurm.conf
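For reference, this is roughly how I've been inspecting the drained nodes
(node name n001 is just an example), comparing the RealMemory configured in
slurm.conf against what the node actually reports:

# list drain/down reasons ("Low RealMemory" shows up here)
sinfo -R

# configured vs. detected memory for one node
scontrol show node n001

# after correcting RealMemory in the node definition, clear the drain
scontrol update NodeName=n001 State=RESUME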
From: Morris Jette [mailto:[email protected]]
Sent: Friday, September 26, 2014 11:44 AM
To: slurm-dev
Subject: [slurm-dev] Re: change in node sharing with new(er) version?
I can't think of any relevant changes. Your config files would help a lot.
On September 26, 2014 11:32:38 AM PDT, Michael Colonno <[email protected]>
wrote:
Hi All ~
I just upgraded a cluster several versions (from 2.5.2 to 14.03.8); no
changes were made to the config file (slurm.conf). Prior to the upgrade, the
cluster was configured to allow more than one job to run on a given node
(specifying cores, memory, etc.). After the upgrade, all jobs seem to be
allocated as if they required exclusive nodes (as if the --exclusive flag
had been used) and don't seem to be sharing nodes. I'm guessing there was a change
in the config file syntax for resource allocation, but I can't find anything in
the docs. Any thoughts?
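For what it's worth, this is the kind of test I've been using to check
sharing (the sizes are arbitrary examples, far smaller than the nodes):

# two small jobs that previously would have shared a node
sbatch -n 1 --mem=1024 --wrap="sleep 300"
sbatch -n 1 --mem=1024 --wrap="sleep 300"

# see which nodes they landed on
squeue -o "%.8i %.6D %N"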
Thanks,
~Mike C.
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.