After running a bunch of tests it looks like the problem is that, as of
the upgrade, the memory requested with sbatch / srun is being treated as a hard
limit. If a job process exceeds that amount, even for an instant, the job is
(essentially) killed and the node is put into a “drain” or “drng” state.
To restore the previous behavior I need a combination of srun / sbatch /
slurm.conf options that amounts to: use the requested memory for scheduling
purposes, but allow processes to overrun / share memory (though not cores).
Based on the online docs I could find, the Shared option in a partition
definition doesn't appear to do this.
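For reference, the kind of settings I'm hoping for would look something like
the sketch below. These are guesses pulled from the slurm.conf and cgroup.conf
man pages rather than anything I've verified on 14.03, and the values are
placeholders, so corrections are welcome:

    # slurm.conf -- schedule on requested memory without hard-enforcing it
    SelectType=select/cons_res
    SelectTypeParameters=CR_Core_Memory   # memory counts in scheduling decisions
    DefMemPerCPU=2000                     # default per-CPU request (MB) if a job gives none
    VSizeFactor=0                         # 0 = no virtual memory limit imposed on jobs

    # cgroup.conf -- only applies if TaskPlugin=task/cgroup is in use
    ConstrainCores=yes                    # keep tasks confined to their allocated cores
    ConstrainRAMSpace=no                  # don't let the memory cgroup kill jobs that overrun

If one of those option names is wrong for 14.03, or there's a cleaner way to
get this behavior, please let me know.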

 

            Thanks,

            ~Mike C. 

 

From: Morris Jette [mailto:[email protected]] 
Sent: Friday, September 26, 2014 11:44 AM
To: slurm-dev
Subject: [slurm-dev] Re: change in node sharing with new(er) version?

 

I can't think of any relevant changes. Your config files would help a lot.

On September 26, 2014 11:32:38 AM PDT, Michael Colonno <[email protected]> 
wrote:


 Hi All ~

 I just upgraded a cluster several versions (from 2.5.2 to 14.03.8); no
changes were made to the config file (slurm.conf). Prior to the upgrade the
cluster was configured to allow more than one job to run on a given node
(with jobs specifying cores, memory, etc.). After the upgrade, jobs are
allocated as if they require exclusive nodes (as if the --exclusive flag had
been used) and no longer share nodes. I'm guessing there was a change in the
config file syntax for resource allocation, but I can't find anything in the
docs. Any thoughts?

 Thanks,
 ~Mike C. 


-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
