Hi Steffen,

We handle a similar case with GRES. Define your own GRES on each node with a 
count of 8 and require users to request that resource. The case we have 
actually implemented uses a GRES called ‘one’ with a count of 1, to handle a 
set of jobs that follow a client-server model and clash on port allocation. 
Jobs that request gres=one:1 run at most one per node at a time. That 
meets what you ask, but a GRES such as my_sw:8 may be better if you want nodes 
to be shared by up to 8 instances from a mix of jobs (even if it would usually 
only be one job).
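
As a rough sketch of what that could look like (the node names, CPU count and 
script name below are placeholders, not taken from your cluster, and the exact 
syntax should be checked against the documentation for your Slurm version):

```
# slurm.conf (fragment; node names and CPU counts are placeholders)
GresTypes=my_sw
NodeName=node[01-20] CPUs=20 Gres=my_sw:8

# gres.conf (fragment; no File= entry, the resource is only a counter)
NodeName=node[01-20] Name=my_sw Count=8

# Submission, one simulation run per job, two cores and one my_sw slot each
# (run_simulation.sh is a hypothetical job script):
#   sbatch --gres=my_sw:1 --ntasks=1 --cpus-per-task=2 run_simulation.sh
```

With a count of 8 per node, the scheduler should stop placing such jobs on a 
node once 8 of them hold a my_sw slot, while the remaining cores stay 
available to other jobs.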

Gareth

From: Steffen Schuldenzucker [mailto:[email protected]]
Sent: Tuesday, 2 February 2016 6:15 AM
To: slurm-dev <[email protected]>
Subject: [slurm-dev] License restricting tasks per node with independent jobs

Dear all,

I have a set of simulation runs, each consisting of running a certain 
executable with a certain set of parameters.
Each simulation run uses two cores. Different simulation runs are independent 
of each other.

I have about 20 nodes with between 20 and 40 cores each.

My problem is that I'm using a proprietary programming language where licensing 
only allows me to run 8 parallel processes per node.

My question is how to handle this additional resource "license" using slurm.

Some approaches I tried:

1. Each simulation run is a job.

This leads to crashes because more than 8 jobs can be allocated to the same 
node.

2. The set of all simulation runs forms one job with sbatch --tasks-per-node=8. 
One simulation run is a parallel srun --exclusive call.

This should work, but I see an efficiency problem (please correct me if I'm 
wrong):

I'm creating basically my own private "pool" of a size specified by the value 
of --ntasks. Now it's not clear what that value should be (the optimal value 
would depend on the current usage of the cluster), and I also shouldn't have to 
worry about it: the job scheduler should decide which tasks to allocate where, 
not the user, and it should be done dynamically rather than statically.

3. 8 simulation runs form a job with sbatch --exclusive -N 1-1. One simulation 
run is a parallel srun --exclusive call.

This should work as well, but has a similar efficiency problem:
I'm allocating a full node per job, but each job can only use 8*2=16 cores out 
of the 20-40 available.

What would be ideal is alternative 1 or 3, but with an option like 
--exclusive-among-jobs-of-the-same-kind (whatever that means).

Any ideas?

Thanks a lot,
Steffen


--
Steffen Schuldenzucker
Ph.D. Student
Department of Informatics
University of Zurich
Binzmühlestrasse 14
CH-8050 Zürich

Room  BIN 2.A.25
Tel   +41 44 635 45 82
Mob   +49 176 5337 8181
Email [email protected]
Web   http://www.ifi.uzh.ch/ce/people/schuldenzucker.html
