Hi Steffen,

We handle a similar case with GRES (generic resources). Define your own GRES on each node with a count of 8 and require users to request that resource. The case we have actually implemented uses a GRES called 'one' with a count of 1, to handle a set of jobs that follow a client-server model and clash on port allocation. Jobs that request gres=one:1 run at most one per node at a time. That approach meets what you ask, but a GRES such as my_sw:8 may be better if you want nodes to be shared by up to 8 instances from a mix of jobs (even if it would usually only be one job).
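As a rough sketch of the my_sw:8 variant, assuming a count-only GRES with no device files: the node names (node[01-20]), the script name, and the executable (./simulate) below are placeholders, and each simulation run is assumed to be one job requesting one unit of my_sw. Whether the gres.conf entry is needed depends on your Slurm version, so treat this as a starting point rather than a tested configuration.

```
# slurm.conf (fragment): declare the GRES type and add it to the existing node definitions
GresTypes=my_sw
NodeName=node[01-20] Gres=my_sw:8   # keep your existing CPU/memory settings on this line

# gres.conf (fragment): count-only resource, no device files
NodeName=node[01-20] Name=my_sw Count=8

# run_one.sh: one simulation run per job, two cores, one license slot
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --gres=my_sw:1
srun ./simulate "$@"
```

Submitted as, for example, `sbatch run_one.sh <parameters>`, each job would consume one of the 8 my_sw units on whichever node it lands on, so the scheduler should not place a ninth such job there while leaving the remaining cores free for other work.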
Gareth

From: Steffen Schuldenzucker [mailto:[email protected]]
Sent: Tuesday, 2 February 2016 6:15 AM
To: slurm-dev <[email protected]>
Subject: [slurm-dev] License restricting tasks per node with independent jobs

Dear all,

I have a set of simulation runs, each consisting of running a certain executable with a certain set of parameters. Each simulation run uses two cores. Different simulation runs are independent of each other. I have about 20 nodes with between 20 and 40 cores each.

My problem is that I'm using a proprietary programming language where licensing only allows me to run 8 parallel processes per node.

My question is how to handle this additional resource "license" using slurm. Some approaches I tried:

1. Each simulation run is a job. This leads to crashes because more than 8 jobs can be allocated to the same node.

2. The set of all simulation runs forms one job with sbatch --tasks-per-node=8. One simulation run is a parallel srun --exclusive call. This should work, but I see an efficiency problem (please correct me if I'm wrong): I'm creating basically my own private "pool" of a size specified by the value of --ntasks. Now it's not clear what that value should be (the optimal value would depend on the current usage of the cluster), and I also shouldn't have to worry about it: the job scheduler should decide which tasks to allocate where, not the user, and it should be done dynamically rather than statically.

3. 8 simulation runs form a job with sbatch --exclusive -N 1-1. One simulation run is a parallel srun --exclusive call. This should work as well, but has a similar efficiency problem: I'm allocating a full node per job, but each job can only use 8*2=16 cores, out of the 20-40 ones available.

What would be ideal is alternative 1 or 3, but with an option like --exclusive-among-jobs-of-the-same-kind (whatever that means).

Any ideas?

Thanks a lot,

Steffen

--
Steffen Schuldenzucker
Ph.D. Student
Department of Informatics
University of Zurich
Binzmühlestrasse 14
CH-8050 Zürich
Room BIN 2.A.25
Tel +41 44 635 45 82
Mob +49 176 5337 8181
Email [email protected]
Web http://www.ifi.uzh.ch/ce/people/schuldenzucker.html
