Let's say you got a full dollar! Yes, I'm using task/affinity and not task/cgroup....

Should I use task/cgroup then?

On 10/14/11 13:55, HAUTREUX Matthieu wrote:
Dear Matteo,

are you using the task/affinity (or task/cgroup) plugin on your system? The only way to ensure that your jobs have exclusive access to their allocated resources is to use one of them. Indeed, select/cons_res reserves a set of cores for each of your jobs but does not guarantee that each job will only be able to use its associated set of cores. That is the role of task/affinity or task/cgroup (option ConstrainCores=yes in cgroup.conf).

In your current scenario, if you are not using such a plugin, it is possible that, due to memory-access optimizations in the OpenMP library, applications started on a particular socket try to stay on that socket. As a result, if more than 4 applications primarily start on the same socket, you will get bad performance due to thread congestion.

My 2 cents,
Matthieu

Matteo Guglielmi wrote:

Dear Community,

I'm facing a problem when I submit a series of (OpenMP) jobs using a simple for loop. Our (fat) nodes have 4 sockets hosting 4 AMD 6176 SE CPUs (12 cores per CPU). The relevant part of the jobfile is outlined below:

### jobfile ###
#SBATCH -n 4
#SBATCH -N 1
export OMP_NUM_THREADS=4
mpc --L=32 --out=./data --dt=0.05 ...etc
###############

The way I submit a series of 12 jobs is:

for i in {0..11}; do sbatch jobfile; done

Slurm is configured as follows:

SelectType=select/cons_res

As you can see, I basically reserve 4 cores per job; each mpc job will start 4 threads. Now, if I submit the 12 jobs "by hand", so to speak, I get what I expect, namely 12 jobs running at 400%... perfect. But if I submit the 12 jobs via a for loop as outlined above, I always get 2 or 3 jobs out of 12 running at 300%. To me it seems like a race condition which ultimately leads to more than one thread being "assigned" to the very same core.

Question) Can this be possible? How can I avoid it?

Of course, inserting a "sleep 0.5" into the for loop does fix the problem... but I'm still worried about what will happen when hundreds of users are submitting jobs at the same time. I'm still testing Slurm and I'd like to make sure that this problem will not occur when I set it as the default batch system.

Thanks,

--matt
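A minimal configuration sketch along the lines Matthieu describes, assuming core-level scheduling is wanted (only SelectType=select/cons_res and ConstrainCores=yes appear in the thread; SelectTypeParameters=CR_Core is an assumption about the intended granularity):

### slurm.conf (sketch) ###
SelectType=select/cons_res
SelectTypeParameters=CR_Core   # assumption: allocate at core granularity
TaskPlugin=task/cgroup         # confine each job to its allocated cores
###########################

### cgroup.conf (sketch) ###
ConstrainCores=yes             # enforce the allocated core set via cgroups
############################

With this in place, a job can no longer spill its threads onto cores reserved for another job, regardless of submission timing.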

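For completeness, a sketch of the workaround Matteo mentions, plus one way to check which cores a job actually received (taskset from util-linux is an assumption about what is installed on the compute nodes):

### submit-with-delay (sketch) ###
#!/bin/bash
# Stagger submissions slightly, as in the "sleep 0.5" workaround above.
for i in {0..11}; do
    sbatch jobfile
    sleep 0.5
done
##################################

### inside jobfile (sketch) ###
# Print the CPU list this batch step is bound to:
taskset -cp $$
###############################

With task/affinity or task/cgroup enabled, each of the 12 jobs should report 4 distinct cores even without the sleep; overlapping CPU lists would point back at the congestion Matthieu describes.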