This is how the cores are distributed when the user gets
bad performance on a node (-N 1 -n 4):

out4test.foff09-8864.txt:Cpus_allowed_list: 0,12,24,36
out4test.foff09-8865.txt:Cpus_allowed_list: 6,18,30,42
out4test.foff09-8866.txt:Cpus_allowed_list: 1,13,25,37
out4test.foff09-8867.txt:Cpus_allowed_list: 7,19,31,43
out4test.foff09-8868.txt:Cpus_allowed_list: 2,14,26,38
out4test.foff09-8869.txt:Cpus_allowed_list: 8,20,32,44
out4test.foff09-8870.txt:Cpus_allowed_list: 3,15,27,39
out4test.foff09-8871.txt:Cpus_allowed_list: 9,21,33,45
out4test.foff09-8872.txt:Cpus_allowed_list: 4,16,28,40
out4test.foff09-8873.txt:Cpus_allowed_list: 10,22,34,46
out4test.foff09-8874.txt:Cpus_allowed_list: 5,17,29,41
out4test.foff09-8875.txt:Cpus_allowed_list: 11,23,35,47
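
(The listing above is simply the result of grepping the per-job output
files, e.g. "grep Cpus_allowed_list out4test.foff09-*.txt".)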

And this is the output of "numactl --hardware" on the
same node:

available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 16349 MB
node 0 free: 15854 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 16384 MB
node 1 free: 16036 MB
node 2 cpus: 12 13 14 15 16 17
node 2 size: 16384 MB
node 2 free: 16075 MB
node 3 cpus: 18 19 20 21 22 23
node 3 size: 16384 MB
node 3 free: 16111 MB
node 4 cpus: 24 25 26 27 28 29
node 4 size: 16384 MB
node 4 free: 16089 MB
node 5 cpus: 30 31 32 33 34 35
node 5 size: 16384 MB
node 5 free: 16112 MB
node 6 cpus: 36 37 38 39 40 41
node 6 size: 16384 MB
node 6 free: 16072 MB
node 7 cpus: 42 43 44 45 46 47
node 7 size: 16384 MB
node 7 free: 16114 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  22  16  22  16  22  16
  2:  16  22  10  16  16  22  16  22
  3:  22  16  16  10  22  16  22  16
  4:  16  22  16  22  10  16  16  22
  5:  22  16  22  16  16  10  22  16
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  22  16  16  10
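
If it helps, each CPU id can be mapped to its NUMA node with something
like the following (just a convenience sketch; it assumes util-linux's
lscpu is available on the node):

  # print the "CPU,NODE" pair for the four CPUs of the first job above
  for c in 0 12 24 36; do lscpu -p=CPU,NODE | grep "^$c,"; done

For that first job the four cores land on four different NUMA nodes
(0, 2, 4 and 6), so each job is spread across the memory nodes rather
than packed on a single one.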


On 10/14/11 17:50, Matthieu Hautreux wrote:
I think that using "-n 1 -c 4" is better in your case.
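
In sbatch terms that means a header along these lines, so that SLURM
allocates one task with 4 CPUs rather than 4 single-CPU tasks:

  #SBATCH -N 1
  #SBATCH -n 1
  #SBATCH -c 4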

Concerning the strange behavior, you should take a look at the
non-overlapping lists to see how the cores are distributed when
you have bad performance.
If you can send me the CPU-id lists for your different jobs as
well as the physical mapping of your node, it would be easier to
understand the dispatch made by SLURM and see if something can be
explained by it. The physical layout can be obtained using
"numactl --hardware":

[hautreuxm@leaf ~]$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 4 8 12 16 20 24 28
node 0 size: 32748 MB
node 0 free: 30922 MB
node 1 cpus: 1 5 9 13 17 21 25 29
node 1 size: 32768 MB
node 1 free: 30642 MB
node 2 cpus: 2 6 10 14 18 22 26 30
node 2 size: 32768 MB
node 2 free: 30839 MB
node 3 cpus: 3 7 11 15 19 23 27 31
node 3 size: 32766 MB
node 3 free: 31363 MB
node distances:
node   0   1   2   3
   0:  10  15  15  15
   1:  15  10  15  15
   2:  15  15  10  15
   3:  15  15  15  10
[hautreuxm@leaf ~]$


CR_CORE_DEFAULT_DIST_BLOCK is interesting as it ensures that cores
are allocated socket by socket, rather than in a round-robin manner
across the available sockets.
It could be better for you to have this option set if your
applications are not memory bound.
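
For illustration, on a conf like the one quoted below this would mean
adding the flag to SelectTypeParameters in slurm.conf, roughly (sketch
only, keeping whatever other flags are already set):

  SelectType=select/cons_res
  SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK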


Matthieu

2011/10/14 Matteo Guglielmi <matteo.guglie...@epfl.ch>:
Ok, I don't have all those extra parameters set as you do, but
here is the thing:

for loop (+) #SBATCH -N 1   (+) #SBATCH -n 4

does produce non-overlapping lists, but some jobs were nonetheless
still running at <= 300% CPU utilization

for loop (+) #SBATCH -N 1-1 (+) #SBATCH -n 1 (+) #SBATCH -c 4

does still produce non-overlapping lists + all the jobs do run
at 400%.

Was my jobfile wrong, then?

Should I also replicate your config parameters in my slurm.conf?


On 10/14/11 15:19, HAUTREUX Matthieu wrote:

Our conf looks like this:

SelectType=select/cons_res

SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE

TaskPlugin=task/affinity
TaskPluginParam=Cpusets,Cores

You should be able to read the Cpus_allowed_list value as soon as
your jobs are started and see whether it contains a coherent value
(a list of 4 integers per job).
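
As a side note, the binding of an already running job can also be
checked directly on the compute node (sketch, assuming util-linux's
taskset is installed; <pid> stands for the pid of the mpc process):

  taskset -pc <pid>

which reports the same CPU list as the Cpus_allowed_list line of
/proc/<pid>/status.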

Matthieu

Matteo Guglielmi wrote:

I believe so:

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

Running the fast loop tests now...

On 10/14/11 14:38, HAUTREUX Matthieu wrote:

Have you configured task/affinity to do core binding by default?

Can you try a modified version of your script like the following
and give me the output for each of your jobs:

### jobfile ###
#SBATCH -n 4
#SBATCH -N 1

export OMP_NUM_THREADS=4

cat /proc/self/status | grep Cpus_allowed_list
mpc --L=32 --out=./data --dt=0.05 ...etc
###############

You should have only 4 cores associated with each job, and each list
of cores should be different. If you have not configured the default
binding, you will certainly see the same complete list of cores
available to each job.
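
Once the jobs have started, a quick way to check for overlaps across
all of them at once is something like this (just a sketch; it assumes
the default slurm-<jobid>.out output names and comma-separated lists,
so ranges like 0-3 would need expanding first):

  grep -h Cpus_allowed_list slurm-*.out | sed 's/.*:[[:space:]]*//' \
    | tr ',' '\n' | sort -n | uniq -d

If the binding is correct this prints nothing; any CPU id it does print
is shared by at least two jobs.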

Matthieu

Matteo Guglielmi wrote:

Let's say you got a full dollar!

Yes, I'm using task/affinity and not task/cgroup....

Should I use task/cgroup then?

On 10/14/11 13:55, HAUTREUX Matthieu wrote:

Dear Matteo,

Are you using the task/affinity (or task/cgroup) plugin on your
system?

Using one of them is the only way to ensure that your jobs have
exclusive access to their allocated resources. Indeed, select/cons_res
reserves a set of cores for each of your jobs but does not guarantee
that each job will only be able to use its associated set of cores.
That is the role of task/affinity or task/cgroup (option
ConstrainCores=yes in cgroup.conf). In your current scenario, if you
are not currently using such a plugin, it is possible that, due to
memory access optimizations in the OpenMP library, applications
started on a particular socket try to stay on that socket. As a
result, if more than 4 applications primarily start on the same
socket, you will get bad performance due to thread congestion.
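
For reference, the cgroup-based variant would look roughly like this
(just a sketch; the exact settings depend on your SLURM version):

  # slurm.conf
  TaskPlugin=task/cgroup

  # cgroup.conf
  ConstrainCores=yes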

My 2 cents,
Matthieu


Matteo Guglielmi wrote:

Dear Community,

I'm facing a problem when I submit a series
of (OpenMP) jobs using a simple for loop.

Our (fat) nodes have 4 sockets which host 4
AMD 6176 SE CPUs (12 cores per CPU).

The relevant part of the jobfile is outlined
here below:

### jobfile ###
#SBATCH -n 4
#SBATCH -N 1

export OMP_NUM_THREADS=4

mpc --L=32 --out=./data --dt=0.05 ...etc
###############

The way I submit a series of 12 jobs is:

for i in {0..11}; do sbatch jobfile; done

SLURM is configured as follows:

SelectType=select/cons_res

As you can see I basically reserve 4 cores
per job; each mpc job will start 4 threads.

Now, if I submit the 12 jobs "by hand", so
to speak, I get what I expect, namely
12 jobs running at 400%... perfect.

But if I submit the 12 jobs via a for loop
as outlined above, I always get 2 or 3 jobs
out of 12 running at 300%.

To me it looks like a race condition which
ultimately leads to more than one thread
being "assigned" to the very same core.

Questions:

Can this actually happen?

How can I avoid it?


Of course, inserting a "sleep 0.5" into the
for loop does fix the problem... but I'm
still worried about what will happen when
hundreds of users are submitting jobs
at the same time.
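
(i.e. something like

  for i in {0..11}; do sbatch jobfile; sleep 0.5; done

which is obviously a workaround rather than a fix.)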

I'm still testing SLURM and I'd like to make
sure that this problem will not occur once
I make it the default batch system.

Thanks,

--matt




Reply via email to