Thank you for the details. I have never used such an AMD machine, but the
topology seems quite complex. I guess that a socket corresponds to 2
consecutive memory nodes and the associated CPUs. (If not, the following
is wrong...) The bad-performance lists seem to correspond to a cyclic
distribution across sockets.

You should try to add verbosity in the OpenMP layer to get information
on how it dispatches the load across the sockets. You should compare
the bad lists with the good ones to find the difference, and try a
block distribution to see if it works better.
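
As a sketch of the kind of verbosity I mean (the exact variable names
depend on which OpenMP runtime the application was built with, so treat
these as assumptions to check against your compiler's documentation):

```shell
# Hypothetical sketch: ask common OpenMP runtimes to report their placement.
export OMP_NUM_THREADS=4

# Intel runtime: print a verbose thread-to-core binding report at startup.
export KMP_AFFINITY=verbose

# GNU libgomp alternative: pin threads to an explicit CPU list (block style).
# export GOMP_CPU_AFFINITY="0-3"

# Same check as in the jobfile: which CPUs is this process confined to?
grep Cpus_allowed_list /proc/self/status
```

Comparing this output between a 400% job and a 300% job should show
whether threads end up overlapping on the same cores.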

SLURM seems to be doing its job coherently with what you asked for: a
cyclic distribution, 4 cores per job. The problem seems to be in the
underlying layer. You should try the cpuset logic of SLURM. On some
systems, we have seen that the OpenMP layer is sometimes able to modify
affinities set by the sched_setaffinity call but not affinities managed
by cpusets; running with cpuset confinement thus gives better
performance than the basic set-affinity logic, as it avoids some
overlapping of OpenMP threads.
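
For reference, the cpuset confinement I am referring to can be enabled
either through the task/affinity plugin's Cpusets parameter or through
the task/cgroup plugin, both of which appear earlier in this thread;
check the slurm.conf and cgroup.conf man pages of your SLURM version
before applying:

```
# slurm.conf -- option A: affinity plugin with cpuset-based confinement
TaskPlugin=task/affinity
TaskPluginParam=Cpusets,Cores

# slurm.conf -- option B: cgroup plugin
#TaskPlugin=task/cgroup

# cgroup.conf (only for option B): confine jobs to their allocated cores
#ConstrainCores=yes
```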

HTH
Matthieu


2011/10/15 Matteo Guglielmi <[email protected]>:
> This is the repartition of the cores when the user has
> bad performance on a node (-N 1 -n 4):
>
> out4test.foff09-8864.txt:Cpus_allowed_list: 0,12,24,36
> out4test.foff09-8865.txt:Cpus_allowed_list: 6,18,30,42
> out4test.foff09-8866.txt:Cpus_allowed_list: 1,13,25,37
> out4test.foff09-8867.txt:Cpus_allowed_list: 7,19,31,43
> out4test.foff09-8868.txt:Cpus_allowed_list: 2,14,26,38
> out4test.foff09-8869.txt:Cpus_allowed_list: 8,20,32,44
> out4test.foff09-8870.txt:Cpus_allowed_list: 3,15,27,39
> out4test.foff09-8871.txt:Cpus_allowed_list: 9,21,33,45
> out4test.foff09-8872.txt:Cpus_allowed_list: 4,16,28,40
> out4test.foff09-8873.txt:Cpus_allowed_list: 10,22,34,46
> out4test.foff09-8874.txt:Cpus_allowed_list: 5,17,29,41
> out4test.foff09-8875.txt:Cpus_allowed_list: 11,23,35,47
>
> And this is the output of "numactl --hardware" on the
> same node:
>
> available: 8 nodes (0-7)
> node 0 cpus: 0 1 2 3 4 5
> node 0 size: 16349 MB
> node 0 free: 15854 MB
> node 1 cpus: 6 7 8 9 10 11
> node 1 size: 16384 MB
> node 1 free: 16036 MB
> node 2 cpus: 12 13 14 15 16 17
> node 2 size: 16384 MB
> node 2 free: 16075 MB
> node 3 cpus: 18 19 20 21 22 23
> node 3 size: 16384 MB
> node 3 free: 16111 MB
> node 4 cpus: 24 25 26 27 28 29
> node 4 size: 16384 MB
> node 4 free: 16089 MB
> node 5 cpus: 30 31 32 33 34 35
> node 5 size: 16384 MB
> node 5 free: 16112 MB
> node 6 cpus: 36 37 38 39 40 41
> node 6 size: 16384 MB
> node 6 free: 16072 MB
> node 7 cpus: 42 43 44 45 46 47
> node 7 size: 16384 MB
> node 7 free: 16114 MB
> node distances:
> node   0   1   2   3   4   5   6   7
>  0:  10  16  16  22  16  22  16  22
>  1:  16  10  22  16  22  16  22  16
>  2:  16  22  10  16  16  22  16  22
>  3:  22  16  16  10  22  16  22  16
>  4:  16  22  16  22  10  16  16  22
>  5:  22  16  22  16  16  10  22  16
>  6:  16  22  16  22  16  22  10  16
>  7:  22  16  22  16  22  16  16  10
>
>
> On 10/14/11 17:50, Matthieu Hautreux wrote:
>>
>> I think that using "-n 1 -c 4" is better in your case.
>>
>> Concerning the strange behavior, you should take a look at the
>> non-overlapping lists to see what the repartition of the cores is when
>> you have bad performance.
>> If you can send me the different CPU-id lists for your different
>> jobs as well as the physical mapping of your node, it would be easier
>> to understand the dispatch made by SLURM and see if something can be
>> explained by it. The physical mapping can be obtained using
>> "numactl --hardware":
>>
>> [hautreuxm@leaf ~]$ numactl --hardware
>> available: 4 nodes (0-3)
>> node 0 cpus: 0 4 8 12 16 20 24 28
>> node 0 size: 32748 MB
>> node 0 free: 30922 MB
>> node 1 cpus: 1 5 9 13 17 21 25 29
>> node 1 size: 32768 MB
>> node 1 free: 30642 MB
>> node 2 cpus: 2 6 10 14 18 22 26 30
>> node 2 size: 32768 MB
>> node 2 free: 30839 MB
>> node 3 cpus: 3 7 11 15 19 23 27 31
>> node 3 size: 32766 MB
>> node 3 free: 31363 MB
>> node distances:
>> node   0   1   2   3
>>   0:  10  15  15  15
>>   1:  15  10  15  15
>>   2:  15  15  10  15
>>   3:  15  15  15  10
>> [hautreuxm@leaf ~]$
>>
>>
>> The CR_CORE_DEFAULT_DIST_BLOCK option is interesting as it ensures that
>> cores are allocated socket by socket rather than in a round-robin
>> manner across the available sockets.
>> It could be better for you to have this option set if your
>> applications are not memory bound.
>>
>>
>> Matthieu
>>
>> 2011/10/14 Matteo Guglielmi<[email protected]>:
>>>
>>> Ok, I don't have all those extra parameters set as you do, but here
>>> is the thing:
>>>
>>> for loop (+) #SBATCH -N 1   (+) #SBATCH -n 4
>>>
>>> does produce non-overlapping lists, but some jobs were nonetheless
>>> still running at <= 300% CPU utilization
>>>
>>> for loop (+) #SBATCH -N 1-1 (+) #SBATCH -n 1 (+) #SBATCH -c 4
>>>
>>> does still produce non-overlapping lists + all the jobs do run
>>> at 400%.
>>>
>>> Was it a wrong jobfile then?
>>>
>>> should I also replicate your config parameters into my slurm.conf?
>>>
>>>
>>> On 10/14/11 15:19, HAUTREUX Matthieu wrote:
>>>>
>>>> Our conf is like :
>>>>
>>>> SelectType=select/cons_res
>>>>
>>>>
>>>> SelectTypeParameters=CR_Core_Memory,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE
>>>>
>>>> TaskPlugin=task/affinity
>>>> TaskPluginParam=Cpusets,Cores
>>>>
>>>> You should be able to read the Cpus_allowed_list value as soon as your
>>>> jobs are started and see if it contains a coherent value (a list of 4
>>>> integers per job).
>>>>
>>>> Matthieu
>>>>
>>>> Matteo Guglielmi a écrit :
>>>>>
>>>>> I believe so:
>>>>>
>>>>> SelectType=select/cons_res
>>>>> SelectTypeParameters=CR_Core_Memory
>>>>>
>>>>> Running the fast loop tests now...
>>>>>
>>>>> On 10/14/11 14:38, HAUTREUX Matthieu wrote:
>>>>>>
>>>>>> Have you configured task/affinity to do a core binding by default ?
>>>>>>
>>>>>> Can you try a modified version of your script like the following and
>>>>>> give me the output for each of your jobs:
>>>>>>
>>>>>> ### jobfile ###
>>>>>> #SBATCH -n 4
>>>>>> #SBATCH -N 1
>>>>>>
>>>>>> export OMP_NUM_THREADS=4
>>>>>>
>>>>>> cat /proc/self/status | grep Cpus_allowed_list
>>>>>> mpc --L=32 --out=./data --dt=0.05 ...etc
>>>>>> ###############
>>>>>>
>>>>>> You should have only 4 cores associated with each job, and each list
>>>>>> of cores should be different. If you have not configured the default
>>>>>> binding, you will certainly see the same complete list of cores
>>>>>> available to each job.
>>>>>>
>>>>>> Matthieu
>>>>>>
>>>>>> Matteo Guglielmi a écrit :
>>>>>>>
>>>>>>> Let's say you got a full dollar!
>>>>>>>
>>>>>>> Yes, I'm using task/affinity and not task/cgroup....
>>>>>>>
>>>>>>> Should I use task/cgroup then?
>>>>>>>
>>>>>>> On 10/14/11 13:55, HAUTREUX Matthieu wrote:
>>>>>>>>
>>>>>>>> Dear Matteo,
>>>>>>>>
>>>>>>>> Are you using the task/affinity (or task/cgroup) plugin on your
>>>>>>>> system?
>>>>>>>>
>>>>>>>> The only way to ensure that your jobs have exclusive access to their
>>>>>>>> allocated resources is to do that. Indeed, select/cons_res reserves
>>>>>>>> a part of the cores for each of your jobs but does not guarantee
>>>>>>>> that each job will only be able to use the associated set of cores.
>>>>>>>> This is the role of task/affinity or task/cgroup (option
>>>>>>>> ConstrainCores=yes in cgroup.conf). In your current scenario, if you
>>>>>>>> are not currently using such a plugin, it could be possible that,
>>>>>>>> due to memory access optimization in the OpenMP library,
>>>>>>>> applications started on a particular socket try to stay on that
>>>>>>>> socket. As a result, if more than 4 applications primarily start on
>>>>>>>> the same socket, you will get bad performance due to thread
>>>>>>>> congestion.
>>>>>>>>
>>>>>>>> My 2 cents,
>>>>>>>> Matthieu
>>>>>>>>
>>>>>>>>
>>>>>>>> Matteo Guglielmi a écrit :
>>>>>>>>>
>>>>>>>>> Dear Community,
>>>>>>>>>
>>>>>>>>> I'm facing a problem when I submit a series
>>>>>>>>> of (openmp) jobs using a simple for loop.
>>>>>>>>>
>>>>>>>>> Our (fat) nodes have 4 sockets hosting 4
>>>>>>>>> AMD 6176 SE CPUs (12 cores per CPU).
>>>>>>>>>
>>>>>>>>> The relevant part of the jobfile is outlined
>>>>>>>>> here below:
>>>>>>>>>
>>>>>>>>> ### jobfile ###
>>>>>>>>> #SBATCH -n 4
>>>>>>>>> #SBATCH -N 1
>>>>>>>>>
>>>>>>>>> export OMP_NUM_THREADS=4
>>>>>>>>>
>>>>>>>>> mpc --L=32 --out=./data --dt=0.05 ...etc
>>>>>>>>> ###############
>>>>>>>>>
>>>>>>>>> The way I submit a series of 12 jobs is:
>>>>>>>>>
>>>>>>>>> for i in {0..11}; do sbatch jobfile; done
>>>>>>>>>
>>>>>>>>> Slurm is configured as follows:
>>>>>>>>>
>>>>>>>>> SelectType=select/cons_res
>>>>>>>>>
>>>>>>>>> As you can see I basically reserve 4 cores
>>>>>>>>> per job; each mpc job will start 4 threads.
>>>>>>>>>
>>>>>>>>> Now, if I submit the 12 jobs "by hand",
>>>>>>>>> so to speak, I get what I expect to get,
>>>>>>>>> namely 12 jobs running at 400%... perfect.
>>>>>>>>>
>>>>>>>>> But if I submit the 12 jobs via a for loop
>>>>>>>>> as outlined above, I always get 2 or 3 jobs
>>>>>>>>> out of 12 running at 300%.
>>>>>>>>>
>>>>>>>>> To me it looks like a race condition which
>>>>>>>>> ultimately leads to more than one thread
>>>>>>>>> being "assigned" to the very same core.
>>>>>>>>>
>>>>>>>>> Question)
>>>>>>>>>
>>>>>>>>> Can this be possible?
>>>>>>>>>
>>>>>>>>> How to avoid it?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Of course, inserting a "sleep 0.5" into the
>>>>>>>>> for loop does fix the problem... but I'm
>>>>>>>>> still worried about what will happen when
>>>>>>>>> hundreds of users submit jobs at the
>>>>>>>>> same time.
>>>>>>>>>
>>>>>>>>> I'm still testing SLURM and I'd like to make
>>>>>>>>> sure that this problem will not occur once
>>>>>>>>> I set it as the default batch system.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> --matt
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
