I think we've seen this problem as well. When using ThreadsPerCore along 
with DefMemPerCpu or --mem-per-cpu, the worker node doesn't set the 
memory limit correctly. Setting --mem instead should work OK.
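
For what it's worth, the number in Paul's log is consistent with the limit 
being computed per core rather than per thread. Here is a rough sketch of 
that arithmetic in Python, using the node definition quoted below. The 
helper name is made up for illustration; this is not Slurm's actual code, 
and the "counts cores, not threads" explanation is only a guess.

```python
# Illustrative arithmetic only; job_mem_limit_kb is a hypothetical helper,
# not a Slurm function.
KB_PER_GB = 1024 * 1024

def job_mem_limit_kb(mem_per_cpu_gb, cpus_counted):
    """Memory limit in KB if the limit is mem-per-cpu times the CPU count."""
    return mem_per_cpu_gb * cpus_counted * KB_PER_GB

# Values from the quoted node line: Sockets=4 CoresPerSocket=8 ThreadsPerCore=2
sockets, cores_per_socket, threads_per_core = 4, 8, 2
cores = sockets * cores_per_socket      # physical cores
threads = cores * threads_per_core      # matches CPUs=64 in slurm.conf

# What the user asked for: 8 GB per CPU on all 64 CPUs.
expected_kb = job_mem_limit_kb(8, threads)

# What you get if only cores are counted (assumed cause, unconfirmed).
observed_kb = job_mem_limit_kb(8, cores)

print(expected_kb, observed_kb)
```

The second figure is the 268435456 KB that slurmd reported, i.e. 256 GB 
instead of the 512 GB requested.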

I have a lightly tested patch that appears to address the issue, but 
I've only tried it on a single-node development system, and not yet in 
production. It's attached to the following Bugzilla ticket: 
http://bugs.schedmd.com/show_bug.cgi?id=309

Regards,
John

On 05/23/2013 01:00 PM, Paul Edmon wrote:
> I use 2.5.6.
>
> -Paul Edmon-
>
> On 05/23/2013 01:58 PM, S. Aravindan wrote:
>> I guess so. The slurm version I use is 2.5.4. I have attached my
>> slurm.conf with this mail.
>>
>> --Semparithi
>>
>> +++ On 10:34 23 May Paul Edmon wrote:
>>> Hmm, maybe it's the ThreadsPerCore?  Perhaps it thinks there are half as
>>> many cores as there really are due to the ThreadsPerCore. Thus if you use
>>> --mem-per-cpu it will only give you half, as it only counts cores,
>>> not threads*cores?
>>>
>>> -Paul Edmon-
>>>
>>> On 05/23/2013 01:31 PM, S. Aravindan wrote:
>>>> I was about to post a similar query. A Gaussian 09 job is killed when its
>>>> memory consumption exceeds half the amount of memory available on a node
>>>> when --mem-per-cpu is used, but the job runs when --mem is used.  The
>>>> relevant lines from slurm.conf are below.
>>>>
>>>> NodeName=node[01-15] RealMemory=48228 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 CPUs=24 State=UNKNOWN TmpDisk=1850000
>>>> NodeName=node[16-30] RealMemory=96705 Sockets=2 CoresPerSocket=6 ThreadsPerCore=2 CPUs=24 State=UNKNOWN TmpDisk=1850000 Feature=96g
>>>>
>>>> Any suggestion is welcome.
>>>>
>>>> --Semparithi
>>>>
>>>>
>>>> +++ On 09:41 23 May Paul Edmon wrote:
>>>>> I have a user who is running a problem that uses 512 GB of memory. She
>>>>> requests this from SLURM on a node which has this much.  However, her code
>>>>> dies:
>>>>>
>>>>> slurmd[holy2b09101]: error: Job 6497 exceeded 268435456 KB memory limit, being killed
>>>>> slurmd[holy2b09101]: error: Exceeded job memory limit
>>>>> slurmd[holy2b09101]: error: *** JOB 6497 CANCELLED AT 2013-05-23T00:53:31 ***
>>>>>
>>>>> This is half of the 512 GB which was requested.  Is there something I am 
>>>>> missing?  The nodes in question have:
>>>>>
>>>>> NodeName=DEFAULT CPUs=64 RealMemory=529247 Sockets=4 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
>>>>>
>>>>> These are AMD Abu Dhabi processors with 8 GB per core, so 512 GB total.  
>>>>> She is requesting 8 GB per cpu and is asking for 64 cores.  Thoughts?
>>>>>
>>>>> -Paul Edmon-
>>>> -- Semparithi Aravindan
