Oh! Thanks, my mistake.
So it seems SGE measures the memory correctly: it reports the same
values that we see when we launch things directly on the nodes.
However, these values are still surprisingly high.
We'll investigate further whether something is wrong with our kernel.
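
For anyone who wants to compare the two numbers from inside a job, here
is a minimal sketch, assuming a Linux node with /proc mounted. It reads
the virtual size (VmSize, what "top" shows as VIRT and what SGE reports
as vmem) and the resident size (VmRSS, what "top" shows as RES) of the
current process. The helper names parse_status and read_status are my
own and are not part of SGE or of any library.

```python
import os

# Fields of /proc/<pid>/status that matter here (values are in kB):
#   VmPeak / VmSize : peak / current virtual memory size
#   VmHWM  / VmRSS  : peak / current resident set size
FIELDS = ("VmPeak", "VmSize", "VmHWM", "VmRSS")


def parse_status(text):
    """Extract the selected memory fields (in kB) from status-file text."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in FIELDS:
            # Lines look like "VmSize:    117188 kB"; keep the number only.
            values[key] = int(rest.split()[0])
    return values


def read_status(pid="self"):
    """Read the memory fields of a process (hypothetical helper, Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status(f.read())


if __name__ == "__main__":
    if os.path.exists("/proc/self/status"):
        for name, kb in read_status().items():
            print(f"{name}: {kb / 1024:.1f} MB")
    else:
        print("/proc is not available; this sketch is Linux-only")
```

Calling read_status() at the end of minimal.py, both directly on a node
and through qsub, should show whether the virtual size is already high
outside of SGE.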

Thanks,

Jérémie


2012/9/25 Reuti <[email protected]>:
> Am 25.09.2012 um 14:26 schrieb Jérémie Dubois-Lacoste:
>
>> Hi All,
>>
>> We recently reinstalled our cluster and we have some serious issues.
>> Contrary to our previous installation, we have now installed a fully 64-bit
>> system. We use Rocks Cluster 6 / CentOS 6.3,
>> and SGE 6.2u5.
>>
>> The memory values reported by SGE are very high compared
>> to the actual needs of each job, and many jobs get killed because
>> they exceed the limit, although they should not.
>> I found this thread about too low memory reports:
>> http://comments.gmane.org/gmane.comp.clustering.gridengine.users/19303
>>
>> But I didn't find anything about too high memory reports...
>>
>>
>> Here is a simple test to make it clear:
>>
>> I submit a very stupid Python script, "minimal.py", which is just:
>> -----
>> import time
>>
>> time.sleep(30)
>> print("done")
>> -----
>>
>> * I tried to run it directly to check the memory consumption with:
>> $ /usr/bin/time -v python minimal.py
>> And I get: Maximum resident set size (kbytes): 15376
>>
>>
>> * Then, when submitting the jobs with:
>> qsub -m ase -M <my_mail> -b y -N memTest -o test.out -e test.err -cwd
>> "python minimal.py"
>> I check on the compute node where it gets scheduled, using "top":
>>   PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> 20240 myName   23   3  114m 3844 1832 S  0.0  0.0   0:00.14 python minimal.py
>
> The virtual size is listed here as 114m as well.
>
> -- Reuti
>
>
>> So I understand it uses 3.8 MB of RAM.
>>
>>
>> * But from the e-mail I get when the job terminates:
>> Job 1879536 (memTest) Complete
>> User = myName
>> Queue = [email protected]
>> Host = compute-3-0.local
>> Start Time = 09/25/2012 13:46:45
>> End Time = 09/25/2012 13:47:15
>> User Time = 00:00:00
>> System Time = 00:00:00
>> Wallclock Time = 00:00:30
>> CPU = 00:00:00
>> Max vmem = 114.441M
>> Exit Status = 0
>>
>>
>> It says 114 MB; I don't understand this huge difference.
>>
>>
>> The consequence is that most of the jobs get killed for (spuriously, I presume)
>> exceeding the hard memory limit. Any clue is welcome!
>>
>>
>> Sincerely,
>>
>>    Jérémie
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>>
>
