Hi All,

We recently reinstalled our cluster and are now having serious issues.
Unlike our previous installation, the new one is a fully 64-bit
system. We use Rocks Cluster 6 / CentOS 6.3
and SGE 6.2u5.

The memory values reported by SGE are very high compared
to what each job actually needs, and many jobs get killed for
exceeding the limit when they should not be.
I found this thread about memory reports that are too low:
http://comments.gmane.org/gmane.comp.clustering.gridengine.users/19303

But I didn't find anything about memory reports that are too high.


Here is a simple test to make it clear:

I submit a trivial Python script, "minimal.py", which is just:
-----
import time

time.sleep(30)
print("done")
-----

* I ran it directly to check the memory consumption with:
$ /usr/bin/time -v python minimal.py
And I get: Maximum resident set size (kbytes): 15376


* Then, when submitting the job with:
qsub -m ase -M <my_mail> -b y -N memTest -o test.out -e test.err -cwd
"python minimal.py"
I go to the compute node where it gets scheduled and run "top":
 PID  USER     PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20240 myName   23   3  114m 3844 1832 S  0.0  0.0   0:00.14 python minimal.py

So I understand it uses 3.8 MB of RAM.
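
The two resident figures above differ by exactly a factor of four (3844 x 4 = 15376). That would be consistent with the known unit bug in GNU time 1.7, which overstates the maximum resident set size fourfold on Linux, but whether that version is installed here is an assumption. As a cross-check that does not depend on /usr/bin/time, here is a minimal sketch of a variant of minimal.py that reports its own peak RSS through the standard resource module. The helper name peak_rss_kib is made up for this example, and the sleep is shortened to one second.

```python
import resource
import sys
import time


def peak_rss_kib():
    """Return this process's peak resident set size in KiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in KiB, macOS reports it in bytes.
    if sys.platform == "darwin":
        return peak // 1024
    return peak


if __name__ == "__main__":
    time.sleep(1)  # stand-in for the 30 s sleep in minimal.py
    print("done, peak RSS: %d KiB" % peak_rss_kib())
```

If the number printed here is close to the RES column of top, the /usr/bin/time figure is probably the one that is off.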


* But in the e-mail I get when the job terminates:
Job 1879536 (memTest) Complete
User = myName
Queue = [email protected]
Host = compute-3-0.local
Start Time = 09/25/2012 13:46:45
End Time = 09/25/2012 13:47:15
User Time = 00:00:00
System Time = 00:00:00
Wallclock Time = 00:00:30
CPU = 00:00:00
Max vmem = 114.441M
Exit Status = 0


It says 114 MB; I don't understand this huge difference.
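
Note that the "Max vmem" value in the e-mail (114.441M) is close to the VIRT column of top (114m), not to RES, so SGE appears to be reporting virtual size rather than resident memory. To compare the two from inside a job, here is a minimal sketch, assuming a Linux node where /proc/self/status is available. The helper name parse_status is made up for this example.

```python
import os


def parse_status(text):
    """Extract VmPeak, VmSize, VmHWM and VmRSS (in kB) from /proc/<pid>/status text."""
    wanted = ("VmPeak", "VmSize", "VmHWM", "VmRSS")
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # Lines look like "VmRSS:      3844 kB"
            values[key] = int(rest.split()[0])
    return values


if __name__ == "__main__":
    status_path = "/proc/self/status"  # Linux only
    if os.path.exists(status_path):
        with open(status_path) as fh:
            for key, kb in sorted(parse_status(fh.read()).items()):
                print("%-7s %8d kB" % (key, kb))
```

VmSize and VmPeak are the current and peak virtual sizes, which correspond to VIRT in top. VmRSS and VmHWM are the current and peak resident sizes, which correspond to RES.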


The consequence is that most of the jobs get killed for exceeding the
hard memory limit, spuriously I presume. Any clue is welcome!


Sincerely,

    Jérémie

