Oh! Thanks, my mistake. So it seems SGE is measuring memory correctly: it reports the same values we see when we launch things directly on the nodes. However, these values are still surprisingly high. We'll investigate further whether something is wrong with our kernel.
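
For reference, here is a minimal sketch of how the virtual and resident figures could be compared from inside a job. It assumes a Linux node where /proc/<pid>/status is readable. The helper names parse_status_kb and read_status_kb are made up for illustration. VmPeak is the peak virtual size, which is what "Max vmem" appears to track given Reuti's remark about the 114m VIRT column, and VmHWM is the peak resident size.

```python
import re


def parse_status_kb(text):
    """Return {field: value_in_kB} for the Vm* lines of a /proc/<pid>/status dump."""
    values = {}
    for line in text.splitlines():
        # Lines look like "VmPeak:\t  117188 kB"
        match = re.match(r"^(Vm\w+):\s+(\d+)\s+kB", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def read_status_kb(pid="self"):
    """Read and parse the status file of a process (hypothetical helper, Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return parse_status_kb(handle.read())


if __name__ == "__main__":
    mem = read_status_kb()
    # VmPeak: peak virtual memory size; VmHWM: peak resident set size
    print("peak virtual  (VmPeak): %d kB" % mem.get("VmPeak", 0))
    print("peak resident (VmHWM):  %d kB" % mem.get("VmHWM", 0))
```

Calling read_status_kb() at the end of minimal.py, both under qsub and directly on a node, would show whether the large figure is virtual address space rather than RAM actually in use.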
Thanks,
Jérémie

2012/9/25 Reuti <[email protected]>:
> On 25.09.2012 at 14:26, Jérémie Dubois-Lacoste wrote:
>
>> Hi All,
>>
>> We recently reinstalled our cluster and we have some serious issues.
>> Contrary to our previous installation, we now installed a fully 64-bit
>> system. We use Rocks cluster 6/CentOS 6.3,
>> and SGE 6.2u5.
>>
>> The memory values reported by SGE are very high compared
>> to the actual needs of every job, and many get killed because
>> they exceed the limit, while they should not.
>> I found this thread about too low memory reports:
>> http://comments.gmane.org/gmane.comp.clustering.gridengine.users/19303
>>
>> But I didn't find anything about too high memory reports...
>>
>>
>> Here is a simple test to make it clear:
>>
>> I submit a very stupid python script "minimal.py", which is just:
>> -----
>> import time
>>
>> time.sleep(30)
>> print("done")
>> -----
>>
>> * I tried to run it directly to check the memory consumption with:
>> $ /usr/bin/time -v python minimal.py
>> And I get: Maximum resident set size (kbytes): 15376
>>
>>
>> * Then, when submitting the job with:
>> qsub -m ase -M <my_mail> -b y -N memTest -o test.out -e test.err -cwd
>> "python minimal.py"
>> I go to the compute node where it gets scheduled and run "top":
>>   PID USER    PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
>> 20240 myName  23  3  114m 3844 1832 S  0.0  0.0 0:00.14 python minimal.py
>
> The virtual size is listed here as 114m as well.
>
> -- Reuti
>
>
>> So I understand it uses 3.8 MB of RAM.
>>
>>
>> * But from the e-mail I get when the job terminates:
>> Job 1879536 (memTest) Complete
>> User = myName
>> Queue = [email protected]
>> Host = compute-3-0.local
>> Start Time = 09/25/2012 13:46:45
>> End Time = 09/25/2012 13:47:15
>> User Time = 00:00:00
>> System Time = 00:00:00
>> Wallclock Time = 00:00:30
>> CPU = 00:00:00
>> Max vmem = 114.441M
>> Exit Status = 0
>>
>>
>> It says 114 MB; I don't understand this huge difference.
>>
>>
>> The consequence is that most of the jobs get killed for "fakely" (I presume)
>> exceeding the hard memory limit. Any clue is welcome!
>>
>>
>> Sincerely,
>>
>> Jérémie
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
