Looks like I was overlooking the "exit status: 1", so I guess my script was being killed back then (as it should have been once it exceeded the h_vmem limit). Any thoughts on why it was being killed before but not now?
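For reference, the accounting record should show whether those old kills really came from the limit; a quick check (the job id below is only a placeholder):

    qacct -j 123456 | grep -E 'failed|exit_status|maxvmem'

A non-zero exit_status alongside a maxvmem at or just under the requested h_vmem would be consistent with the limit doing the killing.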
Brett Taylor
Systems Administrator
Center for Systems and Computational Biology

The Wistar Institute
3601 Spruce St.
Room 214
Philadelphia PA 19104
Tel: 215-495-6914
Sending me a large file? Use my secure dropbox:
https://cscb-filetransfer.wistar.upenn.edu/dropbox/[email protected]

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Brett Taylor
Sent: Monday, February 18, 2013 4:17 PM
To: [email protected]
Subject: Re: [gridengine users] h_vmem not actually restricting memory usage?

I was finally able to get some free time to run a few tests and get some actual numbers showing that memory isn't actually being limited, for some reason. I submitted two jobs:

$ qsub -pe smp 12 -l h_vmem=8G -q long.q test.sh
$ qsub -pe smp 12 -l h_vmem=4G -q long.q test.sh

So the first should use 96G of RAM in total, the second 48G. These are both well under the actual vmem on the systems, 142G (physically there is 144G of RAM per node). Here are the output stats from both jobs:

1::
  User Time      = 27:03:25:35
  System Time    = 02:09:29
  Wallclock Time = 2:20:37:22
  CPU            = 27:05:35:05
  Max vmem       = 35.449G

2::
  User Time      = 26:21:55:44
  System Time    = 02:10:08
  Wallclock Time = 2:20:08:33
  CPU            = 27:00:05:52
  Max vmem       = 35.449G
  Exit Status    = 0

And my settings for h_vmem are as follows:

# qconf -se compute-0-0    [all nodes are the same]
...
complex_values        slots=36,h_vmem=142G
...

# qconf -sc
...
h_vmem   h_vmem   MEMORY   <=   YES   YES   3.95G   0
...

# qconf -sq long.q
...
h_vmem                142G
...

I just changed the queue config from INFINITY to 142G before this test; when it was INFINITY I was getting the same results. I have users actively running things at the moment, so I don't want to fiddle with too many settings to test out theories, but the only thing I can think of is that I need to change the complex value back to JOB. I can't think of anything else that might be affecting this, because these are the only settings I've changed since I started trying to figure out how to limit the memory usage.

Here's some data from tests I ran a few months ago, with the complex set to YES, just trying to figure out whether more cores or more RAM helped this script. [h_vmem is the '-l h_vmem=x' amount; "default" means none specified, which I hoped would limit the job to the "default" configured in the complex config (3.5G).] At that time our nodes had 96G of physical RAM installed.

h_vmem    threads    hours    max vmem
8         12         67.26    35.342
7.5       12         32.39     7.878
7.5       12         32.03     7.878
default   12         67.03    35.342
default   24         68.92    35.344
3         24         19.86     3.932
3.5       24         18.29     3.989

So you can see I was able to limit the RAM on some runs, even if it was never limited to the exact amount requested. I am not sure what else has changed in my queue configuration since then. Any other ideas?

Brett Taylor
Systems Administrator
Center for Systems and Computational Biology

The Wistar Institute
3601 Spruce St.
Room 214
Philadelphia PA 19104
Tel: 215-495-6914
Sending me a large file? Use my secure dropbox:
https://cscb-filetransfer.wistar.upenn.edu/dropbox/[email protected]

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Brett Taylor
Sent: Friday, February 08, 2013 4:40 PM
To: Alex Chekholko; [email protected]
Subject: Re: [gridengine users] h_vmem not actually restricting memory usage?

Thanks for the tip. I just tried that with h_vmem set to 71G in the queue definition, and set to INFINITY, submitting jobs with different -l h_vmem= limits. Both had the same effect, so that doesn't seem to explain the mixed results from my tests.
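As an aside, a much quicker way to see whether a limit actually bites than waiting out another multi-day test.sh run would be a throwaway job along these lines; this is just a sketch (the memtest.sh name, the 1G request and the use of GNU dd as the allocator are example choices):

    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd
    # Show the limits the job actually sees; if h_vmem is being applied,
    # the 'virtual memory' line should reflect the -l h_vmem request
    # (scaled by the slot count on the host for a per-slot consumable).
    ulimit -a
    # Try to allocate a single 2G buffer, i.e. more than a 1G request;
    # with enforcement working this should fail rather than succeed.
    dd if=/dev/zero of=/dev/null bs=2G count=1
    echo "dd exit status: $?"

Submitted with something like "qsub -l h_vmem=1G memtest.sh", an enforced limit should make dd die with a memory-allocation error and leave a non-zero exit status; if dd completes, nothing is actually restricting the job.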
Brett Taylor
Systems Administrator
Center for Systems and Computational Biology

The Wistar Institute
3601 Spruce St.
Room 214
Philadelphia PA 19104
Tel: 215-495-6914
Sending me a large file? Use my secure dropbox:
https://cscb-filetransfer.wistar.upenn.edu/dropbox/[email protected]

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Alex Chekholko
Sent: Friday, February 08, 2013 2:20 PM
To: [email protected]
Subject: Re: [gridengine users] h_vmem not actually restricting memory usage?

Hi Brett,

I saw the issue with JOB vs YES when the jobs were requesting more than one complex, e.g. qsub -l h_vmem=10G,h_stack=10M was not hitting memory limits when "JOB" was set. If your jobs only request one complex, the JOB setting should work as expected, and you can try it both ways and compare the results.

For troubleshooting, I also usually just make a super-short script that prints the output of 'ulimit -a' from inside the job environment, and check that the 'ulimit -v' value matches your h_vmem value (a minimal sketch of such a script follows at the end of this message). Also check your 'qhost -F h_vmem' output to see that it looks as you expect.

Regards,
Alex

On 2/7/13 12:42 PM, Brett Taylor wrote:
> Hello,
>
> I've been testing out the h_vmem settings for a while now, and currently I have this setup:
>
> Exec host
>   complex_values   slots=36,h_vmem=142G
> high_priority.q
>   h_vmem           INFINITY
>   slots            24
>   priority         0
> low_priority.q
>   h_vmem           INFINITY
>   priority         18
>   slots            12
> qconf -sc
>   h_vmem   h_vmem   MEMORY   <=   YES   YES   3.95G   0
>
> I know that there has been discussion of a bug with respect to setting the complex to JOB, which is why I settled on this configuration a few months ago in order to have two queues without oversubscribing the memory. However, this doesn't seem to actually limit the memory usage during run time, the way I have seen GE do before.
>
> I have one script that I have been using to benchmark my cluster and work out the queue stats. It runs tophat and bowtie, and my metrics for knowing whether the memory is being limited are the "Max vmem:" and "Wall clock time:" stats. If the memory isn't limited and I submit the job using 24 cores, I'll see "Max vmem: 35.342G" and a wall clock time around 2:20:00:00. When I was able to limit the vmem, I saw stats more like "Wallclock Time = 19:51:49 ... Max vmem = 3.932G". As you can see, 19 hours is a lot quicker than 2 days.
>
> I don't have definitive proof, but I think changing to JOB and setting a limit in the queue definition, instead of INFINITY, might restore the actual run-time limit. But then I wouldn't be able to have two queues the way I have them now. I'd like to test this myself, but my tiny cluster is full at the moment. Can anyone confirm these settings for me?
>
> Thanks,
> Brett
>
> Brett Taylor
> Systems Administrator
> Center for Systems and Computational Biology
>
> The Wistar Institute
> 3601 Spruce St.
> Room 214
> Philadelphia PA 19104
> Tel: 215-495-6914
> Sending me a large file? Use my secure dropbox:
> https://cscb-filetransfer.wistar.upenn.edu/dropbox/[email protected]
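A minimal version of the "super-short script" Alex describes above might be nothing more than the following (the ulimit_check.sh name and the 4G request are just examples):

    #!/bin/bash
    #$ -S /bin/bash
    # Print every limit the job environment sees; the 'virtual memory'
    # line is the one that should correspond to the h_vmem request.
    ulimit -a

Submitted with, say, "qsub -l h_vmem=4G ulimit_check.sh", a 4G request should show up as 4194304 kbytes (typically scaled by the slot count on the host for a parallel job with a per-slot consumable). On the host side, "qhost -F h_vmem" shows how much of the consumable the scheduler thinks each node still has available, which is worth comparing against the complex_values setting quoted above.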
