I was finally able to get some free time to run a few tests and gather some 
actual numbers showing that memory isn't actually being limited, for some 
reason.  I submitted two jobs:

$ qsub -pe smp 12 -l h_vmem=8G -q long.q test.sh
$ qsub -pe smp 12 -l h_vmem=4G -q long.q test.sh

So the first should be limited to 96G of RAM in total and the second to 48G.  
Both are well under the h_vmem configured on the nodes, 142G (physically there 
is 144G of RAM per node).  Here are the output stats from both jobs:

Job 1:
 User Time        = 27:03:25:35
 System Time      = 02:09:29
 Wallclock Time   = 2:20:37:22
 CPU              = 27:05:35:05
 Max vmem         = 35.449G

Job 2:
 User Time        = 26:21:55:44
 System Time      = 02:10:08
 Wallclock Time   = 2:20:08:33
 CPU              = 27:00:05:52
 Max vmem         = 35.449G
 Exit Status      = 0
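
As a sanity check that the limit is even propagating into the job environment 
(following Alex's ulimit suggestion), a throwaway wrapper along these lines 
could be submitted in place of the real workload (check_limits.sh here is just 
a hypothetical sketch, not my actual test.sh):

$ cat check_limits.sh
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
# Print the resource limits the job actually sees; 'ulimit -v' reports the
# virtual memory limit in kbytes and, as I understand it, should reflect the
# h_vmem request (possibly multiplied by the slots allocated on the host for
# a PE job).
echo "=== ulimit -a inside the job ==="
ulimit -a
echo "=== virtual memory limit (kbytes) ==="
ulimit -v

$ qsub -pe smp 12 -l h_vmem=8G -q long.q check_limits.sh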

And my settings for h_vmem are as follows:
# qconf -se compute-0-0  [all nodes are the same]
...
complex_values        slots=36,h_vmem=142G
...

# qconf -sc
...
h_vmem              h_vmem     MEMORY      <=    YES         YES        3.95G    0
...

# qconf -sq long.q
...
h_vmem                142G
...

I changed the queue config from INFINITY to 142G just before this test; when it 
was INFINITY I was getting the same results.

I have users actively running things at the moment, so I don't want to fiddle 
with too many settings to test out theories, but the only thing I can think of 
is that I need to change the complex value back to JOB (see the sketch below).  
I can't think of anything else that might be affecting this, because these are 
the only settings I've changed since I started trying to figure out how to 
limit memory usage.
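
If I do get a window to test, the change I'd try is roughly the following 
(sketch only; the qconf -sc columns are name / shortcut / type / relop / 
requestable / consumable / default / urgency):

# qconf -mc    # opens the complex configuration in $EDITOR
# change the consumable column for h_vmem from YES to JOB:
#
# before:  h_vmem   h_vmem   MEMORY   <=   YES   YES   3.95G   0
# after:   h_vmem   h_vmem   MEMORY   <=   YES   JOB   3.95G   0

My understanding is that with JOB the requested amount is counted once per job 
rather than once per slot, so the per-host complex_values and the per-queue 
h_vmem limit would probably need revisiting at the same time.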

Here's some data from tests I ran a few months ago, with the complex set to 
YES, trying to figure out whether more cores or more RAM helped this script.  
(h_vmem is the '-l h_vmem=x' amount; 'default' means none was specified, which 
I hoped would limit the job to the default configured in the complex config, 
3.5G.)  At that time our nodes had 96G of physical RAM installed.

h_vmem (G)   threads   hours   max vmem (G)
8            12        67.26   35.342
7.5          12        32.39   7.878
7.5          12        32.03   7.878
default      12        67.03   35.342
default      24        68.92   35.344
3            24        19.86   3.932
3.5          24        18.29   3.989

So you can see I was able to limit the RAM on some runs, even if it was never 
limited to exactly the amount requested.  I am not sure what else has changed 
in my queue configuration since then.
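
In case it helps, the relevant pieces of the config can be dumped and compared 
against older copies with the standard qconf show commands (the *.old files 
here are just hypothetical saved snapshots):

$ qconf -sc > complexes.now && diff complexes.old complexes.now
$ qconf -sq long.q > long.q.now && diff long.q.old long.q.now
$ qconf -se compute-0-0 > compute-0-0.now && diff compute-0-0.old compute-0-0.now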

Any other ideas?

Brett Taylor
Systems Administrator
Center for Systems and Computational Biology

The Wistar Institute
3601 Spruce St.
Room 214
Philadelphia PA 19104
Tel: 215-495-6914
Sending me a large file? Use my secure dropbox:
https://cscb-filetransfer.wistar.upenn.edu/dropbox/[email protected]

-----Original Message-----
From: [email protected] [mailto:[email protected]] On 
Behalf Of Brett Taylor
Sent: Friday, February 08, 2013 4:40 PM
To: Alex Chekholko; [email protected]
Subject: Re: [gridengine users] h_vmem not actually restricting memory usage?

Thanks for the tip.  I just tried that with h_vmem set to 71G in the queue 
definition, and then set to INFINITY, submitting jobs with different -l h_vmem= 
limits.  Both had the same effect, so that doesn't seem to explain the mixed 
results from my tests.

Brett Taylor
Systems Administrator
Center for Systems and Computational Biology

The Wistar Institute
3601 Spruce St.
Room 214
Philadelphia PA 19104
Tel: 215-495-6914
Sending me a large file? Use my secure dropbox:
https://cscb-filetransfer.wistar.upenn.edu/dropbox/[email protected]


-----Original Message-----
From: [email protected] [mailto:[email protected]] On 
Behalf Of Alex Chekholko
Sent: Friday, February 08, 2013 2:20 PM
To: [email protected]
Subject: Re: [gridengine users] h_vmem not actually restricting memory usage?

Hi Brett,

I saw the issue with JOB vs YES when the jobs were requesting more than one 
complex, e.g. qsub -l h_vmem=10G,h_stack=10M was not hitting memory limits when 
"JOB" was set.

If your jobs are only requesting one complex, the JOB setting should work as 
expected, and you can try both ways and test the results.

For troubleshooting, I also usually just make a super-short script that prints 
'ulimit -a' from inside the job environment, and check that the 'ulimit -v' 
value matches your h_vmem value.

Also check your 'qhost -F h_vmem' output to see that it looks as you expect.

Regards,
Alex

On 2/7/13 12:42 PM, Brett Taylor wrote:
> Hello,
>
> I've been testing out the h_vmem settings for a while now, and currently I 
> have this setup:
>
> Exec host
>       complex_values        slots=36,h_vmem=142G
> high_priority.q
>       h_vmem INFINITY
>       slots                 24
>       priority              0
> low_priority.q
>       h_vmem INFINITY
>       priority              18
>       slots                 12
> qconf -sc
>       h_vmem              h_vmem     MEMORY      <=    YES         YES        3.95G    0
>
> I know that there has been discussion of a bug with respect to setting the 
> complex to JOB, which is why I settled on this configuration a few months ago 
> in order to have two queues without oversubscribing the memory.  However, 
> this doesn't seem to actually limit the memory usage during run time, like I 
> have seen GE do before.
>
> I have one script that I have been using to benchmark my cluster and figure 
> out the queue stats.  It runs tophat and bowtie and my metrics for knowing if 
> the memory is being limited are the "Max vmem:" and "Wall clock time:" stats. 
>  If the memory isn't limited and I submit the job using 24 cores, I'll
> see "Max vmem: 35.342G" and a wall clock time around 2:20:00:00.  When I was 
> able to limit the vmem, I saw stats more like " Wallclock Time   = 
> 19:51:49... Max vmem         = 3.932G".  As you can see, 19 hours is a lot 
> quicker than 2 days.
>
> I don't have definitive proof, but I think changing to JOB and setting a 
> limit in the queue definition, instead of INFINITY, might restore the actual 
> runtime limit. But, then I wouldn't be able to have two queues in the way I 
> have them now.  I'd like to test this myself but my tiny cluster is full at 
> the moment. Can anyone confirm these settings for me?
>
> Thanks,
> Brett
>
>
> Brett Taylor
> Systems Administrator
> Center for Systems and Computational Biology
>
> The Wistar Institute
> 3601 Spruce St.
> Room 214
> Philadelphia PA 19104
> Tel: 215-495-6914
> Sending me a large file? Use my secure dropbox:
> https://cscb-filetransfer.wistar.upenn.edu/dropbox/[email protected]
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
