Hi Alex,

Try setting the consumable to YES instead of JOB. With YES, the request is applied per slot, and the exec host needs an h_vmem limit defined (in your case, 46G). With JOB, the resource is debited from the queue resource, which in your case I suspect has no limit (h_vmem is set to INFINITY).
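To make the arithmetic concrete, here is a minimal Python model of consumable bookkeeping. It assumes a simplified view in which the host's h_vmem is a single counter in whole gigabytes. The function names (`debit`, `remaining`, `jobs_that_fit`) are hypothetical, and this is not Grid Engine's implementation. With the numbers from the report quoted below (46G on scg1-4-8, 16G per single-slot job, 11 jobs), it reproduces the hc:h_vmem=-130.000G that qhost shows. A scheduler enforcing the 46G limit would have started only two such jobs.

```python
# Hypothetical, simplified model of Grid Engine consumable bookkeeping.
# Treats the host's h_vmem (in whole gigabytes) as one counter; this only
# illustrates the arithmetic and is not Grid Engine's implementation.


def debit(request_g: int, slots: int, consumable: str) -> int:
    """Gigabytes of h_vmem one job removes from the counter."""
    if consumable == "YES":
        return request_g * slots  # charged once per slot
    if consumable == "JOB":
        return request_g  # charged once per job, whatever the slot count
    raise ValueError(f"unknown consumable mode: {consumable!r}")


def remaining(capacity_g: int, request_g: int, slots: int,
              consumable: str, n_jobs: int) -> int:
    """Counter value after n_jobs identical jobs have been debited."""
    return capacity_g - n_jobs * debit(request_g, slots, consumable)


def jobs_that_fit(capacity_g: int, request_g: int, slots: int,
                  consumable: str) -> int:
    """Identical jobs an enforcing scheduler would start before the
    counter would go negative."""
    return capacity_g // debit(request_g, slots, consumable)


if __name__ == "__main__":
    # Numbers from the report: 46G host, 16G per single-slot job, 11 jobs.
    print(remaining(46, 16, 1, "JOB", 11))  # -130, matching hc:h_vmem=-130.000G
    print(jobs_that_fit(46, 16, 1, "JOB"))  # 2
    # A hypothetical 4-slot job requesting 16G shows where the modes differ.
    print(debit(16, 4, "YES"), debit(16, 4, "JOB"))  # 64 16
```

In this model, single-slot jobs like these debit the same amount under YES and JOB. The two modes differ in amount only for parallel jobs, where YES multiplies the request by the slot count.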
According to Reuti, this may be a bug. The expected behaviour is that, even with the consumable set to JOB and the queue's h_vmem left at its default of INFINITY, the node's tighter h_vmem limit should still take effect. However, that does not seem to be happening (Brett Taylor had the same issue last month).

Ian

On Wed, Jan 9, 2013 at 12:29 AM, Alex Chekholko <[email protected]> wrote:
> Hi all,
>
> I have what seems like a straightforward problem.
>
> This is on Open Grid Scheduler 2011.11...
>
> h_vmem is configured as consumable:
> # qconf -sc | grep h_vmem
> h_vmem h_vmem MEMORY <= YES JOB 4G 0
>
> The exec host is configured to have 46G of h_vmem:
> # qconf -se scg1-4-8 | grep h_vmem
> complex_values h_vmem=46G,slots=12
>
> The user requests 16G of h_vmem in his job:
> # qstat -f -j 480039 |grep h_vmem
> hard resource_list: h_vmem=16G
>
> But the scheduler puts a whole bunch of these on the same node!
>
> # qhost -j -F h_vmem
> ...
>
> scg1-4-8 linux-x64 24 - 47.3G - 9.8G -
>    Host Resource(s): hc:h_vmem=-130.000G
>    480039 0.50074 a_STL002-O shinlin r 01/08/2013 20:21:18 standard@s MASTER
>    480041 0.50074 a_STL002-O shinlin r 01/08/2013 20:21:18 standard@s MASTER
>    480042 0.50074 a_STL002-O shinlin r 01/08/2013 20:21:18 standard@s MASTER
>    480043 0.50074 a_STL002-O shinlin r 01/08/2013 20:21:18 standard@s MASTER
>    480044 0.50074 a_STL002-O shinlin r 01/08/2013 20:21:18 standard@s MASTER
>    480045 0.50074 a_STL002-O shinlin r 01/08/2013 20:21:18 standard@s MASTER
>    480046 0.50074 a_STL002-O shinlin r 01/08/2013 20:21:18 standard@s MASTER
>    480047 0.50074 a_STL002-O shinlin r 01/08/2013 20:21:18 standard@s MASTER
>    480048 0.50074 a_STL002-O shinlin r 01/08/2013 20:21:18 standard@s MASTER
>    480049 0.50074 a_STL002-O shinlin r 01/08/2013 20:21:18 standard@s MASTER
>    480050 0.50074 a_STL002-O shinlin r 01/08/2013 20:21:18 standard@s MASTER
>
> How do I go about troubleshooting this? It seems to have been working
> fine for a while (months), it just started doing this a few days ago.
>
> I did have the same problem earlier, but as I understand it, there is a
> bug related to either multiple queue instances on a host or multiple
> consumable requests for the job, but in this case it is neither. The host
> only has this one queue instance, and the job only requests this one
> complex.
>
> Suggestions?
>
> Regards,
> --
> Alex Chekholko [email protected]
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

--
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering
ikaufman AT ucsd DOT edu
