On 09.04.2014 at 15:43, Fan Dong wrote:

> Here comes another question. comp92 is one of our exec nodes; it has 32GB
> of memory. bigmem_16.q has 2 slots on this node. A bunch of jobs were
> submitted with -l h_vmem=16G. For the first few hours both slots on comp92
> were used just as expected, until this morning, when I found the following
> numbers that I do not understand. First off, only one slot on comp92 is in
> use (the same is true for the rest of our nodes similar to comp92). The
> available h_vmem on comp92 shows 13.802G, and that is why jobs are held in
> the queue: they request h_vmem=16G. What might be the cause of this 'weird'
> h_vmem value?
>
> This is taken from running 'qhost -F h_vmem':
>
> HOSTNAME     ARCH       NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
> comp92       linux-x64     8  1.00   31.4G    1.9G    2.0G     0.0
>     Host Resource(s):   hc:h_vmem=13.802G
>
> This is taken from running 'qstat -f -q bigmem_16.q':
>
> queuename                 qtype resv/used/tot. load_avg arch       states
> ------------------------------------------------------------------------------
> bigmem_16.q@comp92        BI    0/1/2          1.00     linux-x64
>     44946 0.55500 pipeline_w training     r     04/09/2014 07:19:30     1

What was defined in `qconf -se comp92` for "h_vmem"?

-- Reuti
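For reference, the "hc:" prefix in the qhost output marks a host-level consumable: the number shown is what is left of the host's configured h_vmem after subtracting the h_vmem requests of the jobs already running there. A quick way to compare the configured total with the remaining amount - the values below are purely illustrative and not taken from comp92:

    # total configured on the host (hypothetical value)
    $ qconf -se comp92 | grep complex_values
    complex_values        h_vmem=32G

    # what the scheduler can still hand out on that host; with one 16G job
    # running against a 32G total this would read 16G
    # (other qhost output lines omitted)
    $ qhost -F h_vmem -h comp92
       Host Resource(s):   hc:h_vmem=16.000G

If the two numbers do not add up as expected, the host's complex_values entry is the first thing to inspect, which is what the question above is getting at.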
> On 04/08/2014 12:13 PM, Reuti wrote:
>> On 08.04.2014 at 16:45, Fan Dong wrote:
>>
>>> On 04/08/2014 10:41 AM, Reuti wrote:
>>>
>>>> On 08.04.2014 at 16:34, Fan Dong wrote:
>>>>
>>>>> On 04/08/2014 10:17 AM, Reuti wrote:
>>>>>
>>>>>> On 08.04.2014 at 16:02, Fan Dong wrote:
>>>>>>
>>>>>>> Thanks for the help. I guess part of my original question was: will
>>>>>>> h_vmem help the scheduler hold off a job if the node does not have
>>>>>>> enough h_vmem left?
>>>>>>>
>>>>>>> Say we have
>>>>>>> • a consumable h_vmem (qconf -mc) with a default value of 4GB,
>>>>>>> • the exec hosts h1 and h2, which both have h_vmem = 32GB (qconf -me),
>>>>>>> • the queue a.q, which is configured with 18GB h_vmem (qconf -mq).
>>>>>>>
>>>>>>> What happens if a user sends 3 jobs to a.q, assuming there are more
>>>>>>> than two slots on each of the hosts? Will
>>>>>>> • all 3 jobs get to run simultaneously?
>>>>>>
>>>>>> Yes. 4 GB times 3 will fit into the available 32 GB.
>>>>>
>>>>> Then what is the use of the h_vmem setting in the queue? h_vmem has
>>>>> the value of 18GB in a.q; how does that come into play?
>>>>
>>>> It is the maximum a user can request per job.
>>>>
>>>> They get 4 GB by default, but they can request more - up to 18 GB for a
>>>> particular job. In case they request more than 18 GB, the job will
>>>> never start.
>>>>
>>>> Nevertheless, the overall consumption of memory will be restricted by
>>>> the definition on the host level, i.e. all jobs in total on an exechost
>>>> will never exceed 32 GB.
>>>
>>> Excellent! But just to double check - if a user does not explicitly use
>>> qsub -l h_vmem in the submission script, the default 4GB will be used.
>>> Is that correct?
>>
>> As long as the "h_vmem" complex is assigned to each exechost with a
>> suitable value: yes.
>>
>> -- Reuti
>>
>>>> -- Reuti
>>>>
>>>>> Shouldn't the h_vmem in the queue override the default global
>>>>> consumable value? You suggested earlier that the h_vmem attached to
>>>>> the queue is enforced per job, but your calculation '4 GB times 3'
>>>>> seems to ignore the h_vmem in the queue. Could you please clarify?
>>>>> Thank you.
>>>>>
>>>>>> In case the user requests more memory, like 18 GB for each of them,
>>>>>> it will be different of course.
>>>>>>
>>>>>> -- Reuti
>>>>>>
>>>>>>> • or does one job have to be held off? (because h_vmem on each of
>>>>>>> the hosts would decrease to 32-18=14G, not enough for the third job)
>>>>>>>
>>>>>>> On 04/07/2014 11:29 AM, Reuti wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 07.04.2014 at 17:10, Fan Dong wrote:
>>>>>>>>
>>>>>>>>> I am a little confused about the consumable h_vmem setup on the
>>>>>>>>> node and the queue. Let's say we have one queue, called a.q, that
>>>>>>>>> spans two hosts, h1 and h2. h1 has 32GB of RAM and h2 has 128GB.
>>>>>>>>>
>>>>>>>>> I attached h_vmem to both hosts, using the value of the actual
>>>>>>>>> physical RAM,
>>>>>>>>
>>>>>>>> You defined this value via `qconf -me ...` => "complex_values"?
>>>>>>>>
>>>>>>>>> also a.q has a default h_vmem value of 18GB, which is the peak
>>>>>>>>> memory usage of the job.
>>>>>>>>
>>>>>>>> Yes, the setting in the queue is per job, while in the exechost
>>>>>>>> definition it's across all jobs.
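To make the per-job vs. per-host split easier to see, here is a minimal sketch of the three places involved, using the same illustrative numbers as the example discussed above (a 4G default, 32G per exec host, 18G per job in a.q); all names and values are only examples, and the script name job.sh is made up:

    # 1) complex definition, one row of `qconf -mc`: consumable, and 4G is
    #    charged whenever a job does not request h_vmem explicitly
    #name    shortcut  type    relop  requestable  consumable  default  urgency
    h_vmem   h_vmem    MEMORY  <=     YES          YES         4G       0

    # 2) exec host, `qconf -me h1`: the total all jobs on h1 may consume
    complex_values        h_vmem=32G

    # 3) queue, `qconf -mq a.q`: the per-job ceiling (also a hard limit)
    h_vmem                18G

    # a request above the 4G default but below the 18G queue ceiling;
    # three such jobs would still fit into h1's 32G
    $ qsub -l h_vmem=10G job.sh

With a setup like this, a job asking for more than 18G stays pending forever, while the sum of the requests of the jobs running on h1 can never exceed 32G.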
>>>>>>>>> Here is how I understand the way h_vmem works. When the first job
>>>>>>>>> in a.q is sent to node h1, the h_vmem on the node will decrease to
>>>>>>>>> 32-18=14GB,
>>>>>>>>
>>>>>>>> Did you make the "h_vmem" complex consumable in `qconf -mc`? What
>>>>>>>> is the default value specified there for it?
>>>>>>>>
>>>>>>>> You checked with `qhost -F h_vmem` and the values are not right?
>>>>>>>>
>>>>>>>>> the h_vmem attached to the queue will make sure that the job won't
>>>>>>>>> use more than 18GB of memory. When the second job comes in, it
>>>>>>>>> will be sent to node h2 because there is not enough h_vmem left on
>>>>>>>>> node h1.
>>>>>>>>
>>>>>>>> ...as the value was subtracted on the host level.
>>>>>>>>
>>>>>>>>> I am not sure whether I am correct about h_vmem, as I have the
>>>>>>>>> impression that h_vmem won't stop jobs from being sent to a node
>>>>>>>>> but virtual_free does. Any suggestions?
>>>>>>>>
>>>>>>>> Keep in mind that "h_vmem" is a hard limit, while "virtual_free" is
>>>>>>>> only a hint for SGE on how to distribute jobs and allows a job to
>>>>>>>> consume more than it requested. It depends on the workflow what
>>>>>>>> fits best.
>>>>>>>>
>>>>>>>> -- Reuti
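A concrete way to see that difference: on a typical installation, the h_vmem a job requests is not only debited from the host consumable, it is also applied to the job's processes as a virtual-memory limit (via setrlimit), so allocations beyond it simply fail. virtual_free only influences where the scheduler places the job. A small sketch with a made-up script and a 2G request; the exact numbers are illustrative:

    #!/bin/sh
    #$ -cwd
    #$ -l h_vmem=2G
    # Inside the job the request shows up as the process's virtual-memory
    # ulimit (reported in kilobytes, so roughly 2097152 for 2G). Allocating
    # past it fails, which is what makes h_vmem a hard limit.
    ulimit -v

Had the same 2G been requested as virtual_free instead, the job could still allocate more than 2G; the scheduler would merely have planned the placement around the request.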
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users