On 14.06.2011 at 09:45, Javier Lopez Cacheiro wrote:

> Hi Reuti,
> 
> On 13/06/11 15:30, Reuti wrote:
>> Hi,
>> 
>> On 13.06.2011 at 15:12, Javier Lopez Cacheiro wrote:
>> 
>>> We have found a strange situation where GE 6.2u5 has allocated more 
>>> resources in a node than available, leaving a consumable with a value lower 
>>> than 0 (in this case the consumable is num_proc).
>>> 
>>> This is somewhat similar to an issue that was found some time ago in SGE 6.2 
>>> (issue 2091), but that case was related to MPI jobs with the fill_up 
>>> allocation rule, and it was already fixed in 6.2u3.
>>> 
>>> This case is somewhat different because it does not affect MPI jobs but a 
>>> non-MPI job, and it occurs only in certain circumstances that are still 
>>> not clear.
>>> 
>>> In this case the situation was that at 06:13:57 the node already had 7 jobs 
>>> running, consuming 24 units of num_proc. num_proc is configured as a 
>>> consumable with a value of 24, so at that time the value of num_proc was 0. 
>>> But 4 seconds later, at 06:14:01, a new job that requested 24 num_proc was 
>>> started on the node, leaving the node with a value of -24 for num_proc.
>> num_proc is a (fixed) feature of a node and shouldn't be made consumable. Is 
>> there any reason why you don't use slots?
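For reference, the usual way to cap the number of occupied processors per host is the built-in slots consumable, set as a complex value in the execution host configuration rather than via num_proc (the host name and value here are just the ones from this thread):

```
# qconf -me compute-5-11        # edit the execution host configuration
hostname        compute-5-11
complex_values  slots=24        # at most 24 slots across all queues on this host
```

Unlike a hand-made consumable, a host-level slots limit is enforced across all queue instances on the host.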
>> 
> num_proc is used for historical reasons, not sure why slots was not chosen 
> instead.
> 
> In the other case where we found num_proc < 0, we also did some tests using a 
> new complex instead of num_proc, with the same results.
> 
> In this case it is difficult to reproduce the problem using a new complex 
> because it has been an uncommon situation and it is not clear which 
> circumstances led to it.
> 
> For example it is quite strange that all the jobs entered the node within 
> less than 1 minute. The only warnings that appear in the node's log at that 
> time are related to core binding:
> 
> 06/10/2011 06:12:35|  main|compute-5-11|W|Core binding: Couldn't determine 
> core binding string for config file!
> 06/10/2011 06:13:01|  main|compute-5-11|W|Core binding: Couldn't determine 
> core binding string for config file!
> 06/10/2011 06:13:30|  main|compute-5-11|W|Core binding: Couldn't determine 
> core binding string for config file!
> 06/10/2011 06:13:41|  main|compute-5-11|W|Core binding: Couldn't determine 
> core binding string for config file!
> 06/10/2011 06:13:54|  main|compute-5-11|W|Core binding: Couldn't determine 
> core binding string for config file!
> 06/10/2011 06:13:57|  main|compute-5-11|W|Core binding: Couldn't determine 
> core binding string for config file!
> 06/10/2011 06:14:01|  main|compute-5-11|W|Core binding: Couldn't determine 
> core binding string for config file!

These can safely be ignored.


>> Nevertheless: do you request anything else with the -l option?
> Yes, several other complexes are also requested: h_fsize, s_vmem and s_rt

Then it looks like the issue I posted, although I referred more to limits.


> I cannot tell now whether the other consumable complexes (h_fsize

You made h_fsize consumable? It's a per-process limit, so the total amount can 
be bypassed by several processes of the same job anyway.
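This is what makes a consumable total for h_fsize ineffective: the limit maps to the per-process RLIMIT_FSIZE, and every process of the job gets the full limit again. A minimal sketch of the bypass, assuming a POSIX system with fork (the 1 MiB limit is a made-up stand-in for a real h_fsize value):

```python
import os
import resource
import tempfile

LIMIT = 1024 * 1024  # hypothetical h_fsize of 1 MiB


def run():
    """Fork two children under one RLIMIT_FSIZE; each inherits its own
    per-process copy of the limit, so together they can write more than
    LIMIT bytes in total."""
    resource.setrlimit(resource.RLIMIT_FSIZE, (LIMIT, LIMIT))
    total = 0
    with tempfile.TemporaryDirectory() as d:
        for i in range(2):
            path = os.path.join(d, "out%d" % i)
            pid = os.fork()
            if pid == 0:
                # Child: stays safely under its *own* per-process limit.
                with open(path, "wb") as f:
                    f.write(b"x" * (3 * LIMIT // 4))
                os._exit(0)
            os.waitpid(pid, 0)
            total += os.path.getsize(path)
    return total  # 1.5 MiB written although the "limit" was 1 MiB
```

Each child writes only 0.75 MiB, well under its own limit, yet the job as a whole produces 1.5 MiB on disk.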


> and s_vmem)

I think s_vmem doesn't need to be consumable, as you made h_vmem consumable 
already. It only tells SGE when to send the SIGXCPU warning.
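Since the soft limit only results in a warning signal, a job that wants to benefit from it has to handle SIGXCPU itself, e.g. to checkpoint before the hard limit kills it. A minimal sketch (the signal is raised manually here to stand in for what the execution daemon would deliver):

```python
import os
import signal

caught = []


def on_xcpu(signum, frame):
    # A real job would flush results or write a checkpoint here.
    caught.append(signum)


signal.signal(signal.SIGXCPU, on_xcpu)

# Simulate the warning the execution daemon would send when a soft
# limit such as s_vmem is exceeded:
os.kill(os.getpid(), signal.SIGXCPU)
```

Without such a handler the warning is of little use: the default action for SIGXCPU terminates the process.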

-- Reuti


> also had negative values, but I guess not, because disk and memory consumption 
> on the node was far below the available resources.
> 
> Cheers,
> Javier
>> -- Reuti
>> 
>> 
>>> I don't know if anyone else has run into this same problem with 6.2u5 and 
>>> whether there is a workaround for it.
>>> 
>>> [jlopez@svgd ~]$ qhost -q -j -h c5-11
>>> HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>> -------------------------------------------------------------------------------
>>> global                  -               -     -       -       -       -       -
>>> compute-5-11            x86_64        -24 47.92   31.5G    9.0G    8.0G     0.0
>>> GRID_large BP 0/4/24
>>> 6667492 1.92242 STDIN compchem015 r 06/10/2011 06:13:30 MASTER
>>> 6667493 1.92241 STDIN compchem015 r 06/10/2011 06:13:41 MASTER
>>> 6667494 1.92241 STDIN compchem015 r 06/10/2011 06:13:47 MASTER
>>> 6667495 1.92241 STDIN compchem015 r 06/10/2011 06:13:57 MASTER
>>> GRID_small BP 0/0/24
>>> small BPC 0/10/24
>>> 6652641 11.27961 p1761-7 csebdmfa r 06/10/2011 06:14:01 MASTER
>>> 6655259 10.43999 p577-16 csebdmfa r 06/10/2011 06:12:26 MASTER
>>> 6667942 3.93900 AuLJ139 csmyslfs r 06/10/2011 06:12:46 MASTER
>>> SLAVE
>>> SLAVE
>>> SLAVE
>>> SLAVE
>>> SLAVE
>>> SLAVE
>>> SLAVE
>>> SLAVE
>>> g0-mem_small BPC 0/0/24
>>> offline BP 0/0/24
>>> 
>>> 
>>> Thanks in advance,
>>> Javier
>>> 
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users


