On 14.06.2011 at 09:45, Javier Lopez Cacheiro wrote:

> Hi Reuti,
>
> On 13/06/11 15:30, Reuti wrote:
>> Hi,
>>
>> On 13.06.2011 at 15:12, Javier Lopez Cacheiro wrote:
>>
>>> We have found a strange situation where GE 6.2u5 has allocated more
>>> resources in a node than available, leaving a consumable with a value lower
>>> than 0 (in this case the consumable is num_proc).
>>>
>>> This is somewhat similar to an issue that was found some time ago in SGE 6.2
>>> (issue 2091), but in that case it was related to MPI jobs with the fillup
>>> allocation rule, and it was already solved in 6.2u3.
>>>
>>> This case is somewhat different because it affects a non-MPI job rather
>>> than MPI jobs, and it occurs only under circumstances that are still not
>>> clear.
>>>
>>> In this case, at 06:13:57 the node already had 7 jobs running, consuming
>>> 24 units of num_proc. num_proc is configured as a consumable with a value
>>> of 24, so at that time the value of num_proc was 0. But 4 seconds later,
>>> at 06:14:01, a new job that requested 24 num_proc was started on the node,
>>> leaving the node with a value of -24 for num_proc.
>> num_proc is a (fixed) feature of a node and shouldn't be made consumable. Is
>> there any reason why you don't use slots?
>>
> num_proc is used for historical reasons; I am not sure why slots was not
> chosen instead.
>
> In the other case where we found num_proc < 0, we also did some tests using
> a new complex instead of num_proc, with the same results.
>
> In this case it is difficult to reproduce the problem using a new complex
> because it has been an uncommon situation and it is not clear what the
> circumstances were that led to it.
>
> For example, it is quite strange that all the jobs entered the node in a
> period shorter than 1 minute.
> The only warnings that appear in the log of the node at that time are
> related to core binding:
>
> 06/10/2011 06:12:35| main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!
> 06/10/2011 06:13:01| main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!
> 06/10/2011 06:13:30| main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!
> 06/10/2011 06:13:41| main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!
> 06/10/2011 06:13:54| main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!
> 06/10/2011 06:13:57| main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!
> 06/10/2011 06:14:01| main|compute-5-11|W|Core binding: Couldn't determine core binding string for config file!

These can safely be ignored.

>> Nevertheless: do you request anything else with the -l option?
> Yes, several other complexes are also requested: h_fsize, s_vmem and s_rt

Then it looks like the issue I posted, although I referred more to limits.

> I cannot tell now if the other consumable complexes (h_fsize

You made h_fsize consumable? It's a limit per process, so the total amount
can be bypassed by several processes of the same job anyway.

> and s_vmem)

I think that this doesn't need to be consumable, as you made h_vmem
consumable already. It tells SGE when to send the SIGXCPU warning.

-- Reuti

> also had negative values, but I guess not, because disk and memory
> consumption in the node was far below the available resources.
>
> Cheers,
> Javier
>> -- Reuti
>>
>>
>>> I don't know if anyone else has come across this same problem with 6.2u5
>>> and if there is a workaround for it.
>>>
>>> [jlopez@svgd ~]$ qhost -q -j -h c5-11
>>> HOSTNAME      ARCH    NCPU   LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
>>> -------------------------------------------------------------------------------
>>> global        -          -      -       -       -       -       -
>>> compute-5-11  x86_64   -24  47.92   31.5G    9.0G    8.0G     0.0
>>>    GRID_large    BP    0/4/24
>>>       6667492  1.92242 STDIN    compchem015  r  06/10/2011 06:13:30  MASTER
>>>       6667493  1.92241 STDIN    compchem015  r  06/10/2011 06:13:41  MASTER
>>>       6667494  1.92241 STDIN    compchem015  r  06/10/2011 06:13:47  MASTER
>>>       6667495  1.92241 STDIN    compchem015  r  06/10/2011 06:13:57  MASTER
>>>    GRID_small    BP    0/0/24
>>>    small         BPC   0/10/24
>>>       6652641 11.27961 p1761-7  csebdmfa     r  06/10/2011 06:14:01  MASTER
>>>       6655259 10.43999 p577-16  csebdmfa     r  06/10/2011 06:12:26  MASTER
>>>       6667942  3.93900 AuLJ139  csmyslfs     r  06/10/2011 06:12:46  MASTER
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>                                                                      SLAVE
>>>    g0-mem_small  BPC   0/0/24
>>>    offline       BP    0/0/24
>>>
>>>
>>> Thanks in advance,
>>> Javier
>>>
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
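
For readers following the arithmetic in Javier's report, here is a minimal Python sketch of how a consumable is expected to be debited and how the reported -24 arises when a booking bypasses the capacity check. This is a toy model, not Grid Engine code: the class `HostConsumable`, its `book` method and the `enforce` flag are hypothetical names made up for illustration, and the seven running jobs are modelled as one aggregate booking because the thread does not give the per-job requests.

```python
from dataclasses import dataclass, field


@dataclass
class HostConsumable:
    """Toy model of one consumable complex on one host (hypothetical)."""
    name: str
    capacity: int
    booked: list = field(default_factory=list)  # (label, amount) pairs

    @property
    def remaining(self) -> int:
        # Capacity minus everything debited so far.
        return self.capacity - sum(amount for _, amount in self.booked)

    def book(self, label: str, amount: int, enforce: bool = True) -> bool:
        """Debit `amount`; with enforce=True, refuse if not enough is left."""
        if enforce and amount > self.remaining:
            return False
        self.booked.append((label, amount))
        return True


# Expected behaviour: num_proc=24, fully consumed by the running jobs.
node = HostConsumable("num_proc", capacity=24)
node.book("seven-running-jobs", 24)   # aggregate of the 7 jobs in the report
after_seven = node.remaining          # value at 06:13:57 per the report
accepted = node.book("new-job", 24)   # a further 24 should be refused

# Observed behaviour: the second booking goes through without the check.
overbooked = HostConsumable("num_proc", capacity=24)
overbooked.book("seven-running-jobs", 24)
overbooked.book("new-job", 24, enforce=False)

print(after_seven, accepted, overbooked.remaining)
```

The second half only reproduces the numbers from the report. It says nothing about why the scheduler skipped the check in 6.2u5, which the thread leaves open.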
