Hi,

Am 23.10.2013 um 08:59 schrieb Arnau Bria:

> In our cluster we use virtual_free and h_vmem as consumable resources
> per job:
> 
> # qconf -sc|egrep 'virtual_free|h_vmem|^#'
> #name               shortcut     type        relop requestable consumable default  urgency
> #------------------------------------------------------------------------------------------
> h_vmem              h_vmem       MEMORY      <=    YES         JOB        0        0
> virtual_free        vf           MEMORY      <=    YES         JOB        0        0
> 
> 
> Yesterday I found a parallel job that requested 64GB of h_vmem but was
> using more than 100GB of memory, and SGE did not kill it:

More than 100G in total or per slot (as the limit is multiplied)?
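To make the question concrete, here is a rough sketch of the arithmetic, using the numbers from the quoted output (h_vmem=64G, 8 slots, maxvmem of about 197G). It assumes the enforced limit really is the request multiplied by the slot count. Whether that holds for a complex defined with consumable JOB is exactly what needs checking, so treat it as a hypothesis, not a diagnosis.

```shell
# Values copied from the quoted qstat/qacct output.
h_vmem_gb=64      # h_vmem request (qstat -j: h_vmem=64G)
slots=8           # granted slots (qacct: slots 8)
maxvmem_gb=197    # reported maxvmem, rounded down from 197.132G

# ASSUMPTION: the enforced limit is the request multiplied by the slot count.
effective_limit_gb=$((h_vmem_gb * slots))
echo "effective limit: ${effective_limit_gb}G"

if [ "$maxvmem_gb" -gt "$effective_limit_gb" ]; then
    echo "over limit: a SIGKILL would be expected"
else
    echo "under limit: no SIGKILL expected"
fi
```

If the limit is multiplied this way, the job stayed well below it, which would explain why it was never killed.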


> # qstat -j 2098938|grep vmem
> hard resource_list:         virtual_free=64G,h_vmem=64G,h_rt=172800
> usage    1:                 cpu=18:26:24, mem=111455.48587 GBs, io=1735.61545, vmem=196.038G, maxvmem=197.132G

Can you please `grep` the messages file on the executing node for other 
entries of job "2098938"?
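
A minimal sketch of that search follows. The helper name `job_entries` is made up for illustration, and the path in the usage comment assumes the classic spool layout of `<execd_spool_dir>/<hostname>/messages`. Your installation may spool elsewhere; `qconf -sconf | grep execd_spool_dir` shows the configured directory.

```shell
# Print every line of a messages file that mentions the given job id.
# job_entries is a hypothetical helper, not part of Grid Engine.
job_entries() {
    job_id=$1
    messages_file=$2
    # -F treats the id as a fixed string rather than a regular expression.
    grep -F -- "$job_id" "$messages_file"
}

# Usage on the execution node (ASSUMPTION: default spool location,
# YY being the placeholder host name used in this thread):
#   job_entries 2098938 "$SGE_ROOT/$SGE_CELL/spool/YY/messages"
```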

-- Reuti


> The node ran out of memory and killed some processes, and finally we
> killed the job with qdel:
> 
> # grep 2098938 messages
> 10/22/2013 18:20:49|worker|ant-master2|W|job 2098938.1 failed on host YY assumedly after job because: job 2098938.1 died through signal KILL (9)
> 
> 
> # qacct -j 2098938 -f joao 
> ==============================================================
> qname        rg-el6              
> hostname     YY
> group        XX
> owner        jcurado             
> project      NONE                
> department   defaultdepartment   
> jobname      ZZ           
> jobnumber    2098938             
> taskid       undefined
> account      sge                 
> priority     0                   
> qsub_time    Tue Oct 22 12:55:58 2013
> start_time   Tue Oct 22 12:59:01 2013
> end_time     Tue Oct 22 18:20:48 2013
> granted_pe   smp                 
> slots        8                   
> failed       100 : assumedly after job
> exit_status  137                 
> ru_wallclock 19307        
> ru_utime     0.058        
> ru_stime     1.662        
> ru_maxrss    5412                
> ru_ixrss     0                   
> ru_ismrss    0                   
> ru_idrss     0                   
> ru_isrss     0                   
> ru_minflt    14819               
> ru_majflt    2                   
> ru_nswap     0                   
> ru_inblock   967416              
> ru_oublock   1298344             
> ru_msgsnd    0                   
> ru_msgrcv    0                   
> ru_nsignals  0                   
> ru_nvcsw     2324                
> ru_nivcsw    15165               
> cpu          67178.120    
> mem          125116.602        
> io           1745.077          
> iow          0.000             
> maxvmem      197.184G
> arid         undefined
> 
> I'm looking for some extra info in node YY, but I find nothing in
> messages.
> That node did kill other jobs because they used more memory than
> requested in h_vmem:
> 
> main|YY|W|job 1993603 exceeds job hard limit "h_vmem" of queue "rg-el6@YY" (53771632640.00000 > limit:53687091200.00000) - sending SIGKILL
> 
> So why did it not kill this job? How can I start debugging the problem? (I'm 
> submitting the exact same job.)
> 
> 
> TIA,
> Arnau
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

