Hi,

On 23.10.2013 at 08:59, Arnau Bria wrote:

> In our cluster we use virtual_free and h_vmem as consumable resources
> per job:
>
> # qconf -sc|egrep 'virtual_free|h_vmem|^#'
> #name          shortcut  type    relop  requestable  consumable  default  urgency
> #------------------------------------------------------------------------------------------
> h_vmem         h_vmem    MEMORY  <=     YES          JOB         0        0
> virtual_free   vf        MEMORY  <=     YES          JOB         0        0
>
>
> Yesterday I found a parallel job that asked for 64GB of h_vmem and was
> using more than 100GB of memory, but SGE did not kill it:

More than 100G in total or per slot (as the limit is multiplied)?

> # qstat -j 2098938|grep vmem
> hard resource_list: virtual_free=64G,h_vmem=64G,h_rt=172800
> usage 1: cpu=18:26:24, mem=111455.48587 GBs, io=1735.61545, vmem=196.038G, maxvmem=197.132G

Can you please `grep` the messages file on the executing node for other entries of job "2098938"?

-- Reuti

> The node ran out of memory and killed some processes, and finally we
> killed (qdel) the job:
>
> # grep 2098938 messages
> 10/22/2013 18:20:49|worker|ant-master2|W|job 2098938.1 failed on host YY assumedly after job because: job 2098938.1 died through signal KILL (9)
>
>
> # qacct -j 2098938 -f joao
> ==============================================================
> qname        rg-el6
> hostname     YY
> group        XX
> owner        jcurado
> project      NONE
> department   defaultdepartment
> jobname      ZZ
> jobnumber    2098938
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Tue Oct 22 12:55:58 2013
> start_time   Tue Oct 22 12:59:01 2013
> end_time     Tue Oct 22 18:20:48 2013
> granted_pe   smp
> slots        8
> failed       100 : assumedly after job
> exit_status  137
> ru_wallclock 19307
> ru_utime     0.058
> ru_stime     1.662
> ru_maxrss    5412
> ru_ixrss     0
> ru_ismrss    0
> ru_idrss     0
> ru_isrss     0
> ru_minflt    14819
> ru_majflt    2
> ru_nswap     0
> ru_inblock   967416
> ru_oublock   1298344
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     2324
> ru_nivcsw    15165
> cpu          67178.120
> mem          125116.602
> io           1745.077
> iow          0.000
> maxvmem      197.184G
> arid         undefined
>
> I'm looking for some extra info on node YY, but I find nothing in
> messages.
> That node did kill other jobs because they used more memory than
> requested in h_vmem:
>
> main|YY|W|job 1993603 exceeds job hard limit "h_vmem" of queue "rg-el6@YY" (53771632640.00000 > limit:53687091200.00000) - sending SIGKILL
>
> So why did it not kill that job? How may I start debugging the problem? (I'm
> submitting the exact same job.)
>
>
> TIA,
> Arnau
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
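
The question above about the limit being multiplied can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the enforced h_vmem limit is the requested value times the number of granted slots, which is what the question implies but is not confirmed anywhere in this thread. The helper names `parse_mem` and `effective_limit` are made up for this example and are not part of SGE.

```python
# Sketch: compare reported memory usage against a per-slot-multiplied h_vmem limit.
# Assumption (not confirmed in the thread): enforced limit = requested h_vmem * slots.

GIB = 1024 ** 3

# SGE memory suffixes: lowercase is decimal, uppercase is binary.
UNITS = {
    "k": 1000, "K": 1024,
    "m": 1000 ** 2, "M": 1024 ** 2,
    "g": 1000 ** 3, "G": 1024 ** 3,
}


def parse_mem(value: str) -> int:
    """Convert a memory string such as '64G' or '197.184G' to bytes."""
    suffix = value[-1]
    if suffix in UNITS:
        return int(float(value[:-1]) * UNITS[suffix])
    return int(float(value))


def effective_limit(h_vmem: str, slots: int) -> int:
    """Hypothetical enforced limit if the request is multiplied by the slot count."""
    return parse_mem(h_vmem) * slots


# Values taken from the qstat/qacct output quoted above.
requested = "64G"
slots = 8
maxvmem = parse_mem("197.184G")

limit = effective_limit(requested, slots)
print(f"multiplied limit: {limit / GIB:.0f} GiB")
print(f"maxvmem above the plain request:    {maxvmem > parse_mem(requested)}")
print(f"maxvmem above the multiplied limit: {maxvmem > limit}")

# The limit logged for job 1993603 corresponds to exactly 50 GiB.
print(f"job 1993603 limit: {53687091200 / GIB:.0f} GiB")
```

If that assumption holds, the 8-slot job stayed below its multiplied limit even though it used about three times the 64G that was requested, which would be consistent with SGE not sending SIGKILL.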