Excerpts from Reuti's message of 2013-11-26 19:37:34 +0100:
>
> But the process is also gone from the node, and not in some uninterruptible
> kernel sleep?
>
It is gone.

>
> What's in the script: /scripts/sgeepilog.sh - anything what could hang?
>

Please find it attached. However, the wait does not always return -1 in the
epilog; sometimes it also does so in the main script.

> Are you using -notify and s_rt at the same time? At least for the CPU time I
> spot 36000 as s_cpu which I suggest to remove. It has no direct effect as you
> have a h_cpu in addition anyway. Having -notify and a soft warning at the
> same time could result in a warning for the warning and the job is never
> killed but warned every 90 seconds or so. Maybe something similar is
> happening when you have s_cpu and s_rt being triggered almost at the same
> time.
>

We are not using those two options. This is what the typical qstat output of a
process looks like:

==============================================================
job_number:                 294730
exec_file:                  job_scripts/294730
submission_time:            Tue Nov 19 17:30:06 2013
owner:                      adgipas
uid:                        3155
group:                      20040059
gid:                        3091
sge_o_home:                 /h/adgipas
sge_o_log_name:             adgipas
sge_o_path:                 /home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
sge_o_shell:                /bin/bash
sge_o_workdir:              /home/adgipas/proves_cart
sge_o_host:                 mainnode
account:                    sge
cwd:                        /h/adgipas/proves_cart
reserve:                    y
merge:                      y
hard resource_list:         h_cpu=72000,h_rt=72000,h_vmem=5120M
mail_list:                  [email protected]
notify:                     FALSE
job_name:                   cart_700.standard.triphoneme.train-em-MIX01-ITER1-estimate
jobshare:                   0
shell_list:                 NONE:/bin/bash
env_list:                   PATH=/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts:/h/adgipas/proves_cart/bin:/home/apps/ompi/1.6.4/gnu/bin:/bin:/usr/bin:/home/apps/oge/bin/linux-x64:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/h/adgipas/bin:/home/adgipas/proves_cart/share_tlk/tLtask-train/scripts
script_file:                STDIN
jid_predecessor_list (req): cart_700.standard.triphoneme.train-init
jid_successor_list:         294731
job-array tasks:            1-500:1
usage  334:                 cpu=10:08:02, mem=182410.00000 GBs, io=0.00000, vmem=5.000G, maxvmem=5.000G
scheduling info:            queue instance "[email protected]" dropped because it is disabled
                            queue instance "[email protected]" dropped because it is disabled
-----------------

Another peculiarity of the cluster is that all processes are submitted with
-R y. Could that also cause a problem? I read one of your mails,
http://gridengine.org/pipermail/users/2012-October/005077.html
but I don't think it is related to this problem.

> -- Reuti
>
> > until the process is deleted with "-f".
> >
> > In the <qmaster spool>/messages there are references to this jobs as:
> >
> > 11/25/2013 10:11:41|schedu|mainnode|W|job 312363.9 should have finished
> > since 10483s
> >
> > Do you have any hint of what can be problem?
> >
> > Thanks in advance,
> >
> > --
> > NiCo
> > <trace>_______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users

--
NiCo
