Hi,

On 26.11.2013 at 18:05, Nicolás Serrano Martínez-Santos wrote:

> We are having a problem as described in the subject on gridengine 2011.11. 
> Some processes finish their execution but they still appear as running in the
> queue, and they keep consuming their slot. I have been looking for the source
> of this problem and this is what I have found so far:
> 
> On the execution host that ran this process, there is no shepherd for this
> process, and the trace file (which is deleted unless you set execd_params
> KEEP_ACTIVE=true) in <host_spool>/active_jobs/<jobid> is like the one I have
> attached. The only common thing I have found is that there is a
> 
> wait3 returned -1
> 
> in the trace file, which triggers a kill command. As shown in the trace, the
> process finishes "correctly", but <host_spool>/messages starts showing:

But the process is also gone from the node, and not in some uninterruptible 
kernel sleep?
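
One way to check this on the node is to read the process state from /proc. The following is only a minimal sketch, assuming a Linux execution host; the function names are made up for illustration and are not part of Grid Engine.

```python
from pathlib import Path
from typing import Optional


def parse_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def process_state(pid: int) -> Optional[str]:
    """Return the state letter of pid, or None if the process is gone."""
    try:
        return parse_state(Path(f"/proc/{pid}/stat").read_text())
    except (FileNotFoundError, ProcessLookupError):
        # No /proc entry (any more): the process has really exited.
        return None
```

Calling process_state() with the pid of the job's process (or of its shepherd) returns "D" for uninterruptible sleep, "Z" for a zombie, and None when the process is really gone.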

> 11/25/2013 07:16:00|  main|xxx012|W|job 312363.9 exceeded hard wallclock time - initiate terminate method
> 11/25/2013 07:16:00|  main|xxx012|W|failed to deliver signal 20 to job 312363.9 for KILL (shepherd with pid 18734): No such file or directory

What's in the script /scripts/sgeepilog.sh? Anything that could hang?

Are you using -notify and s_rt at the same time? For the CPU time, at least, I
spot an s_cpu of 36000, which I suggest removing. It has no direct effect, as
you have an h_cpu in addition anyway. Having -notify and a soft limit at the
same time could result in a warning for the warning, so the job is never killed
but only warned every 90 seconds or so. Maybe something similar is happening
when s_cpu and s_rt are triggered at almost the same time.
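
If such repeated warnings do end up in <host_spool>/messages, the interval between them can be checked with a small script. This is only a sketch: the pipe-separated layout (timestamp, thread, host, level, message) is inferred from the lines you quoted, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches "job 312363.9" as well as a plain "job 312363" without a task id.
JOB_RE = re.compile(r"\bjob (\d+(?:\.\d+)?)\b")


def parse_line(line):
    """Split one messages line into (time, thread, host, level, text)."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the assumed five-field layout
    stamp, thread, host, level, text = (p.strip() for p in parts)
    try:
        when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return when, thread, host, level, text


def warning_gaps(lines):
    """Map job id -> seconds between consecutive warnings for that job."""
    stamps = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None or parsed[3] != "W":
            continue
        match = JOB_RE.search(parsed[4])
        if match:
            stamps[match.group(1)].append(parsed[0])
    gaps = {}
    for job, times in stamps.items():
        times.sort()
        gaps[job] = [int((b - a).total_seconds())
                     for a, b in zip(times, times[1:])]
    return gaps
```

Feeding it the open messages file, e.g. warning_gaps(open(path)), gives a list of gaps per job; a steady run of similar values would point to a job that is being warned again and again instead of being killed.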

-- Reuti


> until the process is deleted with "-f".
> 
> In <qmaster spool>/messages there are references to this job such as:
> 
> 11/25/2013 10:11:41|schedu|mainnode|W|job 312363.9 should have finished since 10483s
> 
> Do you have any hint as to what the problem could be?
> 
> Thanks in advance,
> 
> -- 
> NiCo
> <trace>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

