Hi,

On 26.11.2013 at 18:05, Nicolás Serrano Martínez-Santos wrote:

> We are having a problem as described in the subject on gridengine 2011.11.
> Some processes finish their execution but they still appear as running in the
> queue, and they keep consuming their slot. I have been looking for the source
> of this problem and this is what I have found so far:
>
> In the execution host that executed this process, there is no shepherd for
> this process and the trace file (which is deleted unless you set exec_params
> keep_active=true) in the <host_spool>/active_jobs/<jobid> is like the one I
> have attached. The only common thing I have found is that there is a
>
> wait3 returned -1
>
> in the trace file that sets some kill command to be performed. As shown in
> the trace the process finishes "correctly" but the <host_spool>/messages
> start showing:

But is the process also gone from the node, and not stuck in some uninterruptible kernel sleep?

> 11/25/2013 07:16:00| main|xxx012|W|job 312363.9 exceeded hard wallclock time - initiate terminate method
> 11/25/2013 07:16:00| main|xxx012|W|failed to deliver signal 20 to job 312363.9 for KILL (shepherd with pid 18734): No such file or directory

What's in the script /scripts/sgeepilog.sh? Is there anything that could hang?

Are you using -notify and s_rt at the same time? At least for the CPU time I spot 36000 as s_cpu, which I suggest removing. It has no direct effect, as you have an h_cpu in addition anyway. Having -notify and a soft warning at the same time could result in a warning for the warning, so that the job is never killed but warned every 90 seconds or so. Maybe something similar is happening when s_cpu and s_rt are triggered almost at the same time.

-- Reuti

> until the process is deleted with "-f".
>
> In the <qmaster spool>/messages there are references to this job as:
>
> 11/25/2013 10:11:41|schedu|mainnode|W|job 312363.9 should have finished since 10483s
>
> Do you have any hint of what can be the problem?
>
> Thanks in advance,
>
> --
> NiCo
> <trace>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
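
To get an overview of which jobs are affected, the qmaster messages file could be scanned for the "should have finished since" warning quoted above. The following is only a rough Python sketch: the regular expression is derived from the single log line shown in this thread, so the field layout on other versions may differ, and the names `OverdueJob` and `find_overdue_jobs` are made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple


class OverdueJob(NamedTuple):
    """One 'should have finished since' warning (hypothetical helper type)."""
    timestamp: str
    host: str
    job_id: int
    task_id: int
    overdue_seconds: int


# Assumed layout, taken from the one line quoted in this thread:
# date time|thread|host|level|job <id>.<task> should have finished since <n>s
_OVERDUE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|"
    r"\s*(?P<thread>[^|]+)\|(?P<host>[^|]+)\|(?P<level>[A-Z])\|"
    r"job (?P<job>\d+)\.(?P<task>\d+) should have finished since (?P<secs>\d+)s"
)


def find_overdue_jobs(lines: Iterable[str]) -> List[OverdueJob]:
    """Return the most recent overdue warning per job task, longest overdue first."""
    latest = {}
    for line in lines:
        match = _OVERDUE.match(line.strip())
        if match is None:
            continue  # any other kind of message is ignored
        key = (int(match["job"]), int(match["task"]))
        # Later lines overwrite earlier ones, so the last warning wins.
        latest[key] = OverdueJob(
            match["ts"], match["host"], key[0], key[1], int(match["secs"])
        )
    return sorted(latest.values(), key=lambda j: j.overdue_seconds, reverse=True)


if __name__ == "__main__":
    sample = [
        "11/25/2013 10:11:41|schedu|mainnode|W|job 312363.9 should have finished since 10483s",
    ]
    for job in find_overdue_jobs(sample):
        print(f"{job.job_id}.{job.task_id} overdue by {job.overdue_seconds}s")
```

In practice the lines would come from the file under the qmaster spool directory, e.g. by passing an open file object to `find_overdue_jobs`. The resulting job and task ids can then be compared with what is actually still running on the exec host.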
