Re: [gridengine users] Rescheduled and held job running zombie in compute node

Reuti Wed, 26 Nov 2014 10:06:49 -0800

On 26.11.2014 at 11:30, Guillermo Marco Puche wrote:

> On 26/11/14 11:17, Reuti wrote:
>> On 26.11.2014 at 08:23, Guillermo Marco Puche wrote:
>> 
>>> On 26/11/14 00:42, Reuti wrote:
>>>> Hi,
>>>> 
>>>> On 25.11.2014 at 23:28, Guillermo Marco Puche wrote:
>>>> 
>>>>> I'm experiencing a very weird issue. I've no idea how to deal with it.
>>>>>   • I've submitted multiple jobs, e.g. job1, job2, job3.
>>>>>   • Jobs are running in multiple compute nodes
>>>>>   • I've modified jobs to user hold and then rescheduled
>>>>>   • Jobs are now in a hqR state in SGE job pool (they're supposed to stay 
>>>>> there and free their slots and resources in their respective compute 
>>>>> nodes)
>>>>>   • Compute nodes that previously ran these jobs continue to execute the 
>>>>> job processes and consume resources (I can see them with htop inside 
>>>>> the compute node)
>>>> But they are gone from `qstat` and not listed twice?
>>> Nope, they're listed once in qstat.
>>>> 
>>>>> So what's the correct way to pause/restart a job and hold it on SGE pool 
>>>>> without holding resources?
>>>> Are these processes still bound to the execd and the shepherd of SGE or 
>>>> did they jump out of the process tree compared to the time when they were 
>>>> running initially?
>>> Yes, the processes are still bound to the execd and the shepherd of SGE.
>> Which version of SGE are you using? After issuing `qmod -rj <jobid>` they 
>> should be gone of course.
> GE 6.2u5


Can you please set the loglevel in SGE's configuration:

$ qconf -sconf
...
loglevel                     log_info

and have a look at the messages file of the node. There should be an entry like:

$ less /var/spool/sge/mypc/messages
11/26/2014 19:00:24|  main|mypc|I|SIGNAL jid: 11772 jatask: 1 signal: KILL
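
To check whether the kill signal was logged for one particular job, the messages file can be filtered by job id. This is only a minimal sketch: the function name `sge_job_signals` is made up for illustration, and it assumes the entries look like the sample line above. The path of the messages file depends on your installation.

```shell
# Hypothetical helper (not part of SGE): print the SIGNAL entries that the
# execd logged for one job id in a messages file.
# Usage: sge_job_signals <messages_file> <job_id>
sge_job_signals() {
    messages_file=$1
    job_id=$2
    # Assumed entry format, taken from the sample line in this mail:
    #   ...|I|SIGNAL jid: 11772 jatask: 1 signal: KILL
    # The trailing space after the job id avoids matching longer ids.
    grep -E "\|SIGNAL jid: ${job_id} " "$messages_file"
}
```

For the example above this would be called as `sge_job_signals /var/spool/sge/mypc/messages 11772`. If nothing is printed, no SIGNAL entry for that job was logged in this file.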

-- Reuti


> Guillermo.
>> 
>> -- Reuti
>> 
>>>> Do you use any `trap` inside the job script?
>>> No trap commands.
>>>> -- Reuti
>>> Regards,
>>> Guillermo.
>>> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
