On 26.11.2014 at 11:30, Guillermo Marco Puche wrote:

> On 26/11/14 11:17, Reuti wrote:
>> On 26.11.2014 at 08:23, Guillermo Marco Puche wrote:
>>
>>> On 26/11/14 00:42, Reuti wrote:
>>>> Hi,
>>>>
>>>> On 25.11.2014 at 23:28, Guillermo Marco Puche wrote:
>>>>
>>>>> I'm experiencing a very weird issue. I've no idea how to deal with it.
>>>>> • I've submitted multiple jobs, i.e. job1, job2, job3.
>>>>> • Jobs are running on multiple compute nodes
>>>>> • I've put the jobs on user hold and then rescheduled them
>>>>> • Jobs are now in the hqR state in the SGE job pool (they're supposed
>>>>> to stay there and free their slots and resources on their respective
>>>>> compute nodes)
>>>>> • Compute nodes that previously ran these jobs continue to execute the
>>>>> job processes and consume resources (I can see them with htop on the
>>>>> compute node)
>>>> But they are gone from `qstat` and not listed twice?
>>> Nope, they're listed once in qstat.
>>>>
>>>>> So what's the correct way to pause/restart a job and hold it in the SGE
>>>>> pool without holding resources?
>>>> Are these processes still bound to the execd and the shepherd of SGE or
>>>> did they jump out of the process tree compared to the time when they were
>>>> running initially?
>>> Yes, the processes are still bound to the execd and the shepherd of SGE.
>> Which version of SGE are you using? After issuing `qmod -rj <jobid>` they
>> should be gone of course.
> GE 6.2u5

Can you please set the loglevel in SGE's configuration:

    $ qconf -sconf
    ...
    loglevel log_info

and have a look at the messages file of the node. There should be an entry like:

    $ less /var/spool/sge/mypc/messages
    11/26/2014 19:00:24| main|mypc|I|SIGNAL jid: 11772 jatask: 1 signal: KILL

-- Reuti

> Guillermo.
>>
>> -- Reuti
>>
>>>> Do you use any `trap` inside the job script?
>>> No trap commands.
>>>> -- Reuti
>>> Regards,
>>> Guillermo.
>>> _______________________________________________
>>> users mailing list
>>> https://gridengine.org/mailman/listinfo/users
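
A note on the log check suggested above: `qconf -sconf` only displays the global configuration, so the loglevel itself would be changed with `qconf -mconf`, which opens the configuration in an editor. Once log_info is active, the SIGNAL entries for one job can be pulled out of the messages file with plain grep. The following is only a minimal sketch: the function name `sge_signal_entries` is hypothetical and not part of SGE, and the line format is assumed to match the single example entry shown above.

```shell
#!/bin/sh
# Hypothetical helper (not an SGE command): print the SIGNAL entries
# that were logged for one job id in a node's messages file.
#   $1 = job id, e.g. 11772
#   $2 = path to the messages file (depends on the local spool directory)
sge_signal_entries() {
    jobid=$1
    messages=$2
    # Assumed line format, taken from the example entry:
    #   ...|I|SIGNAL jid: 11772 jatask: 1 signal: KILL
    # The trailing space keeps job 11772 from also matching 117720.
    grep -E "SIGNAL jid: ${jobid} " "$messages"
}
```

With the path from the example it would be called as `sge_signal_entries 11772 /var/spool/sge/mypc/messages`. If nothing is printed after a `qmod -rj <jobid>`, that would suggest no signal was logged for that job on the node, provided the loglevel was already log_info at that time.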
