I also did some testing. If you add extra signal handling in all these
scripts (to catch any TERM signal from the OS), it can more or less solve
the issue.
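
For example, something along these lines, just as a rough sketch (the
cleanup body is a placeholder and depends on what the job script actually
starts):

#!/bin/bash
# Rough sketch: trap TERM/INT so the job script can clean up its children
# instead of leaving them behind (cleanup details are placeholders).
cleanup() {
    echo "caught termination signal, cleaning up" >&2
    pkill -TERM -P $$ 2>/dev/null    # signal direct children of this script
    exit 143                         # 128 + SIGTERM
}
trap cleanup TERM INT

# ... the rest of the job script (copy, untar, call the Python step, ...) ...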

On Wed, Aug 8, 2018 at 11:04 PM, Feng Zhang <prod.f...@gmail.com> wrote:

>
> I am guessing it may be very similar to something I have run into before.
> My issue was: one user used a bash script as the batch job script, and in
> it, it called another script (Python), which then called a third script
> (and maybe so on...). For this kind of job, if anything goes wrong, it can
> leave some "defunct" processes behind (especially for parallel jobs). These
> jobs have actually failed, but since the defunct and other processes are
> not properly terminated (sometimes even after the shepherd process has
> already been terminated), SGE still thinks the job is running, and on the
> compute nodes there are still some processes of the failed job running
> (hanging).
>
> What I did to resolve the issue was to write a script that checks those
> jobs; if it finds any defunct processes, it kills the related job
> process(es), which triggers SGE to kill the job automatically.
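>
> Something like the following, as a rough illustration only (a real version
> would map processes back to SGE job IDs and owners much more carefully
> before killing anything):
>
> #!/bin/bash
> # Sketch only: find defunct (zombie) processes and TERM their parents so
> # the sge_shepherd notices the failure and SGE reaps the job.
> ps -eo pid,ppid,stat,user,comm | awk '$3 ~ /Z/' |
> while read -r pid ppid stat user comm; do
>     echo "defunct $comm (pid $pid); killing parent $ppid owned by $user"
>     kill -TERM "$ppid"    # assumption: the parent belongs to the stuck job
> done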
>
>
>
> On Wed, Aug 8, 2018 at 7:26 PM, Derrick Lin <klin...@gmail.com> wrote:
>
>> > What state of the job do you see in this line? Is it just hanging there
>> > and doing nothing? Do they not appear in `top`? And does it never vanish
>> > automatically, so that you have to kill the job by hand?
>>
>> Sorry for the confusion. The job state is "r" according to SGE, but as
>> you mentioned, the qstat output is not tied to any actual process.
>>
>> The line I copied is what is shown in top/htop. So basically, all of his
>> jobs became:
>>
>> `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671
>> `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187677
>> `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187690
>>
>> Each of these scripts copies and untars a file to the local XFS file
>> system, then a Python script is called to operate on the untarred files.
>>
>> The job log shows that the untarring is done, but the Python script never
>> started and the job process is stuck as shown above.
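>>
>> Roughly, the structure is like this (paths and names here are made up,
>> just to illustrate; $TMPDIR is the per-job local scratch directory SGE
>> sets up):
>>
>> #!/bin/bash
>> # Hypothetical structure only; real paths and names differ.
>> cp /shared/input/data.tar.gz "$TMPDIR/"        # copy to node-local XFS
>> tar -xzf "$TMPDIR/data.tar.gz" -C "$TMPDIR"    # the log shows this finishes
>> python process_data.py "$TMPDIR/data"          # this never appears to start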
>>
>> We don't see any storage-related contention.
>>
>> I am more interested in knowing: where does this process
>> bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671 come
>> from?
>>
>> Cheers,
>>
>>
>> On Wed, Aug 8, 2018 at 6:53 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>>
>>>
>>> > On 08.08.2018 at 08:15, Derrick Lin <klin...@gmail.com> wrote:
>>> >
>>> > Hi guys,
>>> >
>>> > I have a user who reported that his jobs were stuck running for much
>>> > longer than usual.
>>> >
>>> > So I went to the exec host and checked the processes; all processes
>>> > owned by that user look like:
>>> >
>>> > `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671
>>>
>>> What state of the job do you see in this line? Is it just hanging there
>>> and doing nothing? Do they not appear in `top`? And does it never vanish
>>> automatically, so that you have to kill the job by hand?
>>>
>>>
>>> > In qstat, it still shows the job is in the running state.
>>>
>>> The `qstat` output is not really related to any running process. It's
>>> just what SGE granted and thinks is running, or is granted to run.
>>> Especially with parallel jobs across nodes, there might or might not be
>>> any process on one of the granted slave nodes.
>>>
>>>
>>> > The user resubmitted the jobs and they ran and completed without any
>>> > problem.
>>>
>>> Could it be a race condition with the shared file system?
>>>
>>> -- Reuti
>>>
>>>
>>> > I am wondering what may have caused this situation in general.
>>> >
>>> > Cheers,
>>> > Derrick
>>>
>>>
>>
>>
>>
>
>
> --
> Best,
>
> Feng
>



-- 
Best,

Feng
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
