> On 08.08.2018 at 08:15, Derrick Lin <klin...@gmail.com> wrote:
> 
> Hi guys,
> 
> I have a user who reported that his jobs were stuck running for much longer
> than usual.
> 
> So I went to the exec host and checked the processes; all processes owned by
> that user look like:
> 
> `- -bash /opt/gridengine/default/spool/omega-6-20/job_scripts/1187671

Which state do you see for the job in this line? Is it just hanging there and 
doing nothing? Do the processes not appear in `top`? And does the job never 
vanish on its own, so that you have to kill it by hand?
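To make the question about the process state concrete, here is a minimal sketch of how one could list the state of every process owned by the user on the exec host. It assumes a Linux node with the procps `ps`; the function names `describe_state` and `user_process_states` are made up for this illustration and are not part of SGE. The state letters are the standard ones printed in the STAT column of `ps`.

```python
import subprocess

# Meaning of the first character of the STAT column printed by Linux `ps`.
STATE_MEANINGS = {
    "R": "running or runnable",
    "S": "interruptible sleep (waiting for an event)",
    "D": "uninterruptible sleep (usually I/O, e.g. a stalled shared file system)",
    "T": "stopped by a job control signal",
    "t": "stopped by a debugger",
    "Z": "zombie (terminated, not yet reaped by its parent)",
    "I": "idle kernel thread",
}


def describe_state(stat):
    """Translate a ps STAT value such as 'Ss' or 'D+' into plain words."""
    if not stat:
        return "unknown"
    # Only the first character is the state; the rest are modifier flags.
    return STATE_MEANINGS.get(stat[0], "unknown")


def user_process_states(user):
    """Return (pid, stat, meaning, command) for each process owned by `user`."""
    # `name=` suppresses the column header; ps exits non-zero when the user
    # has no processes, so the return code is deliberately not checked.
    result = subprocess.run(
        ["ps", "-u", user, "-o", "pid=", "-o", "stat=", "-o", "args="],
        capture_output=True, text=True, check=False,
    )
    rows = []
    for line in result.stdout.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue
        command = parts[2] if len(parts) > 2 else ""
        rows.append((int(parts[0]), parts[1], describe_state(parts[1]), command))
    return rows
```

Calling `user_process_states("someuser")` (with a placeholder user name) on the exec host and looking for entries in state `D` would be one way to tell a job that is blocked on I/O from one that is merely sleeping.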


> In qstat, the job still shows as being in the running state.

The `qstat` output is not really tied to any running process. It only reflects 
what SGE granted and therefore believes to be running or allowed to run. 
Especially with parallel jobs across nodes, there might or might not be any 
process on one of the granted slave nodes.


> The user resubmitted the jobs and they ran and completed without any problem.

Could it be a race condition with the shared file system?

-- Reuti


> I am wondering what may have caused this situation in general?
> 
> Cheers,
> Derrick
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

