Hi,

On 28.02.2014 at 00:28, Andrew Joplin wrote:

> New member here with a couple questions - they're unrelated, so I'll make 
> separate posts.
> 
> First off, we're running grid engine version OGS/GE 2011.11.  I recently 
> finished setting up a hierarchy of three queues - high, medium, and low 
> priority.  Medium is subordinate to high, and low to medium.  The queues span 
> multiple hosts, but are all configured identically except for the 
> subordination (and a complex that I use to specify which queue to get into).
> 
> For the most part, this works great - I can submit a large number of long 
> jobs to the low priority queue, and they get suspended whenever someone else 
> uses the medium priority queue.  But the first problem I'm running into is 
> that occasionally, the suspended jobs don't seem to be restarted.  According 
> to qstat, they have been (status "r"), but when I check the corresponding 
> process on the execute host, I see a process status "T", as if the SIGCONT 
> signal was never sent.  I can manually send a SIGCONT to the job, and it 
> finishes processing, but otherwise it does nothing until I notice it (usually 
> next day).  Other times a job will show a status "r" in qstat, but I can't 
> even find the process on the host it's supposed to be on.
> 
> Has anyone seen this behavior before?  I've tried recreating the problem, but 
> I can't seem to reliably reproduce it.  It seems to just happen "sometimes" 
> when one of my long jobs gets suspended.

One way to investigate this: set a custom "resume_method" in the queue 
definition and have it record whether it was called or not. Inside that 
method, the SIGCONT needs to be sent to the complete process group:

kill -CONT -- -$1

where the parameter $1 is $job_pid, one of the pseudo variables available to 
these interfaces.
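
A minimal sketch of such a resume script follows. The script path, the log 
file location, the RESUME_LOG variable and the extra $job_id argument are 
only assumptions for illustration; adjust them to your installation. It 
appends a line to a log each time it is invoked, then sends SIGCONT to the 
whole process group by negating the PID. "kill -s CONT" is used here as the 
portable spelling of "kill -CONT".

```shell
#!/bin/sh
# Hypothetical resume wrapper. Assumed queue setting (path is illustrative):
#   resume_method  /usr/local/sge/resume_logged.sh $job_pid $job_id

log_and_resume() {
    pid="$1"                     # $job_pid as passed by the queue setting
    job_id="${2:-unknown}"       # optional $job_id, used for logging only
    # Assumed log location; override with the RESUME_LOG environment variable.
    logfile="${RESUME_LOG:-/tmp/sge_resume.log}"
    stamp=$(date '+%Y-%m-%d %H:%M:%S')

    # Record that the resume method was actually invoked.
    printf '%s called: job_id=%s job_pid=%s\n' "$stamp" "$job_id" "$pid" >> "$logfile"

    # A negative PID addresses the complete process group, not one process.
    if kill -s CONT -- "-$pid" 2>/dev/null; then
        printf '%s SIGCONT sent to process group %s\n' "$stamp" "$pid" >> "$logfile"
    else
        printf '%s FAILED to signal process group %s\n' "$stamp" "$pid" >> "$logfile"
        return 1
    fi
}

# Run only when arguments were given, i.e. when invoked as the resume method.
if [ "$#" -ge 1 ]; then
    log_and_resume "$@"
fi
```

If a job is shown as "r" in qstat but is still in state "T" on the host, the 
log then tells you whether the resume method was never called for that job 
or whether it was called and the signal did not arrive.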

-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users