I was about to ask a similar question; we have the same sort of setup - high, medium and low priority queues - and run into the same problem. It doesn't happen all the time, but occasionally a job will simply sit there suspended although it should have received a SIGCONT.

Tina

On 01/03/14 12:23, Reuti wrote:
Hi,

On 28.02.2014 at 00:28, Andrew Joplin wrote:

New member here with a couple questions - they're unrelated, so I'll make 
separate posts.

First off, we're running grid engine version OGS/GE 2011.11.  I recently 
finished setting up a hierarchy of three queues - high, medium, and low 
priority.  Medium is subordinate to high, and low to medium.  The queues span 
multiple hosts, but are all configured identically except for the subordination 
(and a complex that I use to specify which queue to get into).

For the most part, this works great - I can submit a large number of long jobs to the low priority queue, and 
they get suspended whenever someone else uses the medium priority queue.  But the first problem I'm running 
into is that occasionally, the suspended jobs don't seem to be restarted.  According to qstat, they have been 
(status "r"), but when I check the corresponding process on the execute host, I see a process 
status "T", as if the SIGCONT signal was never sent.  I can manually send a SIGCONT to the job, and 
it finishes processing, but otherwise it does nothing until I notice it (usually next day).  Other times a 
job will show a status "r" in qstat, but I can't even find the process on the host it's supposed to 
be on.

Has anyone seen this behavior before?  I've tried recreating the problem, but I can't 
seem to reliably reproduce it.  It seems to just happen "sometimes" when one of 
my long jobs gets suspended.

What can be done to investigate it: set a custom "resume_method" in the queue 
definition and record whether it was called or not (therein the SIGCONT needs to be 
sent to the complete process group):

kill -CONT -- -$1

and parameter $1 is $job_pid from the pseudo variables for these interfaces.
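
A minimal sketch of such a wrapper, assuming it is saved under a hypothetical path like /path/to/resume_wrapper.sh and set in the queue definition as "resume_method /path/to/resume_wrapper.sh $job_pid". The log location and the function name resume_job are made up for illustration. It uses the "kill -s CONT" spelling, which should behave the same as "kill -CONT" but is accepted by more shells:

```shell
#!/bin/sh
# Hypothetical resume_method wrapper: logs every invocation, then sends
# SIGCONT to the job's whole process group.
# Assumption: the log path below is arbitrary; pick one writable on the exec host.
RESUME_LOG="${RESUME_LOG:-/tmp/sge_resume.log}"

resume_job() {
    pid="$1"
    # Record that the resume method was actually called, and for which PID.
    echo "$(date '+%Y-%m-%d %H:%M:%S') resume_method called for pid $pid" >> "$RESUME_LOG"
    # A negative PID addresses the process group, not just the group leader.
    if kill -s CONT -- "-$pid" 2>/dev/null; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') SIGCONT sent to process group $pid" >> "$RESUME_LOG"
    else
        echo "$(date '+%Y-%m-%d %H:%M:%S') kill failed for process group $pid" >> "$RESUME_LOG"
        return 1
    fi
}

# Only act when a PID argument is supplied (grid engine passes $job_pid as $1).
if [ "$#" -ge 1 ]; then
    resume_job "$1"
fi
```

If a job is later found in state "T" and the log has no entry for its PID, the resume method was never invoked. An entry followed by "kill failed" would instead point at the signal delivery.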

-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users



--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
