On 03.03.2014 at 12:59, Tina Friedrich wrote:

> I was about to ask a similar question; we have the same sort of setup - high, 
> medium and low priority queues - and run into the same problem. Doesn't 
> happen all the time, but occasionally a job will simply still sit there 
> suspended although it should've gotten a SIGCONT.

Were the signals recorded as being sent? Like:

03/03/2014 13:44:52|  main|pc15370|I|SIGNAL jid: 11254 jatask: 1 signal: STOP
03/03/2014 13:44:57|  main|pc15370|I|SIGNAL jid: 11254 jatask: 1 signal: CONT
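
To check this for one job, something like the following could be used. It is only a sketch: the function name is made up, and it assumes the SIGNAL lines end up in the execd "messages" file of the host in question, whose location depends on the local spool setup.

```shell
#!/bin/sh
# Hypothetical helper: print every SIGNAL line recorded for one job id.
# $1 = job id, $2 = path to the messages file to search (site-specific).
signals_for_job() {
    jid="$1"
    messages="$2"
    # -F: fixed string, so "|" is literal; the trailing " jatask:" keeps
    # job 11254 from also matching 112540.
    grep -F -- "|SIGNAL jid: ${jid} jatask:" "$messages"
}

# Example call (the path is an assumption, adjust it to your spool directory):
# signals_for_job 11254 "$SGE_ROOT/$SGE_CELL/spool/$(hostname -s)/messages"
```

A STOP without a later CONT for the same jid and jatask would suggest the resume was never issued; a matching pair would point at the signal delivery on the host instead.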

-- Reuti


> Tina
> 
> On 01/03/14 12:23, Reuti wrote:
>> Hi,
>> 
>> On 28.02.2014 at 00:28, Andrew Joplin wrote:
>> 
>>> New member here with a couple of questions - they're unrelated, so I'll make 
>>> separate posts.
>>> 
>>> First off, we're running grid engine version OGS/GE 2011.11.  I recently 
>>> finished setting up a hierarchy of three queues - high, medium, and low 
>>> priority.  Medium is subordinate to high, and low to medium.  The queues 
>>> span multiple hosts, but are all configured identically except for the 
>>> subordination (and a complex that I use to specify which queue to get into).
>>> 
>>> For the most part, this works great - I can submit a large number of long 
>>> jobs to the low priority queue, and they get suspended whenever someone 
>>> else uses the medium priority queue.  But the first problem I'm running 
>>> into is that occasionally, the suspended jobs don't seem to be restarted.  
>>> According to qstat, they have been (status "r"), but when I check the 
>>> corresponding process on the execute host, I see a process status "T", as 
>>> if the SIGCONT signal was never sent.  I can manually send a SIGCONT to the 
>>> job, and it finishes processing, but otherwise it does nothing until I 
>>> notice it (usually next day).  Other times a job will show a status "r" in 
>>> qstat, but I can't even find the process on the host it's supposed to be on.
>>> 
>>> Has anyone seen this behavior before?  I've tried recreating the problem, 
>>> but I can't seem to reliably reproduce it.  It seems to just happen 
>>> "sometimes" when one of my long jobs gets suspended.
>> 
>> What can be done to investigate it: set a custom "resume_method" in the 
>> queue definition and record whether it was called or not. Therein the 
>> SIGCONT needs to be sent to the complete process group:
>> 
>> kill -CONT -- -$1
>> 
>> and parameter $1 is $job_pid from the pseudo variables for these interfaces.
>> 
>> -- Reuti
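
A minimal sketch of such a logging resume_method follows. The script path, the function name and the log file location are assumptions, not taken from the posts above. It records every call and then sends SIGCONT to the whole process group by giving kill the negated PID:

```shell
#!/bin/sh
# Hypothetical resume wrapper: log the call, then continue the process group.
# $1 is expected to be $job_pid as passed in by the queue's resume_method.
resume_job() {
    pid="$1"
    # Log location is an assumption; override it with RESUME_LOG.
    log="${RESUME_LOG:-/tmp/sge_resume.log}"
    now=$(date '+%Y-%m-%d %H:%M:%S')

    # Refuse anything that is not a plain number.
    case "$pid" in
        ''|*[!0-9]*)
            printf '%s invalid pid=%s\n' "$now" "$pid" >> "$log"
            return 2
            ;;
    esac

    printf '%s resume_method called pid=%s\n' "$now" "$pid" >> "$log"
    # A negative PID addresses the process group; "--" ends option parsing.
    kill -CONT -- "-$pid"
    rc=$?
    printf '%s kill -CONT pid=%s exit=%s\n' "$now" "$pid" "$rc" >> "$log"
    return "$rc"
}

# Run only when called with an argument, e.g. from the queue definition.
if [ "$#" -gt 0 ]; then
    resume_job "$1"
fi
```

In the queue definition this might be referenced as "resume_method /path/to/resume_logged.sh $job_pid", where the path is a placeholder. If a job is stuck in state "T" and the log has no entry for its PID, the method was never called; if there is an entry with a non-zero exit, the signal was attempted but not delivered.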
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>> 
> 
> 
> -- 
> Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
> Diamond House, Harwell Science and Innovation Campus - 01235 77 8442

