I'll see if I can find anything in the logs. Good point. Our assumption
so far was that SGE sent the signal (as the job is marked 'running'
again) and that it somehow didn't reach the right process. I was
about to start looking into it in more detail, as it is becoming more and
more of a problem (likely just more noticeable with higher cluster load).
Tina
On 03/03/14 12:46, Reuti wrote:
Am 03.03.2014 um 12:59 schrieb Tina Friedrich:
I was about to ask a similar question; we have the same sort of setup - high,
medium and low priority queues - and run into the same problem. Doesn't happen
all the time, but occasionally a job will simply sit there suspended
although it should have received a SIGCONT.
Were the signals recorded as being sent? Like:
03/03/2014 13:44:52| main|pc15370|I|SIGNAL jid: 11254 jatask: 1 signal: STOP
03/03/2014 13:44:57| main|pc15370|I|SIGNAL jid: 11254 jatask: 1 signal: CONT
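To check this for one particular job, something like the following could be used. It is only a sketch: the function name signals_for_job is made up here, and the path to the messages file is passed in as an argument because its location depends on how spooling is set up on the cluster.

```shell
#!/bin/sh
# Sketch: print the SIGNAL lines recorded for one job id.
# signals_for_job is a hypothetical helper name; the messages file path
# is site-specific, so the caller has to supply it.
signals_for_job() {
    jid=$1
    messages_file=$2
    # Match "SIGNAL jid: <id> " entries in the format shown above.
    # The trailing space keeps job 11254 from also matching 112540.
    grep -F -- "SIGNAL jid: $jid " "$messages_file"
}
```

Called as `signals_for_job 11254 /path/to/messages`, a STOP entry with no later CONT entry would suggest the resume was never issued, while a STOP/CONT pair would point at the signal not reaching the process.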
-- Reuti
Tina
On 01/03/14 12:23, Reuti wrote:
Hi,
Am 28.02.2014 um 00:28 schrieb Andrew Joplin:
New member here with a couple questions - they're unrelated, so I'll make
separate posts.
First off, we're running Grid Engine version OGS/GE 2011.11. I recently
finished setting up a hierarchy of three queues - high, medium, and low
priority. Medium is subordinate to high, and low to medium. The queues span
multiple hosts, but are all configured identically except for the subordination
(and a complex that I use to specify which queue to get into).
For the most part, this works great - I can submit a large number of long jobs to the low priority queue, and
they get suspended whenever someone else uses the medium priority queue. But the first problem I'm running
into is that occasionally, the suspended jobs don't seem to be restarted. According to qstat, they have been
(status "r"), but when I check the corresponding process on the execute host, I see a process
status "T", as if the SIGCONT signal was never sent. I can manually send a SIGCONT to the job, and
it finishes processing, but otherwise it does nothing until I notice it (usually next day). Other times a
job will show a status "r" in qstat, but I can't even find the process on the host it's supposed to
be on.
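For reference, the check on the execute host could look like the sketch below. It assumes a procps-style ps (as found on Linux), and stopped_pids is a name made up for this example.

```shell
#!/bin/sh
# Sketch: list the PIDs of processes in the stopped state on this host.
# Assumes a ps that accepts -e and -o (procps on Linux); stopped_pids is
# a hypothetical helper name.
stopped_pids() {
    # Empty headers (pid=, stat=) suppress the header line.
    # A state beginning with "T" means stopped by a job control signal.
    ps -e -o pid= -o stat= | awk '$2 ~ /^T/ { print $1 }'
}
```

Any PID printed here that belongs to a job shown as "r" in qstat matches the situation described above.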
Has anyone seen this behavior before? I've tried recreating the problem, but I can't
seem to reliably reproduce it. It seems to just happen "sometimes" when one of
my long jobs gets suspended.
One way to investigate it: set a custom "resume_method" in the queue
definition and record whether it was called or not. In that script, the SIGCONT needs to be
sent to the complete process group, which means passing a negative PID:
kill -CONT -- -$1
where parameter $1 is $job_pid from the pseudo variables for these interfaces.
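The suggestion above could be sketched as a small wrapper script. This is an illustration, not a tested production method: the script path, the log location and the names resume_job and RESUME_LOG are assumptions made for this example. Only resume_method and the $job_pid pseudo variable come from the queue configuration itself.

```shell
#!/bin/sh
# Sketch of a resume_method wrapper. Assumed queue setting (path is hypothetical):
#   resume_method  /path/to/resume_wrapper.sh $job_pid
# resume_job and RESUME_LOG are names made up for this example.
resume_job() {
    pid=$1
    log=${RESUME_LOG:-/tmp/sge_resume.log}
    # Record that the method was called at all.
    echo "$(date '+%Y-%m-%d %H:%M:%S') resume called pid=$pid" >> "$log"
    # A negative PID addresses the whole process group, not just one process.
    kill -s CONT -- "-$pid"
    rc=$?
    # Record whether kill itself reported success.
    echo "$(date '+%Y-%m-%d %H:%M:%S') kill finished pid=$pid rc=$rc" >> "$log"
    return "$rc"
}

# Only act when called with an argument, as Grid Engine would do.
if [ "$#" -ge 1 ]; then
    resume_job "$1"
fi
```

If a job is later found in state "T" and the log has no entry for its PID, the method was never invoked. An entry with a non-zero rc would instead point at the signal delivery.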
-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
--
This e-mail and any attachments may contain confidential, copyright and or
privileged material, and are for the use of the intended addressee only. If you
are not the intended addressee or an authorised recipient of the addressee
please notify us of receipt by returning the e-mail and do not use, copy,
retain, distribute or disclose the information in or attached to the e-mail.
Any opinions expressed within this e-mail are those of the individual and not
necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot
guarantee that this e-mail or any attachments are free from viruses and we
cannot accept liability for any damage which you may sustain as a result of
software viruses which may be transmitted in or with the message.
Diamond Light Source Limited (company no. 4375679). Registered in England and
Wales with its registered office at Diamond House, Harwell Science and
Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom