On 03.03.2014 at 19:28, Andrew Joplin wrote:

> That's a good idea - I'll try that next. But first here's some additional
> info from the qmaster messages file. I'm running one of my longer jobs (with
> qmake, and a couple hundred targets) now, and at the same time someone else
> is running many jobs on the superordinate queue, which is suspending some of
> my jobs (as it should). I'm waiting to see if any are *not* being restarted,
> but in the meantime I'm seeing a lot of these messages in qmaster messages:
>
> [...] job failed on host assumedly after job because: can't read usage file
> for job [...]
>
> and
>
> [...] job failed on host assumedly after job because: job died through signal
> HUP (1)

Wow - this is something I have never seen before, as I'm not aware that SGE will send SIGHUP anywhere by default. Did you redefine the signals that are sent?

-- Reuti

> Finally, I also see this message:
>
> [...] Jobs 3874441 & 3875448 dispatched to master/subordinated queues [...]
> Suspend on subordinate to occur in same scheduling interval. Policy conflict!
>
> The later job is my main qmake job.
>
> Andrew Joplin
>
>
> On 03/01/2014 06:23 AM, Reuti wrote:
>> Hi,
>>
>> On 28.02.2014 at 00:28, Andrew Joplin wrote:
>>
>>> New member here with a couple questions - they're unrelated, so I'll make
>>> separate posts.
>>>
>>> First off, we're running grid engine version OGS/GE 2011.11. I recently
>>> finished setting up a hierarchy of three queues - high, medium, and low
>>> priority. Medium is subordinate to high, and low to medium. The queues
>>> span multiple hosts, but are all configured identically except for the
>>> subordination (and a complex that I use to specify which queue to get into).
>>>
>>> For the most part, this works great - I can submit a large number of long
>>> jobs to the low priority queue, and they get suspended whenever someone
>>> else uses the medium priority queue. But the first problem I'm running
>>> into is that occasionally, the suspended jobs don't seem to be restarted.
>>> According to qstat, they have been (status "r"), but when I check the
>>> corresponding process on the execute host, I see a process status "T", as
>>> if the SIGCONT signal was never sent. I can manually send a SIGCONT to the
>>> job, and it finishes processing, but otherwise it does nothing until I
>>> notice it (usually next day). Other times a job will show a status "r" in
>>> qstat, but I can't even find the process on the host it's supposed to be on.
>>>
>>> Has anyone seen this behavior before? I've tried recreating the problem,
>>> but I can't seem to reliably reproduce it. It seems to just happen
>>> "sometimes" when one of my long jobs gets suspended.
>>
>> What can be done to investigate it: set a custom "resume_method" in the
>> queue definition and record whether it was called or not (therein the
>> SIGCONT needs to be sent to the complete process group, hence the
>> negative PID):
>>
>> kill -CONT -- -$1
>>
>> where parameter $1 is $job_pid from the pseudo variables for these interfaces.
>>
>> -- Reuti
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
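
The logging "resume_method" suggested above could look roughly like the following. This is only a sketch under assumptions: the script name (resume_logger.sh), the function name log_and_resume, the RESUME_LOG variable and the default log path are all made up for illustration and are not part of Grid Engine. It records each call with a timestamp and then sends SIGCONT to the job's process group, using the POSIX `-s` spelling of the same signal.

```shell
#!/bin/sh
# Hypothetical resume_method wrapper: logs every invocation, then sends
# SIGCONT to the job's process group.
# Assumed usage: resume_logger.sh <job_pid> [job_id]

log_and_resume() {
    job_pid="$1"
    job_id="${2:-unknown}"
    # RESUME_LOG and its default path are assumptions; choose a location
    # that is writable on the execution host.
    log_file="${RESUME_LOG:-/tmp/sge_resume.log}"

    printf '%s resume_method called: job_id=%s job_pid=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$job_id" "$job_pid" >> "$log_file"

    # A negative PID addresses the whole process group; "--" keeps the
    # leading minus from being parsed as an option.
    kill -s CONT -- "-$job_pid"
    status=$?

    printf '%s kill exit status=%s for job_pid=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$status" "$job_pid" >> "$log_file"
    return "$status"
}

# Act only when called with arguments, as the queue would do.
if [ "$#" -ge 1 ]; then
    log_and_resume "$@"
fi
```

In the queue definition (edited with `qconf -mq <queue>`), the entry would then read something like `resume_method /path/to/resume_logger.sh $job_pid $job_id`, where the path is a placeholder. If a job sits in state "T" and the log has no matching entry, the resume method was never invoked; if the entry is there with a non-zero kill status, the signal could not be delivered to that process group.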
