Re: [gridengine users] nodes not accepting jobs...

Reuti Wed, 12 Feb 2014 02:32:05 -0800

Hi,

On 11.02.2014 at 23:56, Stephen Spencer wrote:


> Yes, I can reach the troublesome nodes with "qrsh." 
> I was able to submit (via 'qsub') a simple job directly to one of those nodes.
> 
> Here's the output:
> 
> [root@frame ~]# qalter -w v 133068
> verification: found possible assignment with 8 slots
> [root@frame ~]# qalter -w p 133068
> Job 133068 cannot run in queue "vision.q" because it is not contained in its 
> hard queue list (-q)
> Job 133068 cannot run in queue "cuda.q" because it is not contained in its 
> hard queue list (-q)
> Job 133068 cannot run in PE "orte" because it only offers 0 slots
> verification: no suitable queues

Aha, so the issue is that some parallel jobs are not starting. Unfortunately,
the error message "...because it only offers 0 slots" is often misleading. I
assume the slot settings in the PE and in the queue do not show any restriction.
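
Since the job was submitted with a hard queue list, the "vision.q" and "cuda.q" lines in the `qalter -w p` output are expected and only the PE line is of interest. A minimal sketch of a filter that hides the expected lines follows; `relevant_reasons` is a hypothetical helper name, and it assumes each message arrives on a single line (the wrapping seen above comes from the mail client).

```shell
# Hypothetical helper, not part of SGE: read `qalter -w p` output on stdin and
# drop the messages that only restate the job's hard queue list (-q).
# Assumes each message is on one line.
relevant_reasons() {
    grep -v 'not contained in its hard queue list'
}
```

With the SGE client tools installed, it could be used as `qalter -w p 133068 | relevant_reasons`.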

- Were any resources, such as memory, requested for the job?
- Are any RQS (resource quota sets) set up?
- Are the nodes in question completely free, with nothing running in other queues?
- How many cores are in each machine, and how many were requested?
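
The four questions above can be answered with standard SGE client commands. The following is a rough sketch, not a definitive procedure: `sge_diag` and its argument order are hypothetical, and the job, PE, queue and host names are supplied by the caller.

```shell
# Sketch only: sge_diag is a hypothetical wrapper, not part of SGE.
# Usage: sge_diag <job_id> <pe_name> <queue_name> <host>
sge_diag() {
    if [ "$#" -ne 4 ]; then
        echo "usage: sge_diag <job_id> <pe_name> <queue_name> <host>" >&2
        return 2
    fi
    job_id=$1
    pe=$2
    queue=$3
    host=$4

    # Stop early when the SGE client tools are not in PATH.
    for tool in qstat qconf qhost; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "sge_diag: $tool not found in PATH" >&2
            return 1
        }
    done

    # 1. Resources and PE slot range requested by the job
    #    (see "hard resource_list" and "parallel environment").
    qstat -j "$job_id"

    # 2. Resource quota sets: names first, then the full definitions.
    qconf -srqsl
    qconf -srqs

    # 3. Core count (NCPU column) of the host, plus the queue instances and
    #    jobs currently on it, across all queues.
    qhost -h "$host" -q -j

    # 4. Slot settings of the PE and of the queue, to compare with the request.
    qconf -sp "$pe"
    qconf -sq "$queue"
}
```

For the job in this thread, a call might look like `sge_diag 133068 orte all.q n20`, with "n20" standing in for any of the troublesome nodes.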

-- Reuti


> It (133068) was submitted to the "all.q" queue.
> 
> Best,
> Stephen
> 
> 
> On Tue, Feb 11, 2014 at 2:48 PM, Reuti <[email protected]> wrote:
> On 11.02.2014 at 23:37, Stephen Spencer wrote:
> 
> > I did swap them initially, sorry.
> >
> > Yes, "qrsh -q all.q@n20 hostname" returns the appropriate FQDN.
> 
> So, you can reach the troublesome hosts now?
> 
> Next step is:
> 
> $ qalter -w v <job_id>
> $ qalter -w p <job_id>
> 
> with the waiting jobs.
> 
> -- Reuti
> 
> 
> >
> > Best,
> > Stephen
> >
> >
> > On Tue, Feb 11, 2014 at 2:33 PM, Reuti <[email protected]> wrote:
> > On 11.02.2014 at 23:20, Stephen Spencer wrote:
> >
> > > The definition of "qconf -sconf" is as you expected: all "builtin."
> > >
> > > Could you please be specific as to the commands you'd like me to try from 
> > > the next line?
> > >
> > > > Any output when you use the "-q ..." for `qrsh` too? In addition, you can 
> > > > try "-w v" and "-w p" too.
> >
> > I meant:
> >
> > $ qrsh -q all.q@n20 hostname
> >
> > (queue@host, did you swap them?)
> >
> > -- Reuti
> >
> >
> > >
> > > I tried "qrsh -w v" and "qrsh -w p" and both returned "verification: 
> > > found suitable queue(s)".
> > > "qrsh -q all.q" gave me a shell, surprisingly, on one of the troublesome 
> > > nodes. (Actually, it was three for three.)
> > > All nodes have "BIP" for "qtype" - no limitations, there.
> > >
> > > Best,
> > > Stephen
> > >
> > >
> > > On Tue, Feb 11, 2014 at 1:57 PM, Reuti <[email protected]> wrote:
> > > Hi,
> > >
> > > On 11.02.2014 at 22:37, Stephen Spencer wrote:
> > >
> > > > I have a sixty-node cluster running SGE 6.2u5 (RHEL 6.5).
> > > >
> > > > The immediate issue is that a user has jobs in the "qw" state, and 
> > > > there are idle nodes in the cluster which appear to be able to accept 
> > > > the jobs.
> > > >
> > > > What works and doesn't work?
> > > >       • "qsub -q [email protected] job.sh" works - the job runs on "n20"
> > > >       • Repeated invocations of "qrsh hostname" will not, however, 
> > > > result in the job running on one of the troublesome hosts.
> > >
> > > What is the definition of:
> > >
> > > $ qconf -sconf
> > > ...
> > > qlogin_command               builtin
> > > qlogin_daemon                builtin
> > > rlogin_command               builtin
> > > rlogin_daemon                builtin
> > > rsh_command                  builtin
> > > rsh_daemon                   builtin
> > >
> > > Any output when you use the "-q ..." for `qrsh` too? In addition, you can 
> > > try "-w v" and "-w p" too.
> > >
> > >
> > > > Things I've tried, and know, so far:
> > > >       • I've restarted the troublesome nodes - no change.
> > > >       • "sge_execd" is running on the troublesome nodes.
> > > >       • The troublesome nodes are in the execution host list and the 
> > > > submit host list.
> > > >       • Most of the rest of the cluster's pretty busy.
> > > >       • Interestingly, the troublesome nodes don't show up in the 
> > > > "scheduling info" list produced as part of the "qstat -j <jobid>" 
> > > > command's output.
> > > > Short of restarting the entire cluster, I'm at a loss as to what to 
> > > > look at next.
> > >
> > > Is "qtype INTERACTIVE" limited to certain nodes/queues?
> > >
> > > -- Reuti
> > >
> > >
> > > > --
> > > > Stephen Spencer
> > > > [email protected]
> > > > _______________________________________________
> > > > users mailing list
> > > > [email protected]
> > > > https://gridengine.org/mailman/listinfo/users


