I have a sixty-node cluster running SGE 6.2u5 (RHEL 6.5).

The immediate issue is that a user has jobs in the "qw" state, and there
are idle nodes in the cluster which appear to be able to accept the jobs.

What works and what doesn't:

   - "qsub -q [email protected] job.sh" works - the job runs on "n20"
   - Repeated invocations of "qrsh hostname", however, never end up
   running on any of the troublesome hosts.

Things I've tried, and know, so far:

   - I've restarted the troublesome nodes - no change.
   - "sge_execd" is running on the troublesome nodes.
   - The troublesome nodes are in the execution host list and the submit
   host list.
   - Most of the rest of the cluster's pretty busy.
   - Interestingly, the troublesome nodes don't show up in the "scheduling
   info" list produced as part of the "qstat -j <jobid>" command's output.
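
For reference, the checks above correspond roughly to the commands printed
by the sketch below. It is only an illustration: the function name
"sge_checks" and the job id 12345 are placeholders, "n20" is the host from
the qsub example, and checking sge_execd over ssh with pgrep is an
assumption about how the nodes are reached. The function only prints the
commands and runs nothing itself.

```shell
#!/bin/sh
# Print the commands behind the checks listed above, one per line.
# Nothing is executed here; review the output, then run the lines by hand
# (or pipe them to "sh") on a submit host.
# Arguments: $1 = suspect host, $2 = id of a job stuck in "qw".
# The function name and both example values are placeholders.
sge_checks() {
    host=$1
    jobid=$2
    printf '%s\n' \
        "qconf -sel | grep -w $host" \
        "qconf -ss | grep -w $host" \
        "ssh $host pgrep -l sge_execd" \
        "qstat -j $jobid" \
        "qrsh hostname"
}

# Example: host n20 and a hypothetical job id.
sge_checks n20 12345
```

The first two lines check the execution host and submit host lists, the
third checks that sge_execd is running, and the last two reproduce the
"scheduling info" and qrsh observations.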

Short of restarting the entire cluster, I'm at a loss as to what to look at
next.
-- 
Stephen Spencer
[email protected]