I'd be willing to bet that the output of "qstat -f -u '*'" shows all of your compute nodes in the 'au' (alarm, unknown) state.
If there is no sge_execd process running on a compute node, then Grid Engine can't dispatch "work" to that node. With no sge_execd running anywhere, it can't dispatch work at all.
The errors you see and the jobs pending forever in the wait state are just symptoms of the real problem: you have no functional grid to dispatch the jobs to.
Basically, your compute nodes fell over. If you can restart SGE on those nodes and monitor with 'qstat -f' to confirm that the 'au' state goes away, your jobs should start flowing again.
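
If you would rather not eyeball the 'qstat -f' output on every pass, here is a rough Python sketch that picks out the queue instances still flagged 'au'. It assumes the usual column layout, where queue-instance rows start with "queue@host" and the state codes sit in the sixth column. The function names are just ones I made up for illustration, so adjust to whatever your version prints.

```python
import subprocess


def fetch_qstat_full():
    """Return the text printed by `qstat -f -u '*'` (needs the SGE tools on PATH)."""
    result = subprocess.run(
        ["qstat", "-f", "-u", "*"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def find_au_queues(qstat_text):
    """List queue instances whose state column contains both 'a' and 'u'."""
    flagged = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Assumption: queue-instance rows start with "queue@host" and carry
        # their state codes in the sixth column; headers, separator lines,
        # job rows and queues with an empty state column are skipped.
        if len(fields) < 6 or "@" not in fields[0]:
            continue
        states = fields[5]
        if "a" in states and "u" in states:
            flagged.append(fields[0])
    return flagged
```

Calling find_au_queues(fetch_qstat_full()) on the qmaster host after restarting the execution daemons should return an empty list once every node has reported back in.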
Chris

Pat Haley wrote:
We have also noticed that there are no SGE daemons running on any of the execution nodes (I don't know if that is normal or not). We have also collected the information below from qconf. Any help in resolving this would be greatly appreciated.