I'd be willing to bet that the output of "qstat -f -u '*'" shows all of your compute nodes in 'au' state (alarm, unknown), which means the qmaster cannot reach them.

If there is no sge_execd process running on a compute node, then Grid Engine can't dispatch any work to that node.

The errors you see and the jobs pending forever in wait state are just symptoms of the real problem -- you have no functional grid to which the jobs can be dispatched.

Basically, your compute nodes fell over. If you can restart SGE on those nodes and monitor with 'qstat -f' to confirm that the 'au' state goes away, then your jobs should start flowing again.
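
If it helps, here is a rough sketch of how you could pull the affected queue instances out of the qstat output. The function name is something I made up, and it assumes the usual six-column 'qstat -f' queue line (queuename, qtype, slots, load_avg, arch, states), so check it against what your version actually prints.

```shell
#!/bin/sh
# Sketch only: list_unreachable_queues is a made-up helper name, and the
# column layout below is an assumption about your qstat -f output.

# Read `qstat -f` output on stdin and print every queue instance whose
# states column contains 'u' (unknown: the qmaster cannot reach sge_execd).
# Queue lines look like:
#   queuename  qtype  resv/used/tot.  load_avg  arch  states
# Healthy queues have an empty states column, so they only have 5 fields.
# Job lines and the header are skipped because their first field has no '@'.
list_unreachable_queues() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /u/ { print $1 }'
}
```

You would run it as "qstat -f -u '*' | list_unreachable_queues". On each node it lists, "pgrep -x sge_execd" tells you whether the daemon is running. If it is not, start it with whatever startup script your install uses (often $SGE_ROOT/$SGE_CELL/common/sgeexecd, but that depends on how SGE was installed at your site).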

Chris



Pat Haley wrote:

We have also noticed that there are no SGE daemons running on any of the execution nodes (I don't know if that is normal or not). We have also collected the information below from qconf. Any help in resolving this would be greatly appreciated.
