Hi,

We have had Grid Engine running successfully on ~70 machines
for several months.  However, it recently crashed overnight: the
shadow daemons didn't seem to kick in and the master won't restart.


The qmaster messages file has entries like:


02/23/2011 10:29:30|  main|node001|E|error reading file: "/usr/local/sge/default/spool/qmaster/qinstances/node011/node011"
02/23/2011 10:29:30|  main|node001|E|error reading file: "/usr/local/sge/default/spool/qmaster/qinstances/node024/node024"
02/23/2011 10:29:30|  main|node001|E|error reading file: "/usr/local/sge/default/spool/qmaster/qinstances/node040/node040"
02/23/2011 10:29:30|  main|node001|E|error reading file: "/usr/local/sge/default/spool/qmaster/qinstances/node022/node022"
02/23/2011 10:29:30|  main|node001|I|read job database with 35 entries in 0 seconds
02/23/2011 10:29:30|  main|node001|E|can't find queue "node006@node006" referenced in job 5189


However, the files in question exist, have the correct ownership and permissions
and seem to have meaningful data (when compared to those from another, working
Grid Engine cluster).
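
For anyone wanting to repeat that check, it could be scripted roughly as
follows.  This is only a sketch, not a Grid Engine tool: the helper names
(failed_paths, describe, report) are made up for illustration, and the
location of the messages file is an assumption based on the spool paths in
the log above.

```python
import os
import pwd
import re
import stat

# Matches the "error reading file" entries in the qmaster messages file.
# \s* tolerates a line break between the message and the quoted path.
ERROR_RE = re.compile(r'\|E\|error reading file:\s*"([^"]+)"')


def failed_paths(messages_text):
    """Return the spool file paths named in 'error reading file' entries."""
    return ERROR_RE.findall(messages_text)


def describe(path):
    """Summarise existence, owner, mode and size of one spool file."""
    try:
        st = os.stat(path)
    except OSError as exc:
        return {"path": path, "exists": False, "error": exc.strerror}
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid with no passwd entry
    return {
        "path": path,
        "exists": True,
        "owner": owner,
        "mode": stat.filemode(st.st_mode),
        "size": st.st_size,
    }


def report(messages_path):
    """Print a summary for every file the qmaster failed to read."""
    with open(messages_path, errors="replace") as fh:
        for path in failed_paths(fh.read()):
            print(describe(path))
```

Calling report("/usr/local/sge/default/spool/qmaster/messages") (assumed
path) prints one line per affected file.  It only confirms that the files
exist and are readable by stat; it says nothing about whether their contents
are valid for the qmaster.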

Any ideas on how I can restart the Grid Engine master?

Thanks in advance,

Dave



--
___________________________________________________
David Robson

CODAS & IT Department, Culham Centre for Fusion Energy,
Culham Science Centre Abingdon OX14 3DB
Voice: +44(0)1235-46-4569, Fax: 4404