On Wed, 23 Feb 2011, Robson, David W wrote:
...
The qmaster message file has entries like:
02/23/2011 10:29:30| main|node001|E|error reading file: "/usr/local/sge/default/spool/qmaster/qinstances/node011/node011"
02/23/2011 10:29:30| main|node001|E|error reading file: "/usr/local/sge/default/spool/qmaster/qinstances/node024/node024"
02/23/2011 10:29:30| main|node001|E|error reading file: "/usr/local/sge/default/spool/qmaster/qinstances/node040/node040"
02/23/2011 10:29:30| main|node001|E|error reading file: "/usr/local/sge/default/spool/qmaster/qinstances/node022/node022"
02/23/2011 10:29:30| main|node001|I|read job database with 35 entries in 0 seconds
02/23/2011 10:29:30| main|node001|E|can't find queue "node006@node006" referenced in job 5189
However, the files in question exist, have the correct ownership and permissions
and seem to have meaningful data (when compared to those from another, working
Grid Engine cluster).
Any ideas on how I can restart the Grid Engine master?
...
Someone else might be able to give a much more specific answer, but...
I had a similar problem with the contents of .../spool/qmaster/users and
traced it to two things:
1) It read user definitions normally until it reached one with a
formatting error
2) All subsequent (correct) user definitions were not read and produced
similar messages to the above - it was as if an error flag was raised for
that first incorrect user and wasn't cleared.
My advice is to look for the first error message and investigate that
issue thoroughly. You might find that it is the source of all the others.
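
If it helps, here is a rough Python sketch of one way to pull the first
error-level entry out of the qmaster messages file. It assumes the
"date time|thread|host|level|text" layout seen in the excerpt above, and
that wrapped lines belong to the preceding entry. The function names are
made up for illustration and are not part of Grid Engine.

```python
import re

# Assumed layout of one entry, taken from the excerpt in this thread:
#   MM/DD/YYYY HH:MM:SS| thread|host|LEVEL|text
ENTRY = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|\s*([^|]*)\|([^|]*)\|([A-Z])\|(.*)$"
)


def parse_entries(lines):
    """Yield (timestamp, level, text), joining wrapped continuation lines."""
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = ENTRY.match(line)
        if match:
            # A new entry starts here, so emit the one collected so far.
            if current is not None:
                yield tuple(current)
            current = [match.group(1), match.group(4), match.group(5).strip()]
        elif current is not None and line.strip():
            # Not a new entry: treat it as the wrapped tail of the previous one.
            current[2] += " " + line.strip()
    if current is not None:
        yield tuple(current)


def first_error(lines):
    """Return the first entry whose level is 'E', or None if there is none."""
    for entry in parse_entries(lines):
        if entry[1] == "E":
            return entry
    return None


if __name__ == "__main__":
    sample = [
        "02/23/2011 10:29:30| main|node001|I|read job database with 35 entries in 0",
        "seconds",
        "02/23/2011 10:29:30| main|node001|E|can't find queue \"node006@node006\"",
        "referenced in job 5189",
    ]
    print(first_error(sample))
```

To run it against a real log, pass an open file instead of the sample
list, e.g. first_error(open(path)), where path is wherever your qmaster
writes its messages file (I would guess somewhere under
/usr/local/sge/default/spool/qmaster/ in this case, but check).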
Mark
--
-----------------------------------------------------------------
Mark Dixon Email : [email protected]
HPC/Grid Systems Support Tel (int): 35429
Information Systems Services Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------