For the mailing list archives: I have more information about this problem and a workaround which seems to be helping.
To summarize: I'm seeing an issue where SGE appears to be killing all jobs, with the following in the qmaster messages file:

    07/09/2011 02:14:07|worker|betsy-qmaster|E|[email protected] reports running job (16648.32/master) in queue "[email protected]" that was not supposed to be there - killing

All jobs are killed on all nodes in the cluster. This occurs about 15 minutes after a node dies.

I have these settings (from qconf -sconf):

    load_report_time             00:00:40
    max_unheard                  00:05:00
    reschedule_unknown           00:15:00
    qmaster_params               ENABLE_RESCHEDULE_KILL=true \
                                 ENABLE_RESCHEDULE_SLAVE=true

Other notes: Running Sun SGE 6.2u5. Compute nodes are diskless and do not mount a shared sge_root.

My partial solution was to restore reschedule_unknown and qmaster_params to their default values:

    reschedule_unknown           00:00:00
    qmaster_params               none

This seems to have solved my immediate problem. I changed both variables at the same time and did not test which specific setting was causing the problem.

Remaining issue: The original problem that led me to set these variables in the first place is still there. If a node dies or is rebooted, SGE does nothing about hung jobs when the node comes back online. The jobs continue to appear in the queue as if they were running. This may be related to my using diskless nodes, where the local spool directory is cleared on reboot. I will be looking into putting the execd spool files on a shared directory in the future, which may address this problem.

Stuart Barkley
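
For anyone comparing their own cluster against the settings described in this message, here is a minimal sketch in Python that parses text in the style of qconf -sconf output and reports whether reschedule_unknown or qmaster_params differ from the defaults quoted above. The function names, the SAMPLE text, and the simple "key value" parsing with backslash continuation are assumptions made for illustration; this is not part of SGE and has only been checked against the sample shown.

```python
# Sketch only: parse_sge_conf, non_default and SAMPLE are hypothetical helpers,
# not part of SGE. Defaults are the two values quoted in the message above.
DEFAULTS = {
    "reschedule_unknown": "00:00:00",
    "qmaster_params": "none",
}

# Sample text modelled on the settings quoted in the message.
SAMPLE = r"""
load_report_time             00:00:40
max_unheard                  00:05:00
reschedule_unknown           00:15:00
qmaster_params               ENABLE_RESCHEDULE_KILL=true \
                             ENABLE_RESCHEDULE_SLAVE=true
"""


def parse_sge_conf(text):
    """Turn 'key value' lines into a dict, joining backslash-continued lines."""
    settings = {}
    pending = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        if line.endswith("\\"):
            # Continuation: remember this part and keep reading.
            pending += line[:-1].rstrip() + " "
            continue
        parts = (pending + line).split(None, 1)
        pending = ""
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        settings[key] = " ".join(value.split())  # normalise whitespace
    return settings


def non_default(settings):
    """Return the watched settings whose values differ from DEFAULTS."""
    return {
        key: settings[key]
        for key, default in DEFAULTS.items()
        if key in settings and settings[key] != default
    }


if __name__ == "__main__":
    for key, value in non_default(parse_sge_conf(SAMPLE)).items():
        print(f"{key} is set to '{value}' (default: '{DEFAULTS[key]}')")
```

To use it on a real cluster, replace SAMPLE with the saved output of qconf -sconf. The check only reports differences; it does not change any configuration.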
