The sge_execd process randomly stops on several Windows 2008 R2 x64 exec nodes while running jobs sent from the qmaster. The only message in the qmaster message file is a commlib error stating that it lost connectivity, and the exec node shows no error in its own message file.
My question is twofold. First, why is it crashing? Second, how can I have sge_execd restart automatically if it crashes? I'm also still trying to figure out how to get sge_execd to start automatically when the Windows exec node boots up. Apparently this is a known problem, at least according to members of the Oracle Grid Engine forums. Thanks!
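For the restart half of the question, the kind of stopgap I have in mind is a small watchdog wrapper. This is only a rough sketch, not something I have running: it assumes sge_execd can be launched so that it stays in the foreground instead of detaching, and that a non-zero exit code means a crash. The function name `supervise` and the path in the comment are made up for illustration.

```python
import subprocess
import time


def supervise(cmd, max_restarts=5, delay=5.0,
              run=subprocess.call, sleep=time.sleep):
    """Run cmd and restart it after each abnormal exit.

    Gives up after max_restarts restarts and returns the list of
    exit codes observed, one per run. The run and sleep parameters
    exist only so the loop can be exercised without a real process.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        code = run(cmd)      # blocks until the child process exits
        exit_codes.append(code)
        if code == 0:        # assumed to mean a clean shutdown: stop
            break
        sleep(delay)         # pause so a crash loop does not spin
    return exit_codes


# Hypothetical usage; the path is a placeholder, not a real install path:
# supervise([r"C:\path\to\sge_execd.exe"], max_restarts=10, delay=30)
```

Returning the exit codes at least leaves a trail that could be logged, since neither message file currently says why the daemon goes away.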
