Hi all,

To expand a bit on what is going on: we are running Grid Engine 8.1.2 on Rocks 6.1, which provides the clustering software.
We have a program that does not behave well with the number of cores being requested, so the node easily becomes overloaded. To keep the node load from going through the roof, I have set up thresholds on the queue:

    load_thresholds       NONE
    suspend_thresholds    np_load_avg=1.1
    nsuspend              4
    suspend_interval      00:01:00

Grid Engine does a nice job of keeping the load in check by suspending jobs (state T) when the load goes over the threshold.

However, Mathew can no longer ssh to a node once it hits a high load. He can ssh to any node just fine before his jobs start, but once the jobs are running and the nodes are overloaded, the connection is dropped. The error:

    $ ssh compute-2-1
    Connection to compute-2-1 closed by remote host.
    Connection to compute-2-1 closed.

I asked the same question in the Rocks community, but nobody there knows why.

Is Grid Engine doing anything to ssh to keep an overloaded node from accepting new jobs, and as a side effect refusing ssh connections?

Joseph
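
P.S. For anyone who wants to check the numbers on a node by hand, here is a minimal Python sketch of the comparison the suspend threshold above is making. It rests on two assumptions that are not confirmed in this thread: that np_load_avg is the 5-minute load average divided by the number of processors, and that the threshold counts as exceeded when the value is greater than or equal to the configured 1.1. The function names are made up for illustration and are not part of Grid Engine.

```python
import os

# Value taken from the queue setting: suspend_thresholds np_load_avg=1.1
SUSPEND_THRESHOLD = 1.1


def np_load_avg(load_avg: float, num_proc: int) -> float:
    """Return the load average normalized by the processor count.

    Assumption: this mirrors how Grid Engine derives np_load_avg.
    """
    if num_proc < 1:
        raise ValueError("num_proc must be at least 1")
    return load_avg / num_proc


def over_threshold(load_avg: float, num_proc: int,
                   threshold: float = SUSPEND_THRESHOLD) -> bool:
    """True if the normalized load reaches or exceeds the threshold.

    Assumption: the comparison is '>=', not strictly '>'.
    """
    return np_load_avg(load_avg, num_proc) >= threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute averages (Unix only).
    _, five_min, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"load_avg={five_min:.2f} cores={cores} "
          f"np_load_avg={np_load_avg(five_min, cores):.2f} "
          f"over_threshold={over_threshold(five_min, cores)}")
```

Run directly on a compute node, this prints the current normalized load and whether it is over the 1.1 threshold, which can be compared with what qhost reports for that host.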
