Hi All.

To expand a bit on what is going on: we are running Grid Engine 8.1.2 on 
Rocks 6.1, which we use as the clustering software.

We have a program that does not respect the number of cores it requests, so 
the node easily becomes overloaded.

To keep the node load from going through the roof, I have set up the 
thresholds on the queue as follows:

   load_thresholds       NONE
   suspend_thresholds    np_load_avg=1.1
   nsuspend              4
   suspend_interval      00:01:00
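
As I understand it, np_load_avg is the load average divided by the number of 
processors on the host, so the 1.1 threshold above should trip once the load 
is about 10% above the core count. Below is a rough Python sketch of that 
check. The function names are my own for illustration, not part of Grid 
Engine, and it does not model nsuspend or suspend_interval.

```python
def np_load_avg(load_avg: float, num_proc: int) -> float:
    """Return the load average normalized by processor count.

    Hypothetical helper: it mirrors how I understand np_load_avg to be
    derived, it is not Grid Engine code.
    """
    if num_proc <= 0:
        raise ValueError("num_proc must be positive")
    return load_avg / num_proc


def exceeds_suspend_threshold(load_avg: float, num_proc: int,
                              threshold: float = 1.1) -> bool:
    """True when the normalized load is above the suspend threshold."""
    return np_load_avg(load_avg, num_proc) > threshold


# Example: a 16-core node with a load average of 20.0 has a normalized
# load of 1.25, which is above the 1.1 threshold.
print(exceeds_suspend_threshold(20.0, 16))  # True
```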


Grid Engine does a nice job of keeping the load in check by suspending jobs 
(state T) when the load exceeds the threshold.

However, one of our users, Mathew, can no longer ssh to a node once it hits 
a high load.

Mathew can ssh to any node just fine before his jobs start. Once the jobs 
are running and the nodes are overloaded, ssh no longer works.

The error:

   $ ssh compute-2-1
   Connection to compute-2-1 closed by remote host.
   Connection to compute-2-1 closed.

I put this same question to the Rocks community, but nobody there knew why.

Is Grid Engine doing anything to ssh to keep the node from accepting any more 
new jobs and thus not allowing an ssh connection when overloaded?

Joseph