Last week, our cluster (SGE 6.2u5) was working fine. We have 4
machines designated for both interactive use and batch jobs, in a queue
that can be subordinated when needed, plus many batch-only nodes.

We're using SSH for qlogin, with the qlogin command set to:

-------------------------------------
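        #!/bin/sh
        # qlogin wrapper: SGE passes the target host and port as $1 and $2
        HOST=$1
        PORT=$2
        # quote the arguments so empty or odd values are not word-split
        exec /usr/bin/ssh -t -t -X -Y -p "$PORT" "$HOST"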
-------------------------------------


On Friday, we changed datacenters and IP addresses. All hostnames (short
and FQDN) stayed the same.

As of today:

        qlogin from the headnode to node interactive1 is fine

        qlogin from the headnode to nodes interactive[2-4] fails with
        a timeout

        qsub jobs from the headnode to all nodes (including
        interactive2-4) work fine

All IP changes were scripted, and seem to have been complete. A simple
check (grep -lr old.IP.subnet /etc /opt/gridengine) reveals no files
that were not updated on interactive[2-4].
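
The check above can be sketched as a small shell function, which may be
easier to rerun on each node. This is only an illustration: the function
name find_stale_ip_refs and the example prefix 10.1.2. are placeholders,
not the values actually used here. It also adds -F so the dots in the
prefix are matched literally rather than as regex wildcards.

```shell
#!/bin/sh
# Sketch: list files under the given directories that still mention an
# old subnet prefix. The function name and its arguments are hypothetical.
find_stale_ip_refs() {
    old_subnet=$1
    shift
    # -r recurse, -l print only matching file names, -F literal match;
    # errors from unreadable files are discarded.
    grep -rlF -- "$old_subnet" "$@" 2>/dev/null
}
```

Example use (placeholder prefix): find_stale_ip_refs 10.1.2. /etc /opt/gridengine
No output means no file under those directories still contains the prefix.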

The $SGE_ROOT/$SGE_CELL/spool/qmaster/messages file contains entries like:

-----------------------
08/22/2011 19:00:35|worker|headnode|W|job 2148448.1 failed on host 
interactive2.fqdn assumedly after job because: job 2148448.1 died through 
signal KILL (9)
-----------------------

I've seen many discussions about debugging qlogin timeouts, but no common
threads or solutions.

Are there any suggestions about debugging this instance?

Thanks,

Mark

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
