Last week, our cluster (SGE 6.2u5) was working fine. We've got 4
machines designated for interactive use and batch jobs, in a queue that
can be subordinated when needed, and many batch-only nodes.
We're using SSH for qlogin, with the qlogin command set to:
-
#!/bin/sh
# SGE passes the target host and port as arguments
HOST=$1
PORT=$2
exec /usr/bin/ssh -t -t -X -Y -p "$PORT" "$HOST"
-
On Friday, we changed datacenters and IP numbers. All hostnames (local and
FQDN) stayed the same.
As of today:
- qlogin from the headnode to node interactive1 is fine
- qlogin from the headnode to nodes interactive[2-4] fails with a timeout
- qsub jobs from the headnode to all nodes (including interactive[2-4])
  work fine
All IP changes were scripted, and seem to have been complete. A simple
check (grep -lr old.IP.subnet /etc /opt/gridengine) reveals no files
that were not updated on interactive[2-4].
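Since qlogin here rides on ssh while qsub does not, stale forward or
reverse DNS after the renumbering is one thing worth ruling out (sshd
commonly stalls on reverse lookups). A quick sketch of such a check --
check_dns is a hypothetical helper, not an SGE tool:

```shell
#!/bin/sh
# Check that forward and reverse DNS agree for each node named on the
# command line, e.g.: ./check_dns.sh interactive1 interactive2
check_dns() {
    node=$1
    # forward lookup: first address getent returns for the name
    ip=$(getent hosts "$node" | awk '{print $1; exit}')
    if [ -z "$ip" ]; then
        echo "$node: forward lookup FAILED"
        return 1
    fi
    # reverse lookup: canonical name getent returns for that address
    rev=$(getent hosts "$ip" | awk '{print $2; exit}')
    echo "$node -> $ip -> ${rev:-NO-REVERSE}"
}

for node in "$@"; do
    check_dns "$node"
done
```

A node whose line ends in NO-REVERSE, or in a name other than the one
you started from, would be a prime suspect for an ssh-side timeout.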
The $SGE_ROOT/$SGE_CELL/spool/qmaster/messages file contains entries like:
---
08/22/2011 19:00:35|worker|headnode|W|job 2148448.1 failed on host
interactive2.fqdn assumedly after job because: job 2148448.1 died through
signal KILL (9)
---
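The qmaster log above only records the job dying with SIGKILL; the
execd messages file on the failing node usually has the real reason.
A sketch for pulling it from the headnode -- tail_execd_log is a
hypothetical helper, and the path assumes classic (shared) spooling,
so adjust if the cell uses local spool directories:

```shell
#!/bin/sh
# Show the tail of a node's execd messages file under the shared spool.
tail_execd_log() {
    node=$1
    spool=${SGE_ROOT:?SGE_ROOT not set}/${SGE_CELL:-default}/spool
    if [ -f "$spool/$node/messages" ]; then
        echo "== $node =="
        tail -n 20 "$spool/$node/messages"
    else
        echo "no messages file at $spool/$node/messages"
    fi
}

# Usage (on the headnode, with the SGE environment sourced):
#   tail_execd_log interactive2
```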
I've seen many discussions about debugging qlogin timeouts, but no common
threads or solutions.
Does anyone have suggestions for debugging this particular case?
Thanks,
Mark
___
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users