Hi,

On 23.08.2011 at 01:18, [email protected] wrote:

> Last week, our cluster (SGE 6.2u5) was working fine. We've got 4
> machines designated for interactive use and batch jobs, in a queue that
> can be subordinated when needed, and many batch-only nodes.
>
> We're using SSH for qlogin, with the qlogin command set to:
>
> -------------------------------------
> #!/bin/sh
> HOST=$1
> PORT=$2
> exec /usr/bin/ssh -t -t -X -Y -p $PORT $HOST
> -------------------------------------
>
>
> On Friday, we changed datacenters and IP numbers. All hostnames (local and
> fqdn) stayed the same.
>
> As of today:
>
> qlogin from the headnode to node interactive1 is fine
>
> qlogin from the headnode to nodes interactive[2-4] fails with
> a timeout
>
> qsub jobs from the headnode to all nodes (including
> interactive2-4) work fine
>
> All IP changes were scripted, and seem to have been complete. A simple
> check (grep -lr old.IP.subnet /etc /opt/gridengine) reveals no files
> that were not updated on interactive[2-4]
>
> The $SGE_ROOT/$SGE_CELL/spool/qmaster/messages file contains entries like:
>
> -----------------------
> 08/22/2011 19:00:35|worker|headnode|W|job 2148448.1 failed on host
> interactive2.fqdn assumedly after job because: job 2148448.1 died through
> signal KILL (9)
> -----------------------
>
> I've seen many discussions about debugging qlogin timeouts, but no common
> threads or solutions.
>
> Are there any suggestions about debugging this instance?

It would have been best to remove all machines beforehand and add them
again later on. There is some name caching in SGE which might block the
readdressed nodes. You can use the tools in $SGE_ROOT/utilbin/lx24-amd64
to check what SGE thinks the nodes' addresses are. Would it help to remove
them now and add them again?

Regarding ssh: did you set up hostbased authentication, passphraseless
ssh-keys and/or an updated known_hosts file?

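As a rough sketch of that address check, something like the following could compare what the system resolver returns with what SGE's own helper reports. It is only an illustration under some assumptions: that the helper is the gethostbyname binary in the utilbin directory, that its output contains the resolved address somewhere, that getent is available, and that /opt/gridengine is a sensible fallback when SGE_ROOT is unset. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch only: compare the OS resolver's address for each host with the
# output of SGE's utilbin helper. Paths and helper behaviour are assumptions;
# check the helper's own usage message on your installation.

SGE_UTILBIN="${SGE_ROOT:-/opt/gridengine}/utilbin/lx24-amd64"

# system_addr HOST: first address the system resolver returns for HOST.
system_addr() {
    getent hosts "$1" | awk 'NR == 1 { print $1 }'
}

# addr_in_output ADDR TEXT: succeed if ADDR appears as a whole word in TEXT.
# An empty ADDR never matches.
addr_in_output() {
    [ -n "$1" ] || return 1
    printf '%s\n' "$2" | grep -qwF -- "$1"
}

# check_host HOST: print OK or MISMATCH for one host.
check_host() {
    host=$1
    sys=$(system_addr "$host")
    if [ ! -x "$SGE_UTILBIN/gethostbyname" ]; then
        echo "$host: $SGE_UTILBIN/gethostbyname not found"
        return 2
    fi
    sge=$("$SGE_UTILBIN/gethostbyname" "$host" 2>&1)
    if addr_in_output "$sys" "$sge"; then
        echo "$host: OK ($sys)"
    else
        echo "$host: MISMATCH (system resolver says: ${sys:-nothing})"
        echo "$sge"
    fi
}

# Hosts are taken from the command line; with no arguments nothing is run.
for h in "$@"; do
    check_host "$h"
done
```

Saved under a name of your choice and run on the headnode with interactive1 to interactive4 as arguments, a MISMATCH for interactive[2-4] but not for interactive1 would point at stale name information on the SGE side rather than at ssh.
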
-- Reuti

> Thanks,
>
> Mark
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
