On 23.08.2011 at 19:40, [email protected] wrote:

> In the message dated: Tue, 23 Aug 2011 18:56:25 +0200,
> The pithy ruminations from Reuti on 
> <Re: [gridengine users] cluster IP change, now qlogin timeout (4 s) expired
> while waiting on socket fd 4> were:
> => Hi,
> => 
> => On 23.08.2011 at 01:18, [email protected] wrote:
> => 
> => > Last week, our cluster (SGE 6.2u5) was working fine....we've got 4
> => > machines designated for interactive use and batch jobs, in a queue that
> => > can be subordinated when needed, and many batch-only nodes.
> => > 
> => > We're using SSH for qlogin, with the qlogin command set to:
> => > 
> => > -------------------------------------
> => >  #!/bin/sh
> => >  HOST=$1
> => >  PORT=$2
> => >  exec /usr/bin/ssh -t -t -X -Y -p $PORT $HOST
> => > -------------------------------------
> => > 
> => > 
> => > On Friday, we changed datacenters and IP numbers. All hostnames (local
> => > and FQDN) stayed the same.
> => > 
> => > As of today:
> => > 
> => >  qlogin from the headnode to node interactive1 is fine
> => > 
> => >  qlogin from the headnode to nodes interactive[2-4] fail with
> => >  a timeout
> => > 
> => >  qsub jobs from the headnode to all nodes (including
> => >  interactive2-4) work fine
> => > 
> => 
> => It would have been best to remove all machines beforehand and add them
> => again later on. There is
> 
> Oh. I guess I could do that now.
> 
> => some name caching in SGE which might block the readdressed nodes. You can
> => try the tools in $SGE_ROOT/utilbin/lx24-amd64 to see what SGE thinks
> => about the nodes' addresses.
> 
> Would that affect just qlogin and not qsub? Other tools (qhost) do see all the
> nodes.
> 
> The "gethostbyname" and "gethostbyaddr" show the correct names (unchanged) and
> the new IP numbers for both the interactive1 node that works and for
> interactive[2-4] where qlogin fails.
> 
> => 
> => Would it help to remove them now and add them again?
> 
> I could do that.
> 
> => 
> => Regarding ssh: did you set up hostbased authentication, passphraseless
> => ssh-keys and/or an updated known_hosts file?
> 
> We've been using passphraseless ssh-keys to access the interactive nodes for
> about 2 years.
> 
> User home directories are all NFS mounted, so the same ~/.ssh/ is available
> when a user is successful at doing a qlogin to "interactive1" and when qlogin
> fails to connect to "interactive[2-4]".
> 
> I don't think this is an ssh problem...when I put debugging into the qlogin
> wrapper script that calls ssh (i.e., logging the user, destination, port),
> there is no evidence that the wrapper is ever called for the qlogin attempts
> to nodes "interactive[2-4]". The qlogin command seems to fail at the step when
> it connects from the head node to the interactive node in order to launch a
> single instance of ssh listening on a high-numbered port.

Can you submit a batch job to these nodes with `qsub -now y file.sh`? Does this
work too, and on the rest of the cluster as well?
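
To try the immediate submission on each interactive node in turn, something along these lines might do. It is only a sketch: the host names interactive1 to interactive4 are taken from your description, `file.sh` stands for any small test script, and the helper name `print_now_tests` is made up for illustration. It only prints the `qsub` calls, so nothing is submitted until you pipe the output to `sh`.

```shell
#!/bin/sh
# Sketch only. Assumptions: the nodes are called interactive1..interactive4
# and file.sh is any small test script in the current directory.
# The function prints one qsub call per node; it does not submit anything.
print_now_tests() {
    for n in 1 2 3 4; do
        # -now y asks for immediate scheduling, -l hostname= pins the node
        echo "qsub -now y -l hostname=interactive$n file.sh"
    done
}

print_now_tests
```

Review the printed lines and run them with `sh thisscript.sh | sh` if they look right. If the job for interactive1 starts at once and the ones for interactive2 to interactive4 are rejected, the problem is probably not specific to qlogin.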

For the known_hosts file, the entries relating TCP/IP address and name might have
changed. Is a plain `ssh` by hand working as before?
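
A quick way to check both points for all four nodes could look like the sketch below. It assumes OpenSSH on the head node and the host names interactive1 to interactive4 from your description; the helper name `print_ssh_checks` is made up. As written it only prints the commands, so you can run them one by one or pipe the output to `sh`.

```shell
#!/bin/sh
# Sketch only. Assumptions: OpenSSH client tools are installed and the
# nodes are called interactive1..interactive4.
# Prints the checks instead of running them.
print_ssh_checks() {
    for n in 1 2 3 4; do
        host="interactive$n"
        # show the cached known_hosts entry for this name, if there is one
        echo "ssh-keygen -F $host"
        # plain ssh on the default port: no password prompt, short timeout
        echo "ssh -o BatchMode=yes -o ConnectTimeout=5 $host hostname"
    done
}

print_ssh_checks
```

`BatchMode=yes` makes ssh fail instead of prompting, so a key or host key problem shows up as an error message rather than a hang.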

-- Reuti 


> Thanks,
> 
> Mark
> 
> => 
> => -- Reuti
> => 
> => 
> => > Thanks,
> => > 
> => > Mark
> => > 
> => > _______________________________________________
> => > users mailing list
> => > [email protected]
> => > https://gridengine.org/mailman/listinfo/users
> => 
> => 
> 

