On 23.08.2011 at 19:40, [email protected] wrote:

> In the message dated: Tue, 23 Aug 2011 18:56:25 +0200,
> The pithy ruminations from Reuti on
> <Re: [gridengine users] cluster IP change, now qlogin timeout (4 s) expired
> while waiting on socket fd 4> were:
> => Hi,
> =>
> => On 23.08.2011 at 01:18, [email protected] wrote:
> =>
> => > Last week, our cluster (SGE 6.2u5) was working fine....we've got 4
> => > machines designated for interactive use and batch jobs, in a queue that
> => > can be subordinated when needed, and many batch-only nodes.
> => >
> => > We're using SSH for qlogin, with the qlogin command set to:
> => >
> => > -------------------------------------
> => > #!/bin/sh
> => > HOST=$1
> => > PORT=$2
> => > exec /usr/bin/ssh -t -t -X -Y -p $PORT $HOST
> => > -------------------------------------
> => >
> => >
> => > On Friday, we changed datacenters and IP numbers. All hostnames (local
> => > and fqdn) stayed the same.
> => >
> => > As of today:
> => >
> => >     qlogin from the headnode to node interactive1 is fine
> => >
> => >     qlogin from the headnode to nodes interactive[2-4] fails with
> => >     a timeout
> => >
> => >     qsub jobs from the headnode to all nodes (including
> => >     interactive2-4) work fine
> => >
> =>
> => It would have been best to remove all machines beforehand and add them
> => again later on. There is
>
> Oh. I guess I could do that now.
>
> => some name caching in SGE which might block the readdressed nodes. You can
> => use the tools in $SGE_ROOT/utilbin/lx24-amd64 to check what SGE thinks
> => about the nodes' addresses.
>
> Would that affect just qlogin and not qsub? Other tools (qhost) do see all the
> nodes.
>
> The "gethostbyname" and "gethostbyaddr" show the correct names (unchanged) and
> the new IP numbers for both the interactive1 node that works and for
> interactive[2-4] where qlogin fails.
>
> =>
> => Would it help to remove them now and add them again?
>
> I could do that.
>
> =>
> => Regarding ssh: did you set up hostbased authentication, passphraseless
> => ssh-keys and/or an updated known_hosts file?
>
> We've been using passphraseless ssh-keys to access the interactive nodes for
> about 2 years.
>
> User home directories are all NFS mounted, so the same ~/.ssh/ is available
> when a user is successful at doing a qlogin to "interactive1" and when qlogin
> fails to connect to "interactive[2-4]".
>
> I don't think this is an ssh problem...when I put debugging into the qlogin
> wrapper script that calls ssh (i.e., logging the user, destination, port),
> there is no evidence that the wrapper is ever called for the qlogin attempts
> to nodes "interactive[2-4]". The qlogin command seems to fail at the step when
> it connects from the head node to the interactive node in order to launch a
> single instance of ssh listening on a high-numbered port.
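
For comparison, the logging described above could be sketched roughly as below. This is only an illustration, not the script in use on that cluster: the `format_entry` helper, the `QLOGIN_WRAPPER_LOG` variable and the `/tmp/qlogin_wrapper.log` default are made-up names, and the ssh options are simply copied from the wrapper quoted earlier in the thread.

```shell
#!/bin/sh
# Sketch of a qlogin wrapper with logging (illustrative, not the poster's script).
# QLOGIN_WRAPPER_LOG and the /tmp default are assumptions; use any writable path.
LOGFILE="${QLOGIN_WRAPPER_LOG:-/tmp/qlogin_wrapper.log}"

# Build one log entry from user, destination host, and port (hypothetical helper).
format_entry() {
    printf 'user=%s host=%s port=%s' "$1" "$2" "$3"
}

# The wrapper is called as: wrapper HOST PORT
# With fewer than two arguments nothing is logged and ssh is not started.
if [ "$#" -ge 2 ]; then
    HOST=$1
    PORT=$2
    # Append a timestamped entry before handing over to ssh.
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(format_entry "$(id -un)" "$HOST" "$PORT")" >> "$LOGFILE"
    exec /usr/bin/ssh -t -t -X -Y -p "$PORT" "$HOST"
fi
```

If no entry for interactive[2-4] ever shows up in the log, the wrapper was not reached for those attempts, which would point at the step before ssh is started.
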

Can you submit a batch job to these nodes with `qsub -now y file.sh`? Does
this work too, and for the rest of the cluster as well?

In the known_hosts file, the entries relating IP address and hostname might
have changed. Does a plain `ssh` by hand still work as before?

-- Reuti

> Thanks,
>
> Mark
>
> =>
> => -- Reuti
> =>
> =>
> => > Thanks,
> => >
> => > Mark

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
