Hi,

On 31.08.2012 at 11:57, P. Golik wrote:

> I've been using SGE 8.0.0c for about 10 months now. Recently I observed execution 
> hosts failing randomly. The execd is running, but the jobs scheduled to this 
> host hang forever in state "t", or in "dt" once I try to delete them. The 
> log contains:
> 
> 08/26/2012 09:28:38|  main|my-exechost|W|can't register at qmaster 
> "my-masterhost": abort qmaster registration due to communication errors
> 08/26/2012 09:28:38|  main|my-exechost|E|commlib error: access denied (client 
> IP resolved to host name "my-gateway". This is not identical to clients host 
> name "my-exechost")
> 08/26/2012 09:31:10|  main|my-exechost|E|commlib error: endpoint is not 
> unique error (endpoint "my-masterhost/qmaster/1" is already connected)

When this happens, can you check with the tools in $SGE_ROOT/utilbin/lx-amd64 
(`gethostbyaddr -all ...` and `gethostbyname -all ...`) whether the output is 
correct?
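
If the utilbin tools are not at hand, roughly the same forward and reverse 
check can be sketched in Python. This is only an illustration using the 
standard `socket` module, which goes through the system resolver; it is not 
what SGE runs internally. The function name `check_forward_reverse` and its 
`forward`/`reverse` parameters are made up for this example.

```python
import socket


def check_forward_reverse(hostname,
                          forward=socket.gethostbyname_ex,
                          reverse=socket.gethostbyaddr):
    """Resolve hostname to its addresses, then resolve each address back.

    Returns a list of (ip, reverse_name, matches) tuples. `matches` is True
    when the reverse lookup yields `hostname` as primary name or alias
    (compared case-insensitively). The resolver functions are parameters
    only so that they can be replaced in a test.
    """
    _, _, addresses = forward(hostname)
    results = []
    for ip in addresses:
        try:
            reverse_name, aliases, _ = reverse(ip)
        except OSError:
            # No reverse entry at all for this address.
            results.append((ip, None, False))
            continue
        names = {name.lower() for name in [reverse_name, *aliases]}
        results.append((ip, reverse_name, hostname.lower() in names))
    return results
```

Calling `check_forward_reverse("my-exechost")` on both the exechost and the 
masterhost while the problem is present should show whether either side gets 
"my-gateway" back for the exechost's address.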


> After that, no job can be successfully scheduled to this host. The master logs 
> the same problem:
> 
> 08/26/2012 09:28:21|listen|my-masterhost|E|commlib error: local host name 
> error (IP based host name resolving "my-gateway" doesn't match client host 
> name from connect message "my-exechost")
> 
> A "fix", or rather workaround, is to restart sge_execd.
> 
> Since the host name of the exechost is getting confused with the host name of 
> my network gateway, the reason appears to be some weird DNS setup.

I think the same.


> Both my-exechost and my-masterhost are in the same network and don't need 
> the gateway to communicate (also checked with traceroute). The exechost 
> uses the masterhost as its DNS server, and the masterhost has the 
> correct entries in its /etc/hosts, so that lookups are working fine on both 
> hosts.

Is the DNS request forwarded to any external resolver? Usually "named" has its 
own configuration file and doesn't look into /etc/hosts.


> Now, in order to further debug my problem, I have a couple of short questions:
> 
> 1) Does the error message indicate that the problem is the masterhost failing 
> to resolve the (real) IP of the exechost correctly?

Not the IP, the name. The qmaster uses the client's IP to look up the name, and 
the answer is "my-gateway" where it should be "my-exechost".

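To make the failing comparison concrete, here is a simplified model of it in 
Python. It is a sketch based only on the log messages quoted above, not the 
actual commlib code; the function name `name_check` is hypothetical.

```python
import socket


def name_check(client_ip, claimed_name, reverse=socket.gethostbyaddr):
    """Model of the check: reverse-resolve the client's IP and compare the
    result with the host name the client sent in its connect message.

    Returns (accepted, resolved_name). The comparison is case-insensitive;
    how the real code treats domains and aliases is not modelled here.
    """
    try:
        resolved_name = reverse(client_ip)[0]
    except OSError:
        # The address has no reverse entry, so nothing can match.
        return False, None
    return resolved_name.lower() == claimed_name.lower(), resolved_name
```

With a reverse lookup that answers "my-gateway" for the exechost's address, 
this returns `(False, "my-gateway")`, which corresponds to the "access denied" 
entry in the log.
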
-- Reuti


> 2) Or does the DNS lookup work fine, but the IP the master is getting is 
> really that of the gateway host? If so, what might be the reason for that?
> 
> 3) The client doesn't do anything wrong, does it?
> 
> 
> Thanks for looking into my problem!
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

