Hi,

On 31.08.2012 at 11:57, P. Golik wrote:

> I'm using SGE 8.0.0c for about 10 months now. Recently I observed execution
> hosts failing randomly. The execd is loaded, but the jobs scheduled to this
> host keep hanging forever in state "t" or "dt" once I try to delete them. The
> log contains:
>
> 08/26/2012 09:28:38| main|my-exechost|W|can't register at qmaster
> "my-masterhost": abort qmaster registration due to communication errors
> 08/26/2012 09:28:38| main|my-exechost|E|commlib error: access denied (client
> IP resolved to host name "my-gateway". This is not identical to clients host
> name "my-exechost")
> 08/26/2012 09:31:10| main|my-exechost|E|commlib error: endpoint is not
> unique error (endpoint "my-masterhost/qmaster/1" is already connected)

Can you check in such a situation with the tools in $SGE_ROOT/utilbin/lx-amd64 `gethostbyaddr -all ...` and `gethostbyname -all ...` whether the output is correct.

> after that no job can be successfully scheduled to this host. The master logs
> the same:
>
> 08/26/2012 09:28:21|listen|my-masterhost|E|commlib error: local host name
> error (IP based host name resolving "my-gateway" doesn't match client host
> name from connect message "my-exechost")
>
> A "fix", or rather workaround, is to restart sge_execd.
>
> Since the host name of the exechost is getting confused with the host name of
> my network gateway, the reason appears to be some weird DNS setup.

I think the same.

> Both my-exechost and my-masterhost are in the same network and don't need
> the gateway to communicate (also checked with traceroute). The exechost
> points to the masterhost to resolve DNS queries, and the masterhost has the
> correct entries in his /etc/hosts, so that lookups are working fine on both
> hosts.

Is the DNS request forwarded to any external resolver? Usually "named" has its own configuration file and doesn't look into /etc/hosts.

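
In case it helps to run the name check repeatedly while waiting for the problem to show up again, here is a rough Python sketch of the same idea as the `gethostbyname` / `gethostbyaddr` tools mentioned above: look up the name, look each resulting IP back up, and report any address whose reverse name differs. It is only an illustration and not how SGE's commlib does it. The function names are made up for this example, and comparing only the short host name (the part before the first dot) is a simplifying assumption.

```python
import socket
import sys


def forward_lookup(name):
    """Return the IPv4 addresses the system resolver gives for a host name."""
    _, _, addresses = socket.gethostbyname_ex(name)
    return addresses


def reverse_lookup(address):
    """Return the primary host name the system resolver gives for an address."""
    primary, _, _ = socket.gethostbyaddr(address)
    return primary


def short_name(name):
    """Strip the domain part and lowercase (assumption: short names suffice)."""
    return name.split(".", 1)[0].lower()


def find_mismatches(name, forward=forward_lookup, reverse=reverse_lookup):
    """Return (address, reverse_name) pairs where the reverse name differs.

    The resolver functions are parameters so the logic can be exercised
    without touching the real DNS setup.
    """
    mismatches = []
    for address in forward(name):
        resolved = reverse(address)
        if short_name(resolved) != short_name(name):
            mismatches.append((address, resolved))
    return mismatches


if __name__ == "__main__":
    # Usage: python check_names.py my-exechost my-masterhost
    for host in sys.argv[1:]:
        try:
            bad = find_mismatches(host)
        except OSError as exc:  # socket.gaierror and socket.herror derive from OSError
            print(f"{host}: lookup failed: {exc}")
            continue
        if not bad:
            print(f"{host}: forward and reverse lookups agree")
        for address, resolved in bad:
            print(f"{host}: {address} resolves back to {resolved!r}")
```

Run on both the exechost and the masterhost, a line such as "resolves back to 'my-gateway'" would correspond to the situation in the log above. Python goes through the system resolver here, which may or may not behave exactly like the SGE utilities on a given host.
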
> Now, in order to further debug my problem, I have a couple of short questions:
>
> 1) Does the error message indicate that the problem is the masterhost failing
> to resolve the (real) IP of the exechost correctly?

Not the IP, the name. It uses the IP to get the name and the answer is "my-gateway", it should be "my-exechost".

-- Reuti

> 2) Or does the DNS lookup work fine, but the IP the master is getting is
> really the one of the gateway host? If so, what might be the reason for that?
>
> 3) The client doesn't do anything wrong, does it?
>
>
> Thanks for looking into my problem!
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
