Hi! I've been using SGE 8.0.0c for about 10 months now. Recently I have observed execution hosts failing randomly. The execd is loaded, but jobs scheduled to the affected host hang forever in state "t", or in state "dt" once I try to delete them. The log contains:
08/26/2012 09:28:38| main|my-exechost|W|can't register at qmaster "my-masterhost": abort qmaster registration due to communication errors
08/26/2012 09:28:38| main|my-exechost|E|commlib error: access denied (client IP resolved to host name "my-gateway". This is not identical to clients host name "my-exechost")
08/26/2012 09:31:10| main|my-exechost|E|commlib error: endpoint is not unique error (endpoint "my-masterhost/qmaster/1" is already connected)

After that, no job can be successfully scheduled to this host. The master logs the same problem:

08/26/2012 09:28:21|listen|my-masterhost|E|commlib error: local host name error (IP based host name resolving "my-gateway" doesn't match client host name from connect message "my-exechost")

A "fix", or rather a workaround, is to restart sge_execd.

Since the host name of the exechost is being confused with the host name of my network gateway, the cause appears to be some weird DNS setup. Both my-exechost and my-masterhost are in the same network and don't need the gateway to communicate (I also checked this with traceroute). The exechost uses the masterhost to resolve DNS queries, and the masterhost has the correct entries in its /etc/hosts, so lookups work fine on both hosts.

Now, in order to debug my problem further, I have a couple of short questions:

1) Does the error message indicate that the masterhost fails to resolve the (real) IP of the exechost correctly?
2) Or does the DNS lookup work fine, and the IP the master sees is really that of the gateway host? If so, what might be the reason for that?
3) The client doesn't do anything wrong, does it?

Thanks for looking into my problem!
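
For reference, here is a minimal sketch of the comparison that the error messages above seem to describe: reverse-resolve the IP a connection arrives from and compare the result with the host name the client announces. This is only an illustration of that idea, not the actual commlib code. The function names, the short-name comparison (ignoring the domain part), and the IP addresses in the usage note are assumptions made for this example.

```python
import socket


def short_name(host):
    """Lower-cased host name without its domain part (assumed comparison rule)."""
    return host.split(".", 1)[0].lower()


def check_peer(peer_ip, claimed_name, reverse_lookup=socket.gethostbyaddr):
    """Reverse-resolve peer_ip and compare it with the name the client claims.

    Returns (matches, resolved_name). resolved_name is None if the reverse
    lookup fails. reverse_lookup is injectable so the check can be tried
    without touching the real resolver.
    """
    try:
        # gethostbyaddr returns (hostname, aliaslist, ipaddrlist)
        resolved_name, _aliases, _addrs = reverse_lookup(peer_ip)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError
        return False, None
    return short_name(resolved_name) == short_name(claimed_name), resolved_name
```

Run on the masterhost with the exechost's real IP, e.g. `check_peer("192.0.2.10", "my-exechost")` (placeholder address), this would show whether the reverse lookup itself returns "my-gateway" (question 1) or whether the lookup is fine and the connection simply arrives from the gateway's address (question 2).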
