Hi!

I've been using SGE 8.0.0c for about 10 months now. Recently I have seen
execution hosts fail at random. The execd is still running, but jobs
scheduled to the affected host hang forever in state "t", or in "dt" once
I try to delete them. The execd log contains:

08/26/2012 09:28:38|  main|my-exechost|W|can't register at qmaster
"my-masterhost": abort qmaster registration due to communication errors
08/26/2012 09:28:38|  main|my-exechost|E|commlib error: access denied
(client IP resolved to host name "my-gateway". This is not identical to
clients host name "my-exechost")
08/26/2012 09:31:10|  main|my-exechost|E|commlib error: endpoint is not
unique error (endpoint "my-masterhost/qmaster/1" is already connected)

After that, no job can be scheduled successfully to this host. The master
logs the same problem:

08/26/2012 09:28:21|listen|my-masterhost|E|commlib error: local host name
error (IP based host name resolving "my-gateway" doesn't match client host
name from connect message "my-exechost")
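
To see how often this happens and which names get mixed up, the mismatch
messages can be pulled out of the qmaster log. This is only a minimal
sketch in Python: the function name find_mismatches is made up for
illustration, and the pattern is taken from the single message quoted
above, so other SGE versions may word it differently.

```python
import re

# Pattern derived from the one qmaster message quoted in this post.
# Assumption: other versions may phrase the message differently.
MISMATCH = re.compile(
    r'IP based host name resolving "(?P<resolved>[^"]+)" '
    r"doesn't match client host name from connect message "
    r'"(?P<claimed>[^"]+)"'
)


def find_mismatches(log_text):
    """Return (resolved_name, claimed_name) pairs found in the log text."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [(m["resolved"], m["claimed"]) for m in MISMATCH.finditer(flat)]
```

Feeding it the contents of the qmaster messages file (wherever that is
spooled on your installation) gives one pair per logged mismatch.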

A "fix", or rather workaround, is to restart sge_execd.

Since the host name of the exechost is being confused with the host name
of my network gateway, the cause appears to be something odd in my DNS
setup. Both my-exechost and my-masterhost are in the same network and
don't need the gateway to communicate (I also checked this with
traceroute). The exechost uses the masterhost to resolve DNS queries, and
the masterhost has the correct entries in its /etc/hosts, so lookups work
fine on both hosts.
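
One way to double-check that claim is to resolve a name to an IP and then
resolve that IP back to a name, on both hosts, and compare the results.
Below is a minimal sketch in Python using only the standard socket module.
The function name check_host_consistency is hypothetical, and comparing
only the short (unqualified) primary name, ignoring aliases, is an
assumption of this sketch, not necessarily what commlib does.

```python
import socket


def check_host_consistency(name,
                           forward=socket.gethostbyname,
                           reverse=socket.gethostbyaddr):
    """Resolve name -> IP -> name and report whether the round trip matches.

    The resolver functions are parameters so they can be replaced in tests.
    """
    ip = forward(name)
    # gethostbyaddr returns (primary_name, alias_list, ip_list)
    resolved_name, _aliases, _ips = reverse(ip)

    def short(host):
        # Assumption: compare unqualified names, case-insensitively.
        return host.split(".")[0].lower()

    return {
        "name": name,
        "ip": ip,
        "reverse": resolved_name,
        "consistent": short(resolved_name) == short(name),
    }
```

Calling check_host_consistency("my-exechost") on the masterhost and on the
exechost should give the same IP and a consistent round trip. If the
reverse lookup on the master returns "my-gateway", that would point at the
name resolution on the master rather than at the client.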

Now, in order to further debug my problem, I have a couple of short
questions:

1) Does the error message indicate that the problem is the masterhost
failing to resolve the (real) IP of the exechost correctly?

2) Or does the DNS lookup work fine, and the IP the master sees for the
connection really is that of the gateway host? If so, what might cause
that?

3) The client doesn't do anything wrong, does it?


Thanks for looking into my problem!
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
