"P. Golik" <[email protected]> writes:

> 08/26/2012 09:28:38|  main|my-exechost|W|can't register at qmaster
> "my-masterhost": abort qmaster registration due to communication errors
> 08/26/2012 09:28:38|  main|my-exechost|E|commlib error: access denied
> (client IP resolved to host name "my-gateway". This is not identical to
> clients host name "my-exechost")
> 08/26/2012 09:31:10|  main|my-exechost|E|commlib error: endpoint is not
> unique error (endpoint "my-masterhost/qmaster/1" is already connected)
>
> after that no job can be successfully scheduled to this host. The master
> logs the same:
>
> 08/26/2012 09:28:21|listen|my-masterhost|E|commlib error: local host name
> error (IP based host name resolving "my-gateway" doesn't match client host
> name from connect message "my-exechost")
>
> A "fix", or rather workaround, is to restart sge_execd.
>
> Since the host name of the exechost is getting confused with the host name
> of my network gateway, the reason appears to be some weird DNS setup. Both
> my-exechost and my-masterhost are in the same network and don't need the
> gateway to communicate (also checked with traceroute). The exechost points
> to the masterhost to resolve DNS queries, and the masterhost has the
> correct entries in its /etc/hosts, so that lookups are working fine on both
> hosts.

I don't understand what's going on, but it sounds as if you just need to
fix the DNS.  Our compute nodes get DNS from dnsmasq on the head node,
and that reads /etc/hosts, but other setups might not; is the master
actually serving the correct data to the exechost?  (The names above
look odd -- have they been changed for posting?)

The way to check how SGE resolves names and addresses is with the
utilities described in hostnameutils(1), but they're probably only
necessary if you're using host_aliases.

> Now, in order to further debug my problem, I have a couple of short
> questions:
>
> 1) Does the error message indicate that the problem is the masterhost
> failing to resolve the (real) IP of the exechost correctly?
>
> 2) Or does the DNS lookup work fine, but the IP the master is getting is
> really the one of the gateway host? If so, what might be the reason for
> that?
>
> 3) The client doesn't do anything wrong, does it?

I'm not sure where the problem is.  You should probably run
gethostname/gethostbyaddr/gethostbyname on the master and the compute
node to check, but if the problem is intermittent that may not show
anything.  If you have multiple network interfaces, you may need host
aliases (see host_aliases(5)), but I suppose that would have been
needed from the start.
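
For a quick cross-check outside SGE, something like the following Python
sketch compares the forward and reverse lookups for a name.  It only
asks the system resolver, so it is not a substitute for the SGE
utilities, which also apply host_aliases.  The function name
check_round_trip is made up for illustration, and comparing only the
short (unqualified) names is a simplifying assumption.

```python
import socket


def check_round_trip(name,
                     forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Resolve name -> IP -> name and say whether the names agree.

    Returns (ip, primary_name, matches).  Only short names are
    compared (assumption: domain parts are not significant here).
    The resolver functions are parameters so they can be swapped out.
    """
    ip = forward(name)
    primary, aliases, _addresses = reverse(ip)

    def short(host):
        # Drop any domain part and ignore case.
        return host.split(".")[0].lower()

    candidates = {short(h) for h in [primary] + list(aliases)}
    return ip, primary, short(name) in candidates


def report(name):
    """Print a one-line summary for name using the system resolver."""
    try:
        ip, back, ok = check_round_trip(name)
    except OSError as exc:  # socket.gaierror and socket.herror are OSErrors
        print(f"{name}: lookup failed: {exc}")
        return
    status = "consistent" if ok else "MISMATCH"
    print(f"{name} -> {ip} -> {back}: {status}")
```

Calling report("my-exechost") on both the master and the compute node
should print "consistent" on each; a MISMATCH line on the master that
names the gateway would correspond to the commlib error quoted above.
As with the SGE utilities, a single run may miss an intermittent fault.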

Perhaps someone else has a better idea exactly what's wrong.

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
