[Hadoop Wiki] Update of "UnknownHost" by SteveLoughran

2016-04-06, Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "UnknownHost" page has been changed by SteveLoughran:
https://wiki.apache.org/hadoop/UnknownHost?action=diff&rev1=9&rev2=10

Comment:
mention unknown localhost

   1. The hostname in the configuration files (such as {{{core-site.xml}}}) is misspelled.
   1. The hostname in the configuration files (such as {{{core-site.xml}}}) is 
confused with the hostname of another service. For example, you are using the 
hostname of the YARN Resource Manager in the {{{fs.defaultFS}}} configuration 
option to define the namenode.
   1. A worker node thinks it has a given name which it reports to the NameNode 
and JobTracker, but that isn't the name that the network team gave it, so it 
isn't resolvable.
+  1. If it is happening in service startup, it means the hostname of that service (HDFS, YARN, etc.) cannot be found in {{{/etc/hosts}}}; the service will fail to start as it cannot determine which network card/address to use. (A quick resolver check is sketched after this list.)
   1. The calling machine is on a different subnet from the target machine, and 
short names are being used instead of fully qualified domain names (FQDNs).
   1. You are running in a cloud infrastructure and the destination machine is 
no longer there. It may have been deleted from the DNS records, or, due to some 
race condition, something is trying to talk to a host that hasn't been created 
yet.
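
To see what the JVM's resolver makes of the local hostname -the service-startup case above- a minimal standalone check might look like the following. This is an illustrative sketch (plain JDK, no Hadoop dependencies), not part of Hadoop itself:

{{{
import java.net.InetAddress;
import java.net.UnknownHostException;

// Minimal sketch: print what the local JVM resolver reports for this host.
// If getLocalHost() throws UnknownHostException, the machine's own hostname
// is missing from /etc/hosts (and DNS), which is the service-startup case above.
public class LocalHostCheck {
    public static void main(String[] args) {
        try {
            InetAddress local = InetAddress.getLocalHost();
            System.out.println("local hostname: " + local.getHostName());
            System.out.println("local address:  " + local.getHostAddress());
        } catch (UnknownHostException e) {
            System.err.println("cannot resolve the local hostname: " + e.getMessage());
        }
    }
}
}}}

Run it on the machine whose service fails to start; if it cannot resolve the local hostname, neither can HDFS or YARN.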
  


[Hadoop Wiki] Update of "UnknownHost" by SteveLoughran

2016-02-19, Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "UnknownHost" page has been changed by SteveLoughran:
https://wiki.apache.org/hadoop/UnknownHost?action=diff&rev1=8&rev2=9

Comment:
add reference to cloud infra

   1. The hostname in the configuration files (such as {{{core-site.xml}}}) is 
confused with the hostname of another service. For example, you are using the 
hostname of the YARN Resource Manager in the {{{fs.defaultFS}}} configuration 
option to define the namenode.
   1. A worker node thinks it has a given name which it reports to the NameNode 
and JobTracker, but that isn't the name that the network team gave it, so it 
isn't resolvable.
   1. The calling machine is on a different subnet from the target machine, and 
short names are being used instead of fully qualified domain names (FQDNs).
+  1. You are running in a cloud infrastructure and the destination machine is 
no longer there. It may have been deleted from the DNS records, or, due to some 
race condition, something is trying to talk to a host that hasn't been created 
yet.
+ 
  
  Less likely causes:
   1. The client's network card is playing up (network timeouts, etc), the 
network is overloaded, or even the switch is dropping DNS packets.


[Hadoop Wiki] Update of "UnknownHost" by SteveLoughran

2015-10-26, Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "UnknownHost" page has been changed by SteveLoughran:
https://wiki.apache.org/hadoop/UnknownHost?action=diff&rev1=4&rev2=5

Comment:
add section on HDFS HA, as it's clearly recurrent

  
  Unless the root cause has been identified, the problem may return.
  
+ 
+ == Unknown Host Exception in HA HDFS ==
+ 
+ This exception surfaces when setting up HA HDFS.
+ 
+ 
[[https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Configuration_details|As documented]], HA HDFS requires you to list the namenode URLs of a cluster in the property {{{dfs.ha.namenodes.mycluster}}}, where "mycluster" is the name of your HA cluster.
+ 
+ {{{
+ <property>
+   <name>dfs.ha.namenodes.mycluster</name>
+   <value>nn1,nn2</value>
+ </property>
+ }}}
+ 
+ Then for the filesystem URL, you use the name of the cluster:
+ 
+ {{{
+ <property>
+   <name>fs.defaultFS</name>
+   <value>hdfs://mycluster</value>
+ </property>
+ }}}
+ 
+ If you get an Unknown Host Exception, and the host is the name of your HA 
cluster, here {{{mycluster}}}, then it means that the HDFS client hasn't 
recognized that this is an HA cluster, and instead tried to connect to it 
directly on the default HDFS port.
+ 
+ '''The dfs.ha.namenodes.mycluster property is unset or the cluster name is 
inconsistent across the properties'''. Check your config and try again.
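
To see this failure mode from the client side, here is a minimal sketch -assuming the Hadoop client libraries are on the classpath, and that {{{dfs.ha.namenodes.mycluster}}} and the per-namenode addresses have deliberately been left unset:

{{{
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Minimal sketch: with only fs.defaultFS set, the client treats "mycluster"
// as a literal hostname. Without the matching dfs.ha.namenodes.mycluster
// (and per-namenode address) properties, the connection attempt fails with
// an UnknownHostException for "mycluster" (possibly wrapped) instead of
// performing HA failover between nn1 and nn2.
public class HaClientCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster");
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        System.out.println("connected to: " + fs.getUri());
    }
}
}}}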
+ 
+ 
+ 
  Finally, because this is a configuration problem, filing bug reports is not going to help. They will only be closed as [[http://wiki.apache.org/hadoop/InvalidJiraIssues|Invalid Issues]].
  


[Hadoop Wiki] Update of "UnknownHost" by SteveLoughran

2014-01-03, Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "UnknownHost" page has been changed by SteveLoughran:
https://wiki.apache.org/hadoop/UnknownHost?action=diff&rev1=2&rev2=3

Comment:
move typos and wrong hostname for role up to first causes

  
  
  Some possible causes (not an exclusive list):
+  * The hostname in the configuration files (such as {{{hadoop-site.xml}}}) is 
misspelled.
+  * The hostname in the configuration files (such as {{{hadoop-site.xml}}}) is 
confused with the hostname of another service. For example, you are using the 
hostname of the YARN Resource Manager in the {{{fs.defaultFS}}} configuration 
option to define the namenode.
   * The site's DNS server does not have an entry for the node. Test: do an 
{{{nslookup hostname}}} from the client machine.
   * The calling machine's host table {{{/etc/hosts}}} lacks an entry for the 
host, and DNS isn't helping out
-  * There's some error in the configuration files and the hostname is actually 
wrong.
-  * A worker node thinks it has a given name -which it reports to the NameNode 
and JobTracker, but that isn't the name that the network team expect, so it 
isn't resolvable.
+  * A worker node thinks it has a given name -which it reports to the NameNode 
and JobTracker, but that isn't the name that the network team gave it, so it 
isn't resolvable.
   * The calling machine is on a different subnet from the target machine, and 
short names are being used instead of fully qualified domain names (FQDNs).
   * The client's network card is playing up (network timeouts, etc), the 
network is overloaded, or even the switch is dropping DNS packets.
   * The host's IP address has changed but a long-lived JVM is caching the old value. This is a known problem with JVMs (search for "java negative DNS caching" for the details and solutions). The quick solution: restart the JVMs. (A sketch for tuning the JVM's DNS cache instead follows the tactics list below.)
@@ -25, +26 @@

   1. As well as {{{nslookup}}}, the {{{dig}}} command is invaluable for 
tracking down DNS problems, though it does assume you understand DNS records. 
Now is a good time to learn.
   1. Restart the JVMs to see if that makes it go away.
   1. Restart the servers to see if that makes it go away.
+  1. Reboot the network switches.
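
If restarting every JVM is impractical, the JVM's address caches can be loosened instead. The following is a minimal sketch using the standard JDK networking properties ({{{networkaddress.cache.ttl}}} and {{{networkaddress.cache.negative.ttl}}}); they must be set before the process performs its first lookup, and can equally be set in the JRE's {{{java.security}}} file:

{{{
import java.net.InetAddress;
import java.security.Security;

// Minimal sketch: bound the JVM's DNS caches so that changed or previously
// unresolvable addresses are re-queried rather than cached forever.
public class DnsCacheTuning {
    public static void main(String[] args) throws Exception {
        Security.setProperty("networkaddress.cache.ttl", "60");          // cache successful lookups for 60s
        Security.setProperty("networkaddress.cache.negative.ttl", "10"); // cache failures for only 10s
        InetAddress addr = InetAddress.getByName(args[0]);
        System.out.println(args[0] + " -> " + addr.getHostAddress());
    }
}
}}}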
  
  Remember, unless the root cause has been identified, the problem may return.
  


[Hadoop Wiki] Update of "UnknownHost" by SteveLoughran

2011-06-27, Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "UnknownHost" page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/UnknownHost

Comment:
how to troubleshoot unknown host exceptions

New page:
= Unknown Host =

You get an Unknown Host Error -often wrapped in a Java {{{IOException}}}- when one machine on the network cannot determine the IP address of a host that it is trying to connect to by way of its hostname. This can happen during file upload (in which case the client machine has the hostname problem), or inside the Hadoop cluster.
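
The resolution step that fails can be reproduced outside Hadoop. Here is a minimal, illustrative sketch (plain JDK, no Hadoop dependencies) doing the same lookup the Hadoop client attempts:

{{{
import java.net.InetAddress;
import java.net.UnknownHostException;

// Minimal sketch: resolve a hostname the way any JVM -including Hadoop's-
// does. A failure here is the same UnknownHostException that Hadoop wraps.
public class ResolveCheck {
    public static void main(String[] args) {
        try {
            InetAddress addr = InetAddress.getByName(args[0]);
            System.out.println(args[0] + " -> " + addr.getHostAddress());
        } catch (UnknownHostException e) {
            System.err.println("unresolvable: " + args[0]);
        }
    }
}
}}}

Run it with the hostname from the error message, on the machine that reported the error.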


Some possible causes (not an exclusive list):
 * The site's DNS server does not have an entry for the node. Test: do an 
{{{nslookup hostname}}} from the client machine.
 * The calling machine's host table {{{/etc/hosts}}} lacks an entry for the host, and DNS isn't helping out
 * There's some error in the configuration files and the hostname is actually 
wrong.
 * A worker node thinks it has a given name -which it reports to the NameNode and JobTracker, but that isn't the name that the network team expects, so it isn't resolvable.
 * The calling machine is on a different subnet from the target machine, and 
short names are being used instead of fully qualified domain names (FQDNs).
 * The client's network card is playing up (network timeouts, etc), the network 
is overloaded, or even the switch is dropping DNS packets.
 * The host's IP address has changed but a long-lived JVM is caching the old value. This is a known problem with JVMs (search for "java negative DNS caching" for the details and solutions). The quick solution: restart the JVMs.
 * The site's DNS server is overloaded. This can happen in large clusters. 
Either move to host table entries or use caching DNS servers in every worker 
node.
 * Your ARP cache is corrupt, either accidentally or maliciously. If you don't 
know what that means, you won't be in a position to verify this is the problem 
-or fix it.

These are all network configuration/router issues. As it is your network, only you can find out and track down the problem. That said, any tooling to help Hadoop track down such problems in a cluster would be welcome, as would extra diagnostics. If you have to extend Hadoop to track down these issues -submit your patches!

Some tactics to help solve the problem:
 1. Look for configuration problems first (Hadoop XML files, hostnames, host tables), as these are easiest to fix and quite common. (A sketch of checking the configured filesystem URI follows this list.)
 1. Try and identify which client machine is playing up. If it is 
out-of-cluster, try the FQDN instead, and consider that it may not have access 
to the worker node.
 1. If the client that does not work is one of the machines in the cluster, SSH 
to that machine and make sure it can resolve the hostname.
 1. As well as {{{nslookup}}}, the {{{dig}}} command is invaluable for tracking 
down DNS problems, though it does assume you understand DNS records. Now is a 
good time to learn.
 1. Restart the JVMs to see if that makes it go away.
 1. Restart the servers to see if that makes it go away.
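
For the first tactic, here is a small sketch that pulls the filesystem URI out of whatever configuration is on the classpath and checks that its host resolves. It assumes the Hadoop client libraries and configuration directory are on the classpath; the class name is illustrative, and the configuration key has varied across Hadoop versions ({{{fs.default.name}}} in older releases, {{{fs.defaultFS}}} in newer ones):

{{{
import java.net.InetAddress;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;

// Minimal sketch: read the default filesystem URI from the configuration on
// the classpath and resolve its host the way a client JVM would. This throws
// UnknownHostException if the configured hostname is misspelled or missing
// from DNS and /etc/hosts.
public class DefaultFsCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml etc. from the classpath
        String defaultFs = conf.get("fs.defaultFS", conf.get("fs.default.name"));
        URI uri = URI.create(defaultFs);
        System.out.println("configured filesystem: " + uri);
        InetAddress addr = InetAddress.getByName(uri.getHost());
        System.out.println(uri.getHost() + " -> " + addr.getHostAddress());
    }
}
}}}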

Remember, unless the root cause has been identified, the problem may return.