Hello,


I have set up a two-node Hadoop cluster on Ubuntu 12.04 running streaming jobs 
with Hadoop 2.2.0. I am having problems with running tasks on a NM which is on 
a different host than the RM, and I believe that this is happening because the 
NM host's dfs.client.local.interfaces property is not having any effect.



I have two hosts set up as follows:

Host A (1.2.3.4):

NameNode

DataNode

ResourceManager

Job History Server



Host B (5.6.7.8):

DataNode

NodeManager



On each host, hdfs-site.xml was edited to change dfs.client.local.interfaces 
from an interface name ("eth0") to the IPv4 address representing that host's 
interface ("1.2.3.4" or "5.6.7.8"). This is to prevent the HDFS client from 
randomly binding to the IPv6 side of the interface (it randomly swaps between 
the IP4 and IP6 addresses, due to the random bind IP selection in the DFS 
client) which was causing other problems.



However, I am observing that the Yarn container on the NM appears to inherit 
the property from the copy of hdfs-site.xml on the RM, rather than reading it 
from the local configuration file. In other words, setting the 
dfs.client.local.interfaces property in Host A's configuration file causes the 
Yarn containers on Host B to use same value of the property. This causes the 
map task to fail, as the container cannot establish a TCP connection to the 
HDFS. However, on Host B, other commands that access the HDFS (such as "hadoop 
fs") do work, as they respect the local value of the property.



To illustrate with an example, I start a streaming job from the command line on 
Host A:



hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar 
-input hdfs://hosta/linesin/ -output hdfs://hosta/linesout -mapper 
/home/hadoop/toRecords.pl -reducer /bin/cat



The NodeManager on Host B notes that there was an error starting the container:



13/12/14 19:38:45 WARN nodemanager.DefaultContainerExecutor: Exception from 
container-launch with container ID: container_1387067177654_0002_01_000001 and 
exit code: 1

org.apache.hadoop.util.Shell$ExitCodeException:

        at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)

        at org.apache.hadoop.util.Shell.run(Shell.java:379)

        at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)

        at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)

        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)

        at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)

        at java.util.concurrent.FutureTask.run(Unknown Source)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

        at java.lang.Thread.run(Unknown Source)



On Host B, I open 
userlogs/application_1387067177654_0002/container_1387067177654_0002_01_000001/syslog
 and find the following messages (note the DEBUG-level messages which I 
manually enabled for the DFS client):



2013-12-14 19:38:32,439 DEBUG [main] org.apache.hadoop.hdfs.DFSClient: Using 
local interfaces [1.2.3.4] with addresses [/1.2.3.4:0]

<cut>

2013-12-14 19:38:33,085 DEBUG [main] org.apache.hadoop.hdfs.DFSClient: newInfo 
= LocatedBlocks{

  fileLength=537

  underConstruction=false

  blocks=[LocatedBlock{BP-1911846690-1.2.3.4-1386999495143:blk_1073742317_1493; 
getBlockSize()=537; corrupt=false; offset=0; locs=[5.6.7.8:50010, 
1.2.3.4:50010]}]

  
lastLocatedBlock=LocatedBlock{BP-1911846690-1.2.3.4-1386999495143:blk_1073742317_1493;
 getBlockSize()=537; corrupt=false; offset=0; locs=[5.6.7.8:50010, 
1.2.3.4:50010]}

  isLastBlockComplete=true}

2013-12-14 19:38:33,088 DEBUG [main] org.apache.hadoop.hdfs.DFSClient: 
Connecting to datanode 5.6.7.8:50010

2013-12-14 19:38:33,090 DEBUG [main] org.apache.hadoop.hdfs.DFSClient: Using 
local interface /1.2.3.4:0

2013-12-14 19:38:33,095 WARN [main] org.apache.hadoop.hdfs.DFSClient: Failed to 
connect to /5.6.7.8:50010 for block, add to deadNodes and continue. 
java.net.BindException: Cannot assign requested address



Note the failure to bind to 1.2.3.4, as the IP for Node B's local interface is 
actually 5.6.7.8.



Note that when running other HDFS commands on Host B, Host B's setting for 
dfs.client.local.interfaces is respected. On host B:



hadoop@nodeb:~$ hadoop fs -ls hdfs://hosta/

13/12/14 19:45:10 DEBUG hdfs.DFSClient: Using local interfaces [5.6.7.8] with 
addresses [/5.6.7.8:0]

Found 3 items

drwxr-xr-x   - hadoop supergroup          0 2013-12-14 00:40 
hdfs://hosta/linesin

drwxr-xr-x   - hadoop supergroup          0 2013-12-14 02:01 hdfs://hosta/system

drwx------   - hadoop supergroup          0 2013-12-14 10:31 hdfs://hosta/tmp



If I change dfs.client.local.interfaces on Host A to eth0 (without touching the 
setting on Host B), the syslog mentioned above instead shows the following:



2013-12-14 22:32:19,686 DEBUG [main] org.apache.hadoop.hdfs.DFSClient: Using 
local interfaces [eth0] with addresses [/<some IP6 address>:0,/5.6.7.8:0]



The job then successfully completes sometimes, but both Host A and Host B will 
then randomly alternate between the IP4 and IP6 side of their eth0 interfaces, 
which causes other issues. In other words, changing the 
dfs.client.local.interfaces setting on Host A to a named adapter caused the 
Yarn container on Host B to bind to an identically named adapter.

Any ideas on how I can reconfigure the cluster so every container will try to 
bind to its own interface? I successfully worked around this issue by doing a 
custom build of HDFS which hardcodes my IP address in the DFSClient, but I am 
looking for a better long-term solution.



Thanks,

Jeff

Reply via email to