[jira] [Updated] (YARN-1226) ipv4 and ipv6 affect job data locality

2013-09-23, Kaibo Zhou (JIRA)

 [ https://issues.apache.org/jira/browse/YARN-1226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kaibo Zhou updated YARN-1226:
-----------------------------

Description: 
When I run a MapReduce job that uses TableInputFormat to scan an HBase table on 
a YARN cluster with 140+ nodes, I consistently get very low data locality, 
around 0~10%. 

The scheduler is the Capacity Scheduler. HBase and Hadoop are co-deployed in the 
cluster, with the NodeManager, DataNode, and HRegionServer running on the same node.

The reason for the low data locality is that most machines in the cluster use 
IPv6 and only a few use IPv4. The NodeManager calls 
InetAddress.getLocalHost().getHostName() to get its host name, but the result 
of this call depends on whether the JVM is on the IPv4 or IPv6 stack; see 
[InetAddress.getLocalHost().getHostName() returns 
FQDN|http://bugs.sun.com/view_bug.do?bug_id=7166687]. 

On machines with IPv4, the NodeManager gets the host name 
search042097.sqa.cm4.site.net, but on machines with IPv6 it gets 
search042097.sqa.cm4. If run with IPv6 disabled (-Djava.net.preferIPv4Stack=true), 
it again returns search042097.sqa.cm4.site.net.
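
A minimal sketch to observe this on a given machine (nothing here beyond the 
JDK; run it once as-is and once with -Djava.net.preferIPv4Stack=true to compare):

{code:java}
import java.net.InetAddress;
import java.net.UnknownHostException;

// Prints the host name the JVM resolves for the local machine.
// On a dual-stack host, getHostName() may return the short name or the
// FQDN depending on which address family getLocalHost() resolves first.
public class LocalHostNameCheck {
    public static void main(String[] args) throws UnknownHostException {
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("getHostName():          " + local.getHostName());
        System.out.println("getCanonicalHostName(): " + local.getCanonicalHostName());
    }
}
{code}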

For a MapReduce job that scans an HBase table, the InputSplit contains node 
locations as [FQDNs|http://en.wikipedia.org/wiki/FQDN], e.g. 
search042097.sqa.cm4.site.net. This is because in HBase the RegionServers' 
host names are assigned by the HMaster: the HMaster communicates with each 
RegionServer and obtains its host name over Java NIO, via 
clientChannel.socket().getInetAddress().getHostName().
See also the region server's startup log:

13:06:21,200 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Master 
passed us hostname to use. Was=search042024.sqa.cm4, 
Now=search042024.sqa.cm4.site.net
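
A minimal sketch of that server-side lookup (illustrative only, with a 
hypothetical port; this is not HBase's actual code): the accepting side 
reverse-resolves the peer's address, which typically yields the FQDN regardless 
of what the peer thinks its own host name is.

{code:java}
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Illustrative only: how an accepting server can learn a client's host
// name the way the issue describes for HMaster. Reverse DNS on the peer
// address usually returns the fully qualified name.
public class PeerHostNameSketch {
    public static void main(String[] args) throws Exception {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(16000)); // hypothetical port
            try (SocketChannel clientChannel = server.accept()) {
                // The same call quoted above: reverse-resolve the peer address.
                String peerHost = clientChannel.socket().getInetAddress().getHostName();
                System.out.println("peer host name: " + peerHost);
            }
        }
    }
}
{code}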

As you can see, most machines in the YARN cluster (those on IPv6) report the 
short host name, while HBase always records the full host name, so the hosts 
can never match (see RMContainerAllocator::assignToMap), which leads to poor 
locality. A sketch of the failed match follows.
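
Node-local assignment is essentially an exact string lookup of the container's 
host against the hosts requested by pending map tasks. A minimal sketch of why 
the mismatch defeats it (the map and names here are illustrative, not the 
actual RMContainerAllocator fields):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustrative only: node-local matching is keyed by host name string, so
// "search042097.sqa.cm4" and "search042097.sqa.cm4.site.net" never match.
public class HostMatchSketch {
    public static void main(String[] args) {
        // Pending map tasks keyed by the InputSplit hosts (FQDNs from HBase).
        Map<String, String> mapsHostMapping = new HashMap<>();
        mapsHostMapping.put("search042097.sqa.cm4.site.net", "attempt_m_000000");

        // Host reported by the allocated container's node (short name on IPv6).
        String containerHost = "search042097.sqa.cm4";

        String task = mapsHostMapping.get(containerHost);
        System.out.println(task == null
                ? "no node-local match -> fall back to rack or off-switch"
                : "node-local match: " + task);
    }
}
{code}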

After using java.net.preferIPv4Stack to force IPv4 in YARN, I get 70+% data 
locality in the cluster.
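
For reference, one way to apply that workaround (a sketch; the exact variable 
name in yarn-env.sh may differ across Hadoop versions) is:

{code}
# yarn-env.sh: keep the NodeManager JVM on the IPv4 stack so that
# InetAddress.getLocalHost().getHostName() returns the FQDN.
export YARN_NODEMANAGER_OPTS="$YARN_NODEMANAGER_OPTS -Djava.net.preferIPv4Stack=true"
{code}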

Thanks,
Kaibo



 ipv4 and ipv6 affect job data locality
 ---------------------------------------

                 Key: YARN-1226
                 URL: https://issues.apache.org/jira/browse/YARN-1226
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: capacityscheduler
    Affects Versions: 0.23.3, 2.0.0-alpha, 2.1.0-beta
            Reporter: Kaibo Zhou
            Priority: Minor