[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297873#comment-16297873
 ] 

Hudson commented on HBASE-15482:


FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4257 (See 
[https://builds.apache.org/job/HBase-Trunk_matrix/4257/])
HBASE-15482 Provide an option to skip calculating block locations for (tedyu: 
rev 5e7d16a3ceaeec5057474f9bae2d40d306f6dd8e)
* (edit) 
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormatTestBase.java
* (edit) 
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapred/TestTableSnapshotInputFormat.java
* (edit) 
hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableSnapshotInputFormat.java
* (edit) 
hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormatImpl.java


> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.0.0-beta-1
>
> Attachments: 15482.v3.txt, HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch, 
> HBASE-15482.master.003.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-19 Thread Jerry He (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297478#comment-16297478
 ] 

Jerry He commented on HBASE-15482:
--

+1 on the latest patch.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: 15482.v3.txt, HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch, 
> HBASE-15482.master.003.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-19 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297466#comment-16297466
 ] 

Ted Yu commented on HBASE-15482:


lgtm

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: 15482.v3.txt, HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch, 
> HBASE-15482.master.003.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-19 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297150#comment-16297150
 ] 

Hadoop QA commented on HBASE-15482:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
8s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
1s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
49s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
24s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
17s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  5m 
 0s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
14s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
 0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
53s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
19m 48s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.6.5 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 10m 
25s{color} | {color:green} hbase-mapreduce in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
 9s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 47m  7s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-15482 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12902892/HBASE-15482.master.003.patch
 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  shadedjars  
hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 6e068a59f070 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 
15:49:21 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 03e79b7994 |
| maven | version: Apache Maven 3.5.2 
(138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/10566/testReport/ |
| modules | C: hbase-mapreduce U: hbase-mapreduce |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/10566/console |
| Powered by | Apache Yetus 0.6.0   http://yetus.apache.org |


This message was automatically generated.



> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> 

[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-19 Thread Xiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297084#comment-16297084
 ] 

Xiang Li commented on HBASE-15482:
--

Regarding the UT:
For mapred, the number of splits generated(=10) is exactly the same as 
numRegions specified(=10), while only 8 of them has location not being an empty 
array.
For mapreduce, only 8 splits are generated when there is 10 regions.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: 15482.v3.txt, HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch, 
> HBASE-15482.master.003.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-19 Thread Xiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297060#comment-16297060
 ] 

Xiang Li commented on HBASE-15482:
--

Hi [~tedyu], [~jerryhe], thanks for your comments and guide! 
Patch 003 is uploaded to address the following changes mainly:
* Simple the logic in light of 15482.v3.txt. Besides, add the logic to
** Check if numTopsAtMost < 1 (which is invalid)
** Check if top is 1. When it is 1, return top host directly.
* Change the conf key string from 
{{hbase.TableSnapshotInputFormat.locality.enable}} into 
{{hbase.TableSnapshotInputFormat.locality.enabled}}, by using "enabled" instead 
of "enable", as I see most of the conf key strings are using "enabled"

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: 15482.v3.txt, HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch, 
> HBASE-15482.master.003.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-11 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287161#comment-16287161
 ] 

Ted Yu commented on HBASE-15482:


With v3, the test needs some adjustment. Otherwise you would see:
{code}
java.lang.AssertionError: expected:<[h2, h3, h4, h1]> but was:<[h2, h3, h4]>
at 
org.apache.hadoop.hbase.mapreduce.TestTableSnapshotInputFormat.testGetBestLocations(TestTableSnapshotInputFormat.java:138)
{code}

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: 15482.v3.txt, HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-11 Thread Xiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287092#comment-16287092
 ] 

Xiang Li commented on HBASE-15482:
--

[~jerryhe], thanks very much for the comments. Got your idea.
Under the condition of {{else { // hostAndWeights.length >= 2 && numTopsAtMost 
>= 2}}, it could break out when numTopsAtMost is met.
{code}
List locations = new ArrayList<>(Math.min(numTopsAtMost, 
hostAndWeights.length));
...
for (int i = 1; i < hostAndWeights.length; i++) {
}
{code}
The length of locations is the min of numTopsAtMost and hostAndWeights.length, 
and if numTopsAtMost is less than hostAndWeights.length, the loop will run 
until numTopsAtMost is met.

I agree that those logic added is hard to read. The original code hardcodes to 
3 as numTopsAtMost and given the comment that it is not very likely to get more 
than 3 hosts with at least 80%  of best locality, I also feel it is probably 
unnecessary to make numTopsAtMost be a variable (could be specified) and add 
those logic.

I am trying to reach [~ndimiduk] to see if he could have more comments on the 
change or could help to explain more his comment introduced by HBASE-11137.

[~ted_yu], what about your opinion?





> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-10 Thread Jerry He (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285528#comment-16285528
 ] 

Jerry He commented on HBASE-15482:
--

Hi, [~water]

In the 002 patch, you added 'numTopsAtMost' in getBestLocations.  You will need 
another 'break' in the loop? Like:
If numTopsAtMost is met, then break out.

But again, the new code with this 'numTopsAtMost' is probably unnecessary.  The 
comment for the method getBestLocations has explained that it is not very 
likely you will get more than 3 hosts with at least 80% 
(hbase.tablesnapshotinputformat.locality.cutoff.multiplier) as much block 
locality as the top host with the best locality.  So you will break out early 
anyway with the filterWeight check. 
Your first patch's logic is good enough.
The added comment is good.
{code}
// As hostAndWeights is in descending order,
// we could break the loop as long as we meet a weight which is less than 
filterWeight
{code}

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-10 Thread Xiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285456#comment-16285456
 ] 

Xiang Li commented on HBASE-15482:
--

I will upload a new patch to make some changes on UT, to address the following 
comment by Jerry
{quote}
{code}
if (careBlockLocality) {
  Assert.assertTrue(split.getLocations() != null && split.getLocations().length 
!= 0);
} else {
  Assert.assertTrue(split.getLocations() != null && split.getLocations().length 
== 0);
}
{code}
This is ok too. The first test is an existing test, and it has not failed 
previously.
{quote}


> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-09 Thread Jerry He (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285062#comment-16285062
 ] 

Jerry He commented on HBASE-15482:
--

The patch looks good!
I just think the first patch 000 is cleaner.  But, as Ted suggested, change 
hbase.TableSnapshotInputFormat.locality to 
hbase.TableSnapshotInputFormat.locality.enable.  (Change the name 
SNAPSHOT_INPUTFORMAT_CARE_BLOCK_LOCALITY_KEY too).  The other changes look 
unnecessary except making it more complicated.
{code}
if (careBlockLocality) {
  Assert.assertTrue(split.getLocations() != null && split.getLocations().length 
!= 0);
} else {
  Assert.assertTrue(split.getLocations() != null && split.getLocations().length 
== 0);
}
{code}
This is ok too.  The first test is an existing test, and it has not failed 
previously.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-09 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284998#comment-16284998
 ] 

Ted Yu commented on HBASE-15482:


[~davelatham]:
Can you take another look ?

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-08 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284499#comment-16284499
 ] 

Ted Yu commented on HBASE-15482:


lgtm

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-08 Thread Xiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284496#comment-16284496
 ] 

Xiang Li commented on HBASE-15482:
--

[~tedyu], would you please help to review patch 002 at your convenience?

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch, HBASE-15482.master.002.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283272#comment-16283272
 ] 

Hadoop QA commented on HBASE-15482:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
7s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
30s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
21s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
15s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
39s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
14s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
16s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
26s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
50m 42s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 2.7.4 or 3.0.0-alpha4. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
14s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  9m 
48s{color} | {color:green} hbase-mapreduce in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
 9s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 76m  6s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-15482 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12901199/HBASE-15482.master.002.patch
 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  shadedjars  
hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux a7d9558a5feb 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 
15:49:21 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 5034411438 |
| maven | version: Apache Maven 3.5.2 
(138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/10301/testReport/ |
| modules | C: hbase-mapreduce U: hbase-mapreduce |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/10301/console |
| Powered by | Apache Yetus 0.6.0   http://yetus.apache.org |


This message was automatically generated.



> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
>   

[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283093#comment-16283093
 ] 

Hadoop QA commented on HBASE-15482:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
8s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
42s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
24s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
17s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
47s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
22s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
17s{color} | {color:red} hbase-mapreduce: The patch generated 3 new + 17 
unchanged - 0 fixed = 20 total (was 17) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
30s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
59m 35s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 2.7.4 or 3.0.0-alpha4. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
17s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 11m 
54s{color} | {color:green} hbase-mapreduce in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
11s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 87m 54s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-15482 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12901194/HBASE-15482.master.001.patch
 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  shadedjars  
hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 7a1faeb05d01 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 
15:49:21 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 5034411438 |
| maven | version: Apache Maven 3.5.2 
(138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HBASE-Build/10299/artifact/patchprocess/diff-checkstyle-hbase-mapreduce.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/10299/testReport/ |
| modules | C: hbase-mapreduce U: hbase-mapreduce |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/10299/console |
| Powered by | Apache Yetus 0.6.0   http://yetus.apache.org |


This message 

[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-07 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283087#comment-16283087
 ] 

Ted Yu commented on HBASE-15482:


You can choose the form which is intuitive to you. 

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-07 Thread Xiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283085#comment-16283085
 ] 

Xiang Li commented on HBASE-15482:
--

Hi [~tedyu], thanks very much for the review and comment. I got your idea.
I was thinking that if the current statements  are more easy to understand than 
you suggested. 
The current statements explicitly show that we have 3 conditions to handle:
# When hostAndWeights.length == 0
# When hostAndWeights.length == 1 || numTopsAtMost <= 1
# Others (hostAndWeights.length >= 2 && numTopsAtMost >= 2)

Please let me know your idea. Thanks!


> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-07 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283043#comment-16283043
 ] 

Ted Yu commented on HBASE-15482:


{code}
+  List locations = new ArrayList<>(Math.min(numTopsAtMost, 
hostAndWeights.length));
...
+  for (int i = 1; i < hostAndWeights.length; i++) {
{code}
Shouldn't the value of Math.min() be used as the upper bound above ?
{code}
+  return locations;
+} else { // hostAndWeights.length >= 2 && numTopsAtMost >= 2
{code}
nit: you can omit the 'else' keyword following the return in previous if block.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: HBASE-15482.master.000.patch, 
> HBASE-15482.master.001.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-07 Thread Xiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283035#comment-16283035
 ] 

Xiang Li commented on HBASE-15482:
--

[~tedyu], thanks very much for your comments!
patch 001 is updated to address your comments as well as the errors reported by 
checkstyle.
* "hbase.TableSnapshotInputFormat.locality" is changed into 
"hbase.TableSnapshotInputFormat.locality.enable".
* The truncation of locations is moved into getBestLocations().
* The errors reported by checkstyle are corrected.

Regarding {{moving the truncation of locations into getBestLocations()}}:
The code has different logic for different combinations of 
hostAndWeights.length and numTopsAtMost.
And there is a small behavior change on getBestLocations() when 
hostAndWeights.length is 0:
* Originally, it returns a empty list.
* After the change, it returns null. I think we do not need to allocate an 
empty list here, as the locations will be used to construct 
TableSnapshotInputFormatImpl.InputSplit and null will be checked as follow
{code:title=hbase/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormatImpl.java|borderStyle=solid}
public InputSplit(TableDescriptor htd, HRegionInfo regionInfo, List 
locations,
Scan scan, Path restoreDir) {
  this.htd = htd;
  this.regionInfo = regionInfo;
  if (locations == null || locations.isEmpty()) { // <--- here
this.locations = new String[0];
  } else {
this.locations = locations.toArray(new String[locations.size()]);
  }
  try {
this.scan = scan != null ? TableMapReduceUtil.convertScanToString(scan) 
: "";
  } catch (IOException e) {
LOG.warn("Failed to convert Scan to String", e);
  }

  this.restoreDir = restoreDir.toString();
}
{code}
And TableSnapshotInputFormatImpl is @InterfaceAudience.Private, there is no 
other calls of getBestLocations() in the whole HBase project except UTs. A UT 
is updated according to the change above.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: HBASE-15482.master.000.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-06 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281299#comment-16281299
 ] 

Ted Yu commented on HBASE-15482:


{code}
+  public static final String  SNAPSHOT_INPUTFORMAT_CARE_BLOCK_LOCALITY_KEY = 
"hbase.TableSnapshotInputFormat.locality";
{code}
>From the current key name, it sounds like locality measure (say percent). How 
>about naming it "hbase.TableSnapshotInputFormat.locality.enable" (or something 
>similar).
{code}
+  List hosts = getBestLocations(conf,
+  HRegion.computeHDFSBlocksDistribution(conf, htd, hri, tableDir));
+
+  // return at most top 3 hosts
+  int len = Math.min(3, hosts.size());
{code}
You can pass 3 to getBestLocations() so that no more than 3 hosts are returned.



> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
> Fix For: 2.1.0
>
> Attachments: HBASE-15482.master.000.patch
>
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281248#comment-16281248
 ] 

Hadoop QA commented on HBASE-15482:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
52s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 3 new or modified test 
files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
 2s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
24s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
20s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
52s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 
17s{color} | {color:red} hbase-mapreduce: The patch generated 15 new + 17 
unchanged - 0 fixed = 32 total (was 17) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
33s{color} | {color:green} patch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
54m  6s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 2.7.4 or 3.0.0-alpha4. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 11m 
38s{color} | {color:green} hbase-mapreduce in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
11s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 84m 20s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-15482 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12900899/HBASE-15482.master.000.patch
 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  shadedjars  
hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux eeaf111c9913 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 
12:48:20 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh
 |
| git revision | master / 4a2e8b852d |
| maven | version: Apache Maven 3.5.2 
(138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HBASE-Build/10276/artifact/patchprocess/diff-checkstyle-hbase-mapreduce.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HBASE-Build/10276/testReport/ |
| modules | C: hbase-mapreduce U: hbase-mapreduce |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/10276/console |
| Powered by | Apache Yetus 0.6.0   http://yetus.apache.org |


This 

[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-12-06 Thread Xiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16280518#comment-16280518
 ] 

Xiang Li commented on HBASE-15482:
--

[~davelatham], thanks very much for the comment and guide.
Uploaded the very first patch 000:
* The conf key is "hbase.TableSnapshotInputFormat.locality" with default to 
true, that is, always care the locality and calculate the block locations, 
unless it is set to false explicitly. When it is set to false, the logic 
containing getBestLocations() is skipped and new 
TableSnapshotInputFormatImpl.InputSplit with locations as null.
* The access modifier of both the conf key and the default value is set to 
public, so that they could be accessed by test classes of other packages.
* The UTs are embedded into existing test cases.
** Test case in the package of mapred covers the scenario that the conf key is 
not specifiy and default value of true is taken.
** Test cases in the package of mapreduce cover the scenarios that the conf key 
is explicitly set to true or false.

Hi [~davelatham], [~liyin], [~ted_yu], could you please help to review the 
patch at your most convenience?

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-11-07 Thread Dave Latham (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242203#comment-16242203
 ] 

Dave Latham commented on HBASE-15482:
-

No, that will only stop using the locations.  Need to prevent spending the time 
to calculate them in the first place.  See 
TableSnapshotInputFormatImpl.getSplits and getBestLocations.  (Unless that has 
changed in trunk)

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-11-07 Thread Xiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241965#comment-16241965
 ] 

Xiang Li commented on HBASE-15482:
--

Hi [~liyintang], [~davelatham]
I planned to provide an option (the key is like 
"hbase.TableSnapshotInputFormat.locality", default to true)
And in TableSnapshotInputFormat.TableSnapshotRegionSplit#getLocations()
{code}
if option is true // care locality
  return delegate.getLocations();
else 
  return new String[0];
{code}
Is it the right way to go?

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Assignee: Xiang Li
>Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-09-18 Thread Xiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171033#comment-16171033
 ] 

Xiang Li commented on HBASE-15482:
--

Thanks [~jerryhe]. Got it. 
Just un-assign this JIRA. Joined the new company last week, still learning 
everything, hard to manage my time. Should be back soon.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-09-12 Thread Jerry He (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164014#comment-16164014
 ] 

Jerry He commented on HBASE-15482:
--

Hi, [~water]   It is still relevant, Feel free to take it.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2017-09-12 Thread Xiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162699#comment-16162699
 ] 

Xiang Li commented on HBASE-15482:
--

Hi, this JIRA is still valid now? I plan to work on it if it is still valid.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2016-03-19 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201910#comment-15201910
 ] 

Nick Dimiduk commented on HBASE-15482:
--

Seems like this would apply for all the InputFormat implementations, not just 
SnapshotInputFormat. I have that right [~liyin]?

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2016-03-19 Thread churro morales (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202058#comment-15202058
 ] 

churro morales commented on HBASE-15482:


I think Liyin is referring to only those InputFormats that deal specifically 
with store files, if the Input format scans meta, that should still be fine.  
We encountered the same issue when dealing with snapshots because when you have 
a million store files and you calculate the block distribution of each one that 
creates quite a bit of stress on the namenode.  

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2016-03-18 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200994#comment-15200994
 ] 

Ted Yu commented on HBASE-15482:


Welcome back, Liyin.

Yeah. This makes sense.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2016-03-18 Thread Liyin Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202071#comment-15202071
 ] 

Liyin Tang commented on HBASE-15482:


Yeah, that's right. Ideally, if SnapshotInputFormat can read directly from 
snapshot instead of restore, that will be awesome ! Restoring a millions of 
storefiles will also take a long time. But that will be out of the scope of 
this jira.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2016-03-18 Thread Dave Latham (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202543#comment-15202543
 ] 

Dave Latham commented on HBASE-15482:
-

Yes, I quite agree - we also skip it.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2016-03-18 Thread Liyin Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202539#comment-15202539
 ] 

Liyin Tang commented on HBASE-15482:


Dave, thanks for the response.

Even we use HDFS snapshots, it will be great to have an option to skip 
calculating block locations. To decouple computing with storage , it is 
possible to set up computing layer for query engine like Spark/Hive/Presto in a 
different cluster. In these cases, the locality doesn't matter for both HBase 
and HDFS snapshots.  

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat

2016-03-18 Thread Dave Latham (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202170#comment-15202170
 ] 

Dave Latham commented on HBASE-15482:
-

Liyin, you mentioned it being outside of the scope of this jira, but we use 
hdfs snapshots rather than hbase snapshots for reasons like that, and have an 
adapted input format for reading from them.  If there is interest we could look 
at trying to share that better, can see what shape it is in.

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
>  Issue Type: Improvement
>  Components: mapreduce
>Reporter: Liyin Tang
>Priority: Minor
>
> When a MR job is reading from SnapshotInputFormat, it needs to calculate the 
> splits based on the block locations in order to get best locality. However, 
> this process may take a long time for large snapshots. 
> In some setup, the computing layer, Spark, Hive or Presto could run out side 
> of HBase cluster. In these scenarios, the block locality doesn't matter. 
> Therefore, it will be great to have an option to skip calculating the block 
> locations for every job. That will super useful for the Hive/Presto/Spark 
> connectors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)