[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297873#comment-16297873 ] Hudson commented on HBASE-15482:

FAILURE: Integrated in Jenkins build HBase-Trunk_matrix #4257 (See [https://builds.apache.org/job/HBase-Trunk_matrix/4257/])
HBASE-15482 Provide an option to skip calculating block locations for (tedyu: rev 5e7d16a3ceaeec5057474f9bae2d40d306f6dd8e)
* (edit) hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormatTestBase.java
* (edit) hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapred/TestTableSnapshotInputFormat.java
* (edit) hbase-mapreduce/src/test/java/org/apache/hadoop/hbase/mapreduce/TestTableSnapshotInputFormat.java
* (edit) hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormatImpl.java

> Provide an option to skip calculating block locations for SnapshotInputFormat
> -----------------------------------------------------------------------------
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Assignee: Xiang Li
> Priority: Minor
> Fix For: 2.0.0-beta-1
> Attachments: 15482.v3.txt, HBASE-15482.master.000.patch, HBASE-15482.master.001.patch, HBASE-15482.master.002.patch, HBASE-15482.master.003.patch
>
> When an MR job reads from SnapshotInputFormat, it needs to calculate the splits based on the block locations in order to get the best locality. However, this process may take a long time for large snapshots.
> In some setups, the computing layer (Spark, Hive, or Presto) could run outside of the HBase cluster. In these scenarios, block locality doesn't matter. Therefore, it would be great to have an option to skip calculating the block locations for every job. That would be super useful for the Hive/Presto/Spark connectors.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297478#comment-16297478 ] Jerry He commented on HBASE-15482: +1 on the latest patch.
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297466#comment-16297466 ] Ted Yu commented on HBASE-15482: lgtm
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297150#comment-16297150 ] Hadoop QA commented on HBASE-15482:

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 8s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | findbugs | 0m 1s | Findbugs executables are not available. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 49s | master passed |
| +1 | compile | 0m 24s | master passed |
| +1 | checkstyle | 0m 17s | master passed |
| +1 | shadedjars | 5m 0s | branch has no errors when building our shaded downstream artifacts. |
| +1 | javadoc | 0m 14s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 5m 0s | the patch passed |
| +1 | compile | 0m 23s | the patch passed |
| +1 | javac | 0m 23s | the patch passed |
| +1 | checkstyle | 0m 18s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 53s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 19m 48s | Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. |
| +1 | javadoc | 0m 15s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 10m 25s | hbase-mapreduce in the patch passed. |
| +1 | asflicense | 0m 9s | The patch does not generate ASF License warnings. |
| | | 47m 7s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-15482 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12902892/HBASE-15482.master.003.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 6e068a59f070 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 03e79b7994 |
| maven | version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/10566/testReport/ |
| modules | C: hbase-mapreduce U: hbase-mapreduce |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/10566/console |
| Powered by | Apache Yetus 0.6.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297084#comment-16297084 ] Xiang Li commented on HBASE-15482: Regarding the UT: for mapred, the number of splits generated (10) is exactly the same as the numRegions specified (10), but only 8 of them have a non-empty location array. For mapreduce, only 8 splits are generated when there are 10 regions.
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297060#comment-16297060 ] Xiang Li commented on HBASE-15482:

Hi [~tedyu], [~jerryhe], thanks for your comments and guidance! Patch 003 is uploaded, mainly with the following changes:
* Simplify the logic in light of 15482.v3.txt. In addition, add logic to:
** Check if numTopsAtMost < 1 (which is invalid).
** Check if top is 1. When it is 1, return the top host directly.
* Change the conf key string from {{hbase.TableSnapshotInputFormat.locality.enable}} to {{hbase.TableSnapshotInputFormat.locality.enabled}}, using "enabled" instead of "enable", since most of the conf key strings I see use "enabled".
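The effect of the renamed key can be illustrated with a minimal sketch. It is not the code from patch 003: java.util.Properties stands in for Hadoop's Configuration so the example runs without HBase on the classpath, the class and method names are hypothetical, and the default of true is an assumption. The empty (non-null) location array for the disabled case follows the unit-test assertion quoted elsewhere in this thread.

```java
import java.util.Properties;
import java.util.function.Supplier;

// Hypothetical stand-in: java.util.Properties replaces Hadoop's Configuration
// so that this sketch runs without HBase or Hadoop on the classpath.
final class LocalityFlagSketch {
  // Key string as quoted in the comment above (patch 003 spelling).
  static final String LOCALITY_ENABLED_KEY =
      "hbase.TableSnapshotInputFormat.locality.enabled";
  // Assumed default: keep computing block locations unless a job opts out.
  static final boolean LOCALITY_ENABLED_DEFAULT = true;

  static boolean isLocalityEnabled(Properties conf) {
    String raw = conf.getProperty(LOCALITY_ENABLED_KEY);
    return raw == null ? LOCALITY_ENABLED_DEFAULT : Boolean.parseBoolean(raw.trim());
  }

  // The supplier represents the expensive block-location calculation.
  // It is only invoked when locality is enabled; otherwise the split
  // reports an empty, non-null location array.
  static String[] locationsFor(Properties conf, Supplier<String[]> expensiveLookup) {
    return isLocalityEnabled(conf) ? expensiveLookup.get() : new String[0];
  }
}
```

Passing the lookup as a supplier keeps the point of the option visible: when the flag is false, the costly calculation is never started rather than computed and discarded.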
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287161#comment-16287161 ] Ted Yu commented on HBASE-15482:

With v3, the test needs some adjustment. Otherwise you would see:
{code}
java.lang.AssertionError: expected:<[h2, h3, h4, h1]> but was:<[h2, h3, h4]>
  at org.apache.hadoop.hbase.mapreduce.TestTableSnapshotInputFormat.testGetBestLocations(TestTableSnapshotInputFormat.java:138)
{code}
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16287092#comment-16287092 ] Xiang Li commented on HBASE-15482:

[~jerryhe], thanks very much for the comments. Got your idea. Under the condition of {{else { // hostAndWeights.length >= 2 && numTopsAtMost >= 2}}, it could break out when numTopsAtMost is met.
{code}
List locations = new ArrayList<>(Math.min(numTopsAtMost, hostAndWeights.length));
...
for (int i = 1; i < hostAndWeights.length; i++) {
}
{code}
The length of locations is the minimum of numTopsAtMost and hostAndWeights.length, and if numTopsAtMost is less than hostAndWeights.length, the loop will run until numTopsAtMost is met.

I agree that the added logic is hard to read. The original code hardcodes numTopsAtMost to 3, and given the comment that it is not very likely to get more than 3 hosts with at least 80% of the best locality, I also feel it is probably unnecessary to make numTopsAtMost a variable that can be specified and to add that logic.

I am trying to reach [~ndimiduk] to see whether he has more comments on the change or could explain further the comment he introduced in HBASE-11137.

[~ted_yu], what is your opinion?
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285528#comment-16285528 ] Jerry He commented on HBASE-15482:

Hi, [~water]

In the 002 patch, you added 'numTopsAtMost' in getBestLocations. You will need another 'break' in the loop? Like: If numTopsAtMost is met, then break out.

But again, the new code with this 'numTopsAtMost' is probably unnecessary. The comment for the method getBestLocations has explained that it is not very likely you will get more than 3 hosts with at least 80% (hbase.tablesnapshotinputformat.locality.cutoff.multiplier) as much block locality as the top host with the best locality. So you will break out early anyway with the filterWeight check.

Your first patch's logic is good enough. The added comment is good.
{code}
// As hostAndWeights is in descending order,
// we could break the loop as long as we meet a weight which is less than filterWeight
{code}
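The interaction between the filterWeight cutoff and a numTopsAtMost cap can be sketched as below. This is an illustration under stated assumptions, not the code of any attached patch: the class names, the HostAndWeight stand-in, the method signature, and the weights in the usage note are all hypothetical. Only the ideas (hosts sorted by descending weight, a cutoff relative to the top host, an upper bound on returned hosts, rejecting a bound below 1) come from the discussion.

```java
import java.util.ArrayList;
import java.util.List;

final class BestLocationsSketch {
  // Hypothetical stand-in for a (host, block weight) pair.
  static final class HostAndWeight {
    final String host;
    final long weight;

    HostAndWeight(String host, long weight) {
      this.host = host;
      this.weight = weight;
    }
  }

  /**
   * @param sortedDesc hosts ordered by weight, heaviest first
   * @param cutoffMultiplier fraction of the top weight a host must reach
   *        (0.8 in the discussion above)
   * @param numTopsAtMost upper bound on the number of hosts returned
   */
  static List<String> getBestLocations(
      List<HostAndWeight> sortedDesc, double cutoffMultiplier, int numTopsAtMost) {
    if (numTopsAtMost < 1) {
      throw new IllegalArgumentException("numTopsAtMost must be at least 1");
    }
    List<String> locations =
        new ArrayList<>(Math.min(numTopsAtMost, sortedDesc.size()));
    if (sortedDesc.isEmpty()) {
      return locations;
    }
    HostAndWeight top = sortedDesc.get(0);
    locations.add(top.host);
    double filterWeight = top.weight * cutoffMultiplier;
    // Stop at the cap, or at the first host below the cutoff.
    for (int i = 1; i < sortedDesc.size() && locations.size() < numTopsAtMost; i++) {
      HostAndWeight candidate = sortedDesc.get(i);
      if (candidate.weight < filterWeight) {
        break; // descending order: every later host is below the cutoff too
      }
      locations.add(candidate.host);
    }
    return locations;
  }
}
```

With illustrative weights h2=10, h3=9, h4=9, h1=9 and a multiplier of 0.8, a cap of 3 yields [h2, h3, h4] while a cap of 4 yields all four hosts, which mirrors the kind of difference a test written for an uncapped result would report after a cap of 3 is introduced.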
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285456#comment-16285456 ] Xiang Li commented on HBASE-15482:

I will upload a new patch with some changes to the UT, to address the following comment by Jerry:
{quote}
{code}
if (careBlockLocality) {
  Assert.assertTrue(split.getLocations() != null && split.getLocations().length != 0);
} else {
  Assert.assertTrue(split.getLocations() != null && split.getLocations().length == 0);
}
{code}
This is ok too. The first test is an existing test, and it has not failed previously.
{quote}
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285062#comment-16285062 ] Jerry He commented on HBASE-15482:

The patch looks good! I just think the first patch, 000, is cleaner. But, as Ted suggested, change hbase.TableSnapshotInputFormat.locality to hbase.TableSnapshotInputFormat.locality.enable. (Change the name SNAPSHOT_INPUTFORMAT_CARE_BLOCK_LOCALITY_KEY too.) The other changes look unnecessary and only make it more complicated.
{code}
if (careBlockLocality) {
  Assert.assertTrue(split.getLocations() != null && split.getLocations().length != 0);
} else {
  Assert.assertTrue(split.getLocations() != null && split.getLocations().length == 0);
}
{code}
This is ok too. The first test is an existing test, and it has not failed previously.
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284998#comment-16284998 ] Ted Yu commented on HBASE-15482: [~davelatham]: Can you take another look ?
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284499#comment-16284499 ] Ted Yu commented on HBASE-15482: lgtm
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284496#comment-16284496 ] Xiang Li commented on HBASE-15482: [~tedyu], would you please help to review patch 002 at your convenience?
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283272#comment-16283272 ] Hadoop QA commented on HBASE-15482:

| (/) *+1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 7s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | findbugs | 0m 0s | Findbugs executables are not available. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 30s | master passed |
| +1 | compile | 0m 21s | master passed |
| +1 | checkstyle | 0m 15s | master passed |
| +1 | shadedjars | 4m 39s | branch has no errors when building our shaded downstream artifacts. |
| +1 | javadoc | 0m 14s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 34s | the patch passed |
| +1 | compile | 0m 20s | the patch passed |
| +1 | javac | 0m 20s | the patch passed |
| +1 | checkstyle | 0m 16s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 26s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 50m 42s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 2.7.4 or 3.0.0-alpha4. |
| +1 | javadoc | 0m 14s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 9m 48s | hbase-mapreduce in the patch passed. |
| +1 | asflicense | 0m 9s | The patch does not generate ASF License warnings. |
| | | 76m 6s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-15482 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12901199/HBASE-15482.master.002.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux a7d9558a5feb 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 5034411438 |
| maven | version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/10301/testReport/ |
| modules | C: hbase-mapreduce U: hbase-mapreduce |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/10301/console |
| Powered by | Apache Yetus 0.6.0 http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283093#comment-16283093 ]

Hadoop QA commented on HBASE-15482:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 8s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | findbugs | 0m 0s | Findbugs executables are not available. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 4m 42s | master passed |
| +1 | compile | 0m 24s | master passed |
| +1 | checkstyle | 0m 17s | master passed |
| +1 | shadedjars | 4m 47s | branch has no errors when building our shaded downstream artifacts. |
| +1 | javadoc | 0m 15s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 52s | the patch passed |
| +1 | compile | 0m 22s | the patch passed |
| +1 | javac | 0m 22s | the patch passed |
| -1 | checkstyle | 0m 17s | hbase-mapreduce: The patch generated 3 new + 17 unchanged - 0 fixed = 20 total (was 17) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 30s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 59m 35s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 2.7.4 or 3.0.0-alpha4. |
| +1 | javadoc | 0m 17s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 11m 54s | hbase-mapreduce in the patch passed. |
| +1 | asflicense | 0m 11s | The patch does not generate ASF License warnings. |
| | | 87m 54s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-15482 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12901194/HBASE-15482.master.001.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 7a1faeb05d01 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | master / 5034411438 |
| maven | version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
| checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/10299/artifact/patchprocess/diff-checkstyle-hbase-mapreduce.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/10299/testReport/ |
| modules | C: hbase-mapreduce U: hbase-mapreduce |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/10299/console |
| Powered by | Apache Yetus 0.6.0 http://yetus.apache.org |

This message
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283087#comment-16283087 ] Ted Yu commented on HBASE-15482: You can choose the form which is intuitive to you. > Provide an option to skip calculating block locations for SnapshotInputFormat > - > > Key: HBASE-15482 > URL: https://issues.apache.org/jira/browse/HBASE-15482 > Project: HBase > Issue Type: Improvement > Components: mapreduce >Reporter: Liyin Tang >Assignee: Xiang Li >Priority: Minor > Fix For: 2.1.0 > > Attachments: HBASE-15482.master.000.patch, > HBASE-15482.master.001.patch > > > When a MR job is reading from SnapshotInputFormat, it needs to calculate the > splits based on the block locations in order to get best locality. However, > this process may take a long time for large snapshots. > In some setup, the computing layer, Spark, Hive or Presto could run out side > of HBase cluster. In these scenarios, the block locality doesn't matter. > Therefore, it will be great to have an option to skip calculating the block > locations for every job. That will super useful for the Hive/Presto/Spark > connectors. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283085#comment-16283085 ]

Xiang Li commented on HBASE-15482:

Hi [~tedyu], thanks very much for the review and comments. I see your point. I was wondering whether the current statements are easier to understand than the form you suggested. The current statements explicitly show that there are 3 conditions to handle:
# When hostAndWeights.length == 0
# When hostAndWeights.length == 1 || numTopsAtMost <= 1
# Otherwise (hostAndWeights.length >= 2 && numTopsAtMost >= 2)

Please let me know what you think. Thanks!

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Assignee: Xiang Li
> Priority: Minor
> Fix For: 2.1.0
> Attachments: HBASE-15482.master.000.patch, HBASE-15482.master.001.patch
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283043#comment-16283043 ]

Ted Yu commented on HBASE-15482:

{code}
+ List locations = new ArrayList<>(Math.min(numTopsAtMost, hostAndWeights.length));
...
+ for (int i = 1; i < hostAndWeights.length; i++) {
{code}
Shouldn't the value of Math.min() be used as the upper bound of the loop above?
{code}
+ return locations;
+} else { // hostAndWeights.length >= 2 && numTopsAtMost >= 2
{code}
nit: you can omit the 'else' keyword following the return in the previous if block.

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Assignee: Xiang Li
> Priority: Minor
> Fix For: 2.1.0
> Attachments: HBASE-15482.master.000.patch, HBASE-15482.master.001.patch
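The two review points above (reuse the Math.min() value as the loop bound, and drop the 'else' after an early return) can be sketched as follows. This is only an illustration under stated assumptions: HostAndWeight here is a hypothetical stand-in rather than the HBase class, topHosts() is an invented name, and the sketch ignores any further filtering the real getBestLocations() may apply. It also returns an empty list for the empty case, which is a simplification and not necessarily what the patch does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TopLocationsSketch {
  /** Hypothetical stand-in for a host/weight pair; not the real HBase type. */
  static final class HostAndWeight {
    final String host;
    final long weight;

    HostAndWeight(String host, long weight) {
      this.host = host;
      this.weight = weight;
    }
  }

  /** Returns at most numTopsAtMost host names, heaviest first. */
  static List<String> topHosts(HostAndWeight[] hostAndWeights, int numTopsAtMost) {
    // One value serves as both the list capacity and the loop's upper bound.
    int limit = Math.max(0, Math.min(numTopsAtMost, hostAndWeights.length));
    List<String> locations = new ArrayList<>(limit);
    if (limit == 0) {
      return locations;
    }
    // No 'else' is needed here because the branch above already returned.
    HostAndWeight[] sorted = hostAndWeights.clone();
    Arrays.sort(sorted, Comparator.comparingLong((HostAndWeight h) -> h.weight).reversed());
    for (int i = 0; i < limit; i++) {
      locations.add(sorted[i].host);
    }
    return locations;
  }

  public static void main(String[] args) {
    HostAndWeight[] hosts = {
      new HostAndWeight("host-a", 10),
      new HostAndWeight("host-b", 50),
      new HostAndWeight("host-c", 30),
      new HostAndWeight("host-d", 20),
    };
    System.out.println(topHosts(hosts, 3));
  }
}
```

Because the bound is computed once, the caller no longer needs a separate truncation step after the call.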
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16283035#comment-16283035 ]

Xiang Li commented on HBASE-15482:

[~tedyu], thanks very much for your comments! Patch 001 is updated to address them, as well as the errors reported by checkstyle.
* "hbase.TableSnapshotInputFormat.locality" is changed to "hbase.TableSnapshotInputFormat.locality.enable".
* The truncation of locations is moved into getBestLocations().
* The errors reported by checkstyle are corrected.

Regarding {{moving the truncation of locations into getBestLocations()}}: the code has different logic for different combinations of hostAndWeights.length and numTopsAtMost. There is also a small behavior change in getBestLocations() when hostAndWeights.length is 0:
* Originally, it returned an empty list.
* After the change, it returns null.

I think we do not need to allocate an empty list here, because the locations are used to construct TableSnapshotInputFormatImpl.InputSplit, which checks for null as follows:
{code:title=hbase/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormatImpl.java|borderStyle=solid}
public InputSplit(TableDescriptor htd, HRegionInfo regionInfo, List locations,
    Scan scan, Path restoreDir) {
  this.htd = htd;
  this.regionInfo = regionInfo;
  if (locations == null || locations.isEmpty()) { // <--- here
    this.locations = new String[0];
  } else {
    this.locations = locations.toArray(new String[locations.size()]);
  }
  try {
    this.scan = scan != null ? TableMapReduceUtil.convertScanToString(scan) : "";
  } catch (IOException e) {
    LOG.warn("Failed to convert Scan to String", e);
  }
  this.restoreDir = restoreDir.toString();
}
{code}
Also, TableSnapshotInputFormatImpl is @InterfaceAudience.Private, and there are no other callers of getBestLocations() in the whole HBase project except UTs. A UT is updated according to the change above.

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Assignee: Xiang Li
> Priority: Minor
> Fix For: 2.1.0
> Attachments: HBASE-15482.master.000.patch
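The argument in the comment above is that a null list and an empty list end up as the same empty array once the split is constructed, so returning null costs the caller nothing. A minimal, self-contained sketch of that normalization follows; toLocationArray() is a hypothetical helper name used only for illustration and is not part of HBase.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class LocationsNormalizeSketch {
  /** Mirrors the null/empty check quoted above: both cases yield an empty array. */
  static String[] toLocationArray(List<String> locations) {
    if (locations == null || locations.isEmpty()) {
      return new String[0];
    }
    return locations.toArray(new String[0]);
  }

  public static void main(String[] args) {
    // null and an empty list are indistinguishable after normalization.
    System.out.println(toLocationArray(null).length);
    System.out.println(toLocationArray(Collections.<String>emptyList()).length);
    System.out.println(Arrays.toString(toLocationArray(Arrays.asList("host-a", "host-b"))));
  }
}
```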
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281299#comment-16281299 ]

Ted Yu commented on HBASE-15482:

{code}
+ public static final String SNAPSHOT_INPUTFORMAT_CARE_BLOCK_LOCALITY_KEY = "hbase.TableSnapshotInputFormat.locality";
{code}
From the current key name, it sounds like a locality measure (say, a percentage). How about naming it "hbase.TableSnapshotInputFormat.locality.enable" (or something similar)?
{code}
+ List hosts = getBestLocations(conf,
+ HRegion.computeHDFSBlocksDistribution(conf, htd, hri, tableDir));
+
+ // return at most top 3 hosts
+ int len = Math.min(3, hosts.size());
{code}
You can pass 3 to getBestLocations() so that no more than 3 hosts are returned.

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Assignee: Xiang Li
> Priority: Minor
> Fix For: 2.1.0
> Attachments: HBASE-15482.master.000.patch
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16281248#comment-16281248 ]

Hadoop QA commented on HBASE-15482:

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 52s | Docker mode activated. |
|| || || || Prechecks ||
| 0 | findbugs | 0m 0s | Findbugs executables are not available. |
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 3 new or modified test files. |
|| || || || master Compile Tests ||
| +1 | mvninstall | 5m 2s | master passed |
| +1 | compile | 0m 24s | master passed |
| +1 | checkstyle | 0m 20s | master passed |
| +1 | shadedjars | 4m 52s | branch has no errors when building our shaded downstream artifacts. |
| +1 | javadoc | 0m 15s | master passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 4m 52s | the patch passed |
| +1 | compile | 0m 25s | the patch passed |
| +1 | javac | 0m 25s | the patch passed |
| -1 | checkstyle | 0m 17s | hbase-mapreduce: The patch generated 15 new + 17 unchanged - 0 fixed = 32 total (was 17) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedjars | 4m 33s | patch has no errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 54m 6s | Patch does not cause any errors with Hadoop 2.6.1 2.6.2 2.6.3 2.6.4 2.6.5 2.7.1 2.7.2 2.7.3 2.7.4 or 3.0.0-alpha4. |
| +1 | javadoc | 0m 15s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 11m 38s | hbase-mapreduce in the patch passed. |
| +1 | asflicense | 0m 11s | The patch does not generate ASF License warnings. |
| | | 84m 20s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-15482 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12900899/HBASE-15482.master.000.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux eeaf111c9913 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh |
| git revision | master / 4a2e8b852d |
| maven | version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
| checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/10276/artifact/patchprocess/diff-checkstyle-hbase-mapreduce.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/10276/testReport/ |
| modules | C: hbase-mapreduce U: hbase-mapreduce |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/10276/console |
| Powered by | Apache Yetus 0.6.0 http://yetus.apache.org |

This
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16280518#comment-16280518 ]

Xiang Li commented on HBASE-15482:

[~davelatham], thanks very much for the comment and guidance. I have uploaded the first patch, 000:
* The conf key is "hbase.TableSnapshotInputFormat.locality" and defaults to true; that is, locality is always considered and the block locations are calculated unless the key is explicitly set to false. When it is set to false, the logic containing getBestLocations() is skipped and the new TableSnapshotInputFormatImpl.InputSplit is created with locations set to null.
* The access modifier of both the conf key and the default value is public, so that they can be accessed by test classes in other packages.
* The UTs are embedded into existing test cases.
** The test case in the mapred package covers the scenario where the conf key is not specified and the default value of true is used.
** The test cases in the mapreduce package cover the scenarios where the conf key is explicitly set to true or false.

Hi [~davelatham], [~liyin], [~ted_yu], could you please help review the patch at your earliest convenience?

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Assignee: Xiang Li
> Priority: Minor
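The behavior described for patch 000 (default to true, and skip the expensive lookup entirely when the key is false) can be sketched as below. This is an illustration under assumptions, not the patch itself: a plain Map stands in for the Hadoop Configuration object, the Supplier stands in for the block-distribution computation, and the class and method names are invented. The key string is the one quoted for patch 000; the thread notes that patch 001 renames it to "hbase.TableSnapshotInputFormat.locality.enable".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

final class LocalityToggleSketch {
  // Key name as quoted for patch 000 in the thread.
  static final String LOCALITY_KEY = "hbase.TableSnapshotInputFormat.locality";
  static final boolean LOCALITY_DEFAULT = true;

  /** Reads the flag from a Map used here as a stand-in for a job configuration. */
  static boolean localityEnabled(Map<String, String> conf) {
    String value = conf.get(LOCALITY_KEY);
    return value == null ? LOCALITY_DEFAULT : Boolean.parseBoolean(value);
  }

  /**
   * Returns the locations for one split. When locality is disabled the
   * expensive lookup is never invoked and null is returned instead.
   */
  static List<String> locationsFor(Map<String, String> conf,
      Supplier<List<String>> expensiveLookup) {
    if (!localityEnabled(conf)) {
      return null;
    }
    return expensiveLookup.get();
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    Supplier<List<String>> lookup = () -> Arrays.asList("host-a", "host-b");
    System.out.println(locationsFor(conf, lookup)); // key unset: default applies
    conf.put(LOCALITY_KEY, "false");
    System.out.println(locationsFor(conf, lookup)); // lookup skipped
  }
}
```

Gating before the lookup, rather than discarding its result afterwards, is what saves the time spent on block-location calculation.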
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242203#comment-16242203 ]

Dave Latham commented on HBASE-15482:

No, that will only stop using the locations. We need to avoid spending the time to calculate them in the first place. See TableSnapshotInputFormatImpl.getSplits and getBestLocations. (Unless that has changed in trunk.)

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Assignee: Xiang Li
> Priority: Minor
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16241965#comment-16241965 ]

Xiang Li commented on HBASE-15482:

Hi [~liyintang], [~davelatham],
I plan to provide an option (with a key like "hbase.TableSnapshotInputFormat.locality", defaulting to true) and to check it in TableSnapshotInputFormat.TableSnapshotRegionSplit#getLocations():
{code}
if option is true // care locality
  return delegate.getLocations();
else
  return new String[0];
{code}
Is this the right way to go?

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Assignee: Xiang Li
> Priority: Minor
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171033#comment-16171033 ]

Xiang Li commented on HBASE-15482:

Thanks [~jerryhe]. Got it. I have just unassigned this JIRA. I joined a new company last week and am still learning everything, so it is hard to manage my time. I should be back soon.

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Priority: Minor
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16164014#comment-16164014 ]

Jerry He commented on HBASE-15482:

Hi [~water], it is still relevant. Feel free to take it.

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Priority: Minor
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162699#comment-16162699 ]

Xiang Li commented on HBASE-15482:

Hi, is this JIRA still valid? I plan to work on it if so.

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Priority: Minor
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201910#comment-15201910 ]

Nick Dimiduk commented on HBASE-15482:

Seems like this would apply to all the InputFormat implementations, not just SnapshotInputFormat. Do I have that right, [~liyin]?

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Priority: Minor

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202058#comment-15202058 ]

churro morales commented on HBASE-15482:

I think Liyin is referring only to those InputFormats that deal specifically with store files. If the InputFormat scans meta, that should still be fine. We encountered the same issue when dealing with snapshots: when you have a million store files and you calculate the block distribution of each one, that creates quite a bit of stress on the namenode.

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Priority: Minor
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200994#comment-15200994 ]

Ted Yu commented on HBASE-15482:

Welcome back, Liyin. Yeah. This makes sense.

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Priority: Minor
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202071#comment-15202071 ]

Liyin Tang commented on HBASE-15482:

Yeah, that's right. Ideally, it would be awesome if SnapshotInputFormat could read directly from the snapshot instead of restoring it, since restoring millions of store files also takes a long time. But that is out of the scope of this JIRA.

> Provide an option to skip calculating block locations for SnapshotInputFormat
>
> Key: HBASE-15482
> URL: https://issues.apache.org/jira/browse/HBASE-15482
> Project: HBase
> Issue Type: Improvement
> Components: mapreduce
> Reporter: Liyin Tang
> Priority: Minor
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202543#comment-15202543 ]

Dave Latham commented on HBASE-15482:

Yes, I quite agree - we also skip it.
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202539#comment-15202539 ]

Liyin Tang commented on HBASE-15482:

Dave, thanks for the response. Even if we use HDFS snapshots, it would be great to have an option to skip calculating block locations. To decouple computing from storage, it is possible to set up the computing layer for query engines like Spark/Hive/Presto in a different cluster. In these cases, the locality doesn't matter for either HBase or HDFS snapshots.
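For readers following the thread, here is a minimal sketch of the kind of option being asked for: split planning that skips the block-location lookup when a flag is off. This is not the HBase implementation or the attached patch. The class `SnapshotSplitSketch`, the `Split` type, the `planSplits` method, and the `blockLocator` callback are all hypothetical names made up for illustration, and the lookup is modeled as a plain function rather than a real HDFS call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; none of these names come from HBase.
public class SnapshotSplitSketch {

    /** One input split: a region name plus the hosts preferred for locality. */
    static final class Split {
        final String region;
        final List<String> preferredHosts;

        Split(String region, List<String> preferredHosts) {
            this.region = region;
            this.preferredHosts = preferredHosts;
        }
    }

    /**
     * Builds one split per region. When localityEnabled is false, the
     * (potentially slow) block-location lookup is never invoked and every
     * split carries an empty host list.
     */
    static List<Split> planSplits(List<String> regions,
                                  boolean localityEnabled,
                                  Function<String, List<String>> blockLocator) {
        List<Split> splits = new ArrayList<>();
        for (String region : regions) {
            List<String> hosts = localityEnabled
                    ? blockLocator.apply(region)
                    : Collections.<String>emptyList();
            splits.add(new Split(region, hosts));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> regions = Arrays.asList("region-a", "region-b");
        int[] lookups = {0}; // counts how often the stand-in lookup runs

        // Stand-in for an expensive block-location computation.
        Function<String, List<String>> locator = region -> {
            lookups[0]++;
            return Arrays.asList("host-for-" + region);
        };

        List<Split> local = planSplits(regions, true, locator);
        List<Split> remote = planSplits(regions, false, locator);

        System.out.println("with locality: " + local.get(0).preferredHosts);
        System.out.println("without locality: " + remote.get(0).preferredHosts);
        System.out.println("lookups performed: " + lookups[0]);
    }
}
```

With the flag off, the splits still cover every region, so a compute layer running in a separate cluster loses nothing, while the per-region lookup cost disappears from job setup.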
[jira] [Commented] (HBASE-15482) Provide an option to skip calculating block locations for SnapshotInputFormat
[ https://issues.apache.org/jira/browse/HBASE-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15202170#comment-15202170 ]

Dave Latham commented on HBASE-15482:

Liyin, you mentioned it being outside of the scope of this jira, but we use HDFS snapshots rather than HBase snapshots for reasons like that, and have an adapted input format for reading from them. If there is interest, we could look at sharing that and see what shape it is in.