[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-20 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386507#comment-15386507
 ] 

Ted Yu commented on HBASE-16052:


Integrated the addendum to 0.98 branch.

Thanks, Ben.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0, 0.98.21
>
> Attachments: HBASE-16052-0.98.v3-amendment.patch, 
> HBASE-16052-0.98.v3.patch, HBASE-16052-master.patch, 
> HBASE-16052-v3-0.98.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-20 Thread Ben Lau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386372#comment-15386372
 ] 

Ben Lau commented on HBASE-16052:
-

Looks like some of the methods don't exist on Hadoop 1.1.  When I ran tests 
locally it was against Hadoop 2.X.  Will fix.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0, 0.98.21
>
> Attachments: HBASE-16052-0.98.v3.patch, HBASE-16052-master.patch, 
> HBASE-16052-v3-0.98.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15386281#comment-15386281
 ] 

Hudson commented on HBASE-16052:


FAILURE: Integrated in HBase-0.98-matrix #373 (See 
[https://builds.apache.org/job/HBase-0.98-matrix/373/])
HBASE-16052 Improve HBaseFsck Scalability (Ben Lau) (tedyu: rev 
6e5a773d359d1ea28d2e4042f4d9b0b6cbb448eb)
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSVisitor.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/util/hbck/HFileCorruptionChecker.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSUtils.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/FileStatusFilter.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/util/AbstractFileStatusFilter.java


> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0, 0.98.21
>
> Attachments: HBASE-16052-0.98.v3.patch, HBASE-16052-master.patch, 
> HBASE-16052-v3-0.98.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15385802#comment-15385802
 ] 

Hudson commented on HBASE-16052:


FAILURE: Integrated in HBase-0.98-on-Hadoop-1.1 #1245 (See 
[https://builds.apache.org/job/HBase-0.98-on-Hadoop-1.1/1245/])
HBASE-16052 Improve HBaseFsck Scalability (Ben Lau) (tedyu: rev 
6e5a773d359d1ea28d2e4042f4d9b0b6cbb448eb)
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSVisitor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSUtils.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/util/hbck/HFileCorruptionChecker.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/util/AbstractFileStatusFilter.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/FileStatusFilter.java


> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0, 0.98.21
>
> Attachments: HBASE-16052-0.98.v3.patch, HBASE-16052-master.patch, 
> HBASE-16052-v3-0.98.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-19 Thread Ben Lau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384626#comment-15384626
 ] 

Ben Lau commented on HBASE-16052:
-

Test failure looks unrelated to me.  Reran locally and passed.  Let me know if 
I missed anything but I don't think there are new Javadoc/findbugs warnings in 
the patch.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-0.98.v3.patch, HBASE-16052-master.patch, 
> HBASE-16052-v3-0.98.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15383179#comment-15383179
 ] 

Hadoop QA commented on HBASE-16052:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 
0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 13m 
30s {color} | {color:green} 0.98 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 47s 
{color} | {color:green} 0.98 passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s 
{color} | {color:green} 0.98 passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
54s {color} | {color:green} 0.98 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
22s {color} | {color:green} 0.98 passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 40s 
{color} | {color:red} hbase-server in 0.98 has 85 extant Findbugs warnings. 
{color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 46s 
{color} | {color:red} hbase-server in 0.98 failed with JDK v1.8.0. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s 
{color} | {color:green} 0.98 passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s 
{color} | {color:green} the patch passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 41s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
10m 24s {color} | {color:green} Patch does not cause any errors with Hadoop 
2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
53s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red} 0m 33s 
{color} | {color:red} hbase-server in the patch failed with JDK v1.8.0. {color} 
|
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed with JDK v1.7.0_80 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 122m 2s {color} 
| {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
32s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 159m 37s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.TestZooKeeper |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12818610/HBASE-16052-0.98.v3.patch
 |
| JIRA Issue | HBASE-16052 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  
hbaseanti  checkstyle  compile  |
| uname | Linux asf900.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | 0.98 / 8da8565 |
| Default Java | 1.7.0_80 |
| 

[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-18 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15382783#comment-15382783
 ] 

Ted Yu commented on HBASE-16052:


Ben:
You can name the patch HBASE-16052-0.98.v3.patch and try QA again.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-0.98.patch, 
> HBASE-16052-v3-branch-1.patch, HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-18 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15382723#comment-15382723
 ] 

Hadoop QA commented on HBASE-16052:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 5s {color} 
| {color:red} HBASE-16052 does not apply to master. Rebase required? Wrong 
Branch? See https://yetus.apache.org/documentation/0.2.1/precommit-patchnames 
for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12817506/HBASE-16052-v3-0.98.patch
 |
| JIRA Issue | HBASE-16052 |
| Console output | 
https://builds.apache.org/job/PreCommit-HBASE-Build/2666/console |
| Powered by | Apache Yetus 0.2.1   http://yetus.apache.org |


This message was automatically generated.



> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-0.98.patch, 
> HBASE-16052-v3-branch-1.patch, HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-18 Thread Ben Lau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15382689#comment-15382689
 ] 

Ben Lau commented on HBASE-16052:
-

[~busbey] Thanks!  

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-0.98.patch, 
> HBASE-16052-v3-branch-1.patch, HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-18 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15382661#comment-15382661
 ] 

Sean Busbey commented on HBASE-16052:
-

{quote}
Is there a way for me to trigger the tests? Does uploading a file after the 
ticket is closed not trigger tests anymore?
{quote}

Tests only run for JIRAs in "Patch Available" status. If you reopen the issue 
and then move it to patch available, that should trigger a test run.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-0.98.patch, 
> HBASE-16052-v3-branch-1.patch, HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-18 Thread Ben Lau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15382638#comment-15382638
 ] 

Ben Lau commented on HBASE-16052:
-

[~tedyu] / [~syuanjiang] How does the 0.98 patch look?  Is there a way for me 
to trigger the tests?  Does uploading a file after the ticket is closed not 
trigger tests anymore?  

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-0.98.patch, 
> HBASE-16052-v3-branch-1.patch, HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-06 Thread Ben Lau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365175#comment-15365175
 ] 

Ben Lau commented on HBASE-16052:
-

Ah okay, good to know, thanks.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-06 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365173#comment-15365173
 ] 

Ted Yu commented on HBASE-16052:


0.98 branch can also take improvement.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-06 Thread Ben Lau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365169#comment-15365169
 ] 

Ben Lau commented on HBASE-16052:
-

Hi Ted, I can probably get to it on Monday. Just to make sure I understand the 
process you guys were discussing, since this is an enhancement (not bug fix), 
this means it should go into 0.98?  I thought 0.98 was mostly in a "bug fixes 
only" state?  

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-01 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359917#comment-15359917
 ] 

Ted Yu commented on HBASE-16052:


Ben, when you have a chance (after long weekend), can you backport to 0.98 
branch ?

Thanks

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-01 Thread Ben Lau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359848#comment-15359848
 ] 

Ben Lau commented on HBASE-16052:
-

Okay let me know if you guys have a consensus for what other versions should be 
patched.  Re: 0.98-- yes we tested the original version of this patch in 0.98.  
However if we want to patch 0.98 it probably makes more sense to apply a 
backport of the trunk patch than our original 0.98 patch.  The trunk patch has 
some improvements that make the code cleaner on trunk but weren't that 
important on 0.98 originally (eg there are a lot more PathFilter classes in 
trunk so adding an abstract class to avoid too much code duplication became a 
no brainer).  To keep the code from diverging too much it would make sense to 
backport from trunk if we decide to patch 0.98.  I'll add a release note later 
based on the Jira description.  Feel free to expand/shorten it.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-01 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359832#comment-15359832
 ] 

Nick Dimiduk commented on HBASE-16052:
--

Sounds like an enhancement to me, not a bug (correctness) fix.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-01 Thread Sean Busbey (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359542#comment-15359542
 ] 

Sean Busbey commented on HBASE-16052:
-

Personally, I'd rather see this limited to new minor releases. (e.g. 0.98, 1.3, 
etc.) Also could use a release note.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-01 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359443#comment-15359443
 ] 

Ted Yu commented on HBASE-16052:


Ping [~mantonov] [~busbey] [~ndimiduk] for green light

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-07-01 Thread Stephen Yuan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359439#comment-15359439
 ] 

Stephen Yuan Jiang commented on HBASE-16052:


[~benlau] and [~tedyu], any reason that this patch did not go to other 1.x 
branches (eg. branch-1.1)?  Also I thought Ben did this for 0.98 and tested in 
production, so maybe we should also put it into 0.98.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
>Assignee: Ben Lau
> Fix For: 2.0.0, 1.4.0
>
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352417#comment-15352417
 ] 

Hudson commented on HBASE-16052:


FAILURE: Integrated in HBase-1.4 #254 (See 
[https://builds.apache.org/job/HBase-1.4/254/])
HBASE-16052 Improve HBaseFsck Scalability (Ben Lau) (tedyu: rev 
1a88673f33ac642d6e8d7758bcc39dc15a1397e5)
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/util/hbck/HFileCorruptionChecker.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/FileStatusFilter.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSVisitor.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java
* 
hbase-server/src/main/java/org/apache/hadoop/hbase/util/AbstractFileStatusFilter.java
* hbase-server/src/main/java/org/apache/hadoop/hbase/util/FSUtils.java


> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-06-27 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351700#comment-15351700
 ] 

Ted Yu commented on HBASE-16052:


Test failures above were not related to the patch.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-06-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351683#comment-15351683
 ] 

Hadoop QA commented on HBASE-16052:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 
0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
55s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s 
{color} | {color:green} master passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s 
{color} | {color:green} master passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
54s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
59s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} master passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 41s 
{color} | {color:green} master passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
51s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 56s 
{color} | {color:green} the patch passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 56s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s 
{color} | {color:green} the patch passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 41s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
58s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
31m 26s {color} | {color:green} Patch does not cause any errors with Hadoop 
2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 9s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s 
{color} | {color:green} the patch passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 42s 
{color} | {color:green} the patch passed with JDK v1.7.0_80 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 103m 54s 
{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
16s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 152m 33s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.client.TestHCM |
|   | hadoop.hbase.snapshot.TestMobFlushSnapshotFromClient |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12813801/HBASE-16052-v3-master.patch
 |
| JIRA Issue | HBASE-16052 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  
hbaseanti  checkstyle  compile  |
| uname | Linux asf900.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh
 |
| git revision | master / 424b789 |
| 

[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-06-27 Thread Ben Lau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351586#comment-15351586
 ] 

Ben Lau commented on HBASE-16052:
-

Uploaded the patch for branch-1.  Re-ran the Fsck unit tests which passed.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-branch-1.patch, 
> HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-06-27 Thread Ben Lau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351422#comment-15351422
 ] 

Ben Lau commented on HBASE-16052:
-

Thanks guys.  The patch doesn't apply cleanly to branch-1 so I will upload a 
new patch for branch-1.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
> Attachments: HBASE-16052-master.patch, HBASE-16052-v3-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-06-27 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351363#comment-15351363
 ] 

Ted Yu commented on HBASE-16052:


Ben:
Stephen gave green light on the reviewboard.
Can you upload new patch ?

See if branch-1 needs separate patch.

Thanks

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
> Attachments: HBASE-16052-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-06-22 Thread Ben Lau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15345499#comment-15345499
 ] 

Ben Lau commented on HBASE-16052:
-

Added a comment about a refactor that may clean up / unify the FSUtils 
FileStatus filtering interface a bit at the expense of some minor object 
overhead.  Let me know what you guys think on the reviewboard.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
> Attachments: HBASE-16052-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-06-21 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342806#comment-15342806
 ] 

Enis Soztutar commented on HBASE-16052:
---

Skimmed the patch, improvements makes sense. [~syuanjiang] want to take a look? 

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
> Attachments: HBASE-16052-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-06-21 Thread Ben Lau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342282#comment-15342282
 ] 

Ben Lau commented on HBASE-16052:
-

[~jmhsieh] and [~jxiang] Any comments on this patch?  Just checking since you 
guys are listed as the owners for the HBaseFsck component on 
https://issues.apache.org/jira/browse/HBASE/?selectedTab=com.atlassian.jira.jira-projects-plugin:components-panel.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
> Attachments: HBASE-16052-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-06-20 Thread Ben Lau (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340148#comment-15340148
 ] 

Ben Lau commented on HBASE-16052:
-

Reviewboard link: https://reviews.apache.org/r/48959/

Added annotation and javadoc for the AbstractFileStatusFilter.

Fixed the whitespaces mentioned in the Hadoop QA bot's comment.  

Also re: Hadoop QA bot comment: I didn't add any new tests since it seems 
existing unit tests are sufficient as this patch optimizes common codepaths 
already exercised by the Fsck tests.


> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
> Attachments: HBASE-16052-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-06-17 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336375#comment-15336375
 ] 

Ted Yu commented on HBASE-16052:


Can you put the patch on reviewboard ?
{code}
31  public abstract class AbstractFileStatusFilter implements PathFilter, 
FileStatusFilter {
{code}
Add annotation for audience.
{code}
32protected abstract boolean accept(Path p, @CheckForNull Boolean 
isDir);
{code}
Add javadoc for what isDir means.

> Improve HBaseFsck Scalability
> -
>
> Key: HBASE-16052
> URL: https://issues.apache.org/jira/browse/HBASE-16052
> Project: HBase
>  Issue Type: Improvement
>  Components: hbck
>Reporter: Ben Lau
> Attachments: HBASE-16052-master.patch
>
>
> There are some problems with HBaseFsck that make it unnecessarily slow 
> especially for large tables or clusters with many regions.  
> This patch tries to fix the biggest bottlenecks and also include a couple of 
> bug fixes for some of the race conditions caused by gathering and holding 
> state about a live cluster that is no longer true by the time you use that 
> state in Fsck processing.  These race conditions cause Fsck to crash and 
> become unusable on large clusters with lots of region splits/merges.
> Here are some scalability/performance problems in HBaseFsck and the changes 
> the patch makes:
> - Unnecessary I/O and RPCs caused by fetching an array of FileStatuses and 
> then discarding everything but the Paths, then passing the Paths to a 
> PathFilter, and then having the filter look up the (previously discarded) 
> FileStatuses of the paths again.  This is actually worse than double I/O 
> because the first lookup obtains a batch of FileStatuses while all the other 
> lookups are individual RPCs performed sequentially.
> -- Avoid this by adding a FileStatusFilter so that filtering can happen 
> directly on FileStatuses
> -- This performance bug affects more than Fsck, but also to some extent 
> things like snapshots, hfile archival, etc.  I didn't have time to look too 
> deep into other things affected and didn't want to increase the scope of this 
> ticket so I focus mostly on Fsck and make only a few improvements to other 
> codepaths.  The changes in this patch though should make it fairly easy to 
> fix other code paths in later jiras if we feel there are some other features 
> strongly impacted by this problem.  
> - OfflineReferenceFileRepair is the most expensive part of Fsck (often 50% of 
> Fsck runtime) and the running time scales with the number of store files, yet 
> the function is completely serial
> -- Make offlineReferenceFileRepair multithreaded
> - LoadHdfsRegionDirs() uses table-level concurrency, which is a big 
> bottleneck if you have 1 large cluster with 1 very large table that has 
> nearly all the regions
> -- Change loadHdfsRegionDirs() to region-level parallelism instead of 
> table-level parallelism for operations.
> The changes benefit all clusters but are especially noticeable for large 
> clusters with a few very large tables.  On our version of 0.98 with the 
> original patch we had a moderately sized production cluster with 2 (user) 
> tables and ~160k regions where HBaseFsck went from taking 18 min to 5 minutes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-16052) Improve HBaseFsck Scalability

2016-06-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336226#comment-15336226
 ] 

Hadoop QA commented on HBASE-16052:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 
0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s 
{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 
58s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 44s 
{color} | {color:green} master passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 34s 
{color} | {color:green} master passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
52s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
16s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
57s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s 
{color} | {color:green} master passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s 
{color} | {color:green} master passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
44s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 37s 
{color} | {color:green} the patch passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 37s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 34s 
{color} | {color:green} the patch passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 34s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
52s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
16s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} whitespace {color} | {color:red} 0m 0s 
{color} | {color:red} The patch has 26 line(s) that end in whitespace. Use git 
apply --whitespace=fix. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
26m 55s {color} | {color:green} Patch does not cause any errors with Hadoop 
2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 
10s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s 
{color} | {color:green} the patch passed with JDK v1.8.0 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed with JDK v1.7.0_80 {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 80m 18s 
{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
15s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 122m 37s {color} 
| {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12811257/HBASE-16052-master.patch
 |
| JIRA Issue | HBASE-16052 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  hadoopcheck  
hbaseanti  checkstyle  compile  |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP 
PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | master / 3abd52b |
| Default Java | 1.7.0_80 |
| Multi-JDK versions |  /home/jenkins/tools/java/jdk1.8.0:1.8.0