[jira] [Created] (HBASE-16747) Track memstore data size and heap overhead separately

2016-10-02 Thread Anoop Sam John (JIRA)
Anoop Sam John created HBASE-16747:
--

 Summary: Track memstore data size and heap overhead separately 
 Key: HBASE-16747
 URL: https://issues.apache.org/jira/browse/HBASE-16747
 Project: HBase
  Issue Type: Sub-task
Reporter: Anoop Sam John
Assignee: Anoop Sam John
 Fix For: 2.0.0


We track the memstore size in three places:
1. Globally at the RS level, in RegionServerAccounting. This tracks the size of 
all memstores and is used to decide whether forced flushes are needed because of 
global heap pressure.
2. At the region level, in HRegion. This is the sum of the sizes of all memstores 
within the region and is used to decide whether the region has reached the flush 
size (128 MB by default).
3. At the segment level. This drives the in-memory flush/compaction decisions.

All of these use the Cell's heap size, which includes the data bytes as well as 
the Cell object's heap overhead. We also include the overhead of adding Cells 
into the Segment's data structures (like the CSLM).

Once we have an off-heap memstore, the cell data bytes will live in the off-heap 
area, so we can no longer track data size and heap overhead as a single quantity. 
We need to separate them and track each on its own.

The proposal here is to track cell data size and heap overhead separately at the 
global accounting layer. As of now we only have an on-heap memstore, so the 
global memstore boundary checks will consider both (add them up and check against 
the global max memstore size).
At the region level, track the cell data size alone (this can be on heap or off 
heap); region flushes then use the cell data size alone for the flush decision. 
A user configuring a 128 MB flush size normally expects roughly 128 MB of data 
per flush, but because we were also counting the heap overhead, the actual data 
size flushed fell well short of 128 MB. With this change we behave closer to 
what the user expects.
Segment-level in-memory flush/compaction also considers the cell data size alone, 
but we still need to track the heap overhead as well (once an in-memory flush or 
a normal flush happens, we have to adjust both the cell data size and the heap 
overhead).
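
To make the proposal concrete, here is a minimal sketch of such split accounting 
as a hypothetical {{MemStoreSizing}} class (the name and API are assumptions, not 
the actual HBase implementation):

{code}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical accounting object: data bytes and heap overhead kept separate.
public class MemStoreSizing {
  private final AtomicLong dataSize = new AtomicLong();     // cell data bytes (on- or off-heap)
  private final AtomicLong heapOverhead = new AtomicLong(); // Cell object + CSLM overhead

  public void incMemStoreSize(long dataDelta, long overheadDelta) {
    dataSize.addAndGet(dataDelta);
    heapOverhead.addAndGet(overheadDelta);
  }

  // Region-level flush decision: data bytes alone against the flush size.
  public boolean reachedFlushSize(long flushSizeBytes) {
    return dataSize.get() >= flushSizeBytes;
  }

  // Global boundary check (on-heap memstore): both counters added up.
  public boolean aboveGlobalLimit(long globalMemStoreLimitBytes) {
    return dataSize.get() + heapOverhead.get() >= globalMemStoreLimitBytes;
  }
}
{code}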





[jira] [Commented] (HBASE-15560) TinyLFU-based BlockCache

2016-10-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15541521#comment-15541521
 ] 

Hadoop QA commented on HBASE-15560:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 5 new or modified test files. {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s {color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 45s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 11s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 26s {color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 32s {color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s {color} | {color:blue} Skipped patched modules with no Java source: hbase-resource-bundle . {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 32s {color} | {color:red} hbase-common in master has 1 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 19s {color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 7s {color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 3m 43s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 1m 32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 5s {color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 24m 21s {color} | {color:green} Patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green} 1m 33s {color} | {color:green} the patch passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s {color} | {color:blue} Skipped patched modules with no Java source: hbase-resource-bundle . {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 24s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 20s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 6s {color} | {color:green} hbase-resource-bundle in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 40s {color} | {color:green} hbase-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 89m 36s {color} | {color:red} hbase-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 86m 13s {color} | {color:red} root in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 48s {color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 231m 36s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.regionserver.TestHRegionWithInMemoryFlush |
|   | hadoop.hbase.regionserver.TestHRegionWithInMemoryFlush |
| Timed out junit tests | org.apache.hadoop.hbase.constraint.TestConstraint |
|   | org.apache.hadoop.hbase.TestNamespace |
|   | 

[jira] [Updated] (HBASE-16739) Timed out exception message should include encoded region name

2016-10-02 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated HBASE-16739:
---
   Resolution: Fixed
 Assignee: Ted Yu
 Hadoop Flags: Reviewed
Fix Version/s: 1.4.0
   2.0.0
   Status: Resolved  (was: Patch Available)

Thanks for the review, Jerry.

> Timed out exception message should include encoded region name
> --
>
> Key: HBASE-16739
> URL: https://issues.apache.org/jira/browse/HBASE-16739
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Fix For: 2.0.0, 1.4.0
>
> Attachments: 16739.v1.txt
>
>
> Saw the following in the region server log repeatedly:
> {code}
> 2016-09-26 10:13:33,219 WARN org.apache.hadoop.hbase.regionserver.HRegion: 
> Failed getting lock in batch put, row=1
> java.io.IOException: Timed out waiting for lock for row: 1
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:5151)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:3046)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2902)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2844)
>   at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:692)
>   at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:654)
>   at 
> org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2040)
> {code}
> The region name was not logged, making troubleshooting a bit difficult.
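
The shape of the fix, as a hedged sketch (the actual change is in 16739.v1.txt; 
the variable names and exact message format here are assumptions):

{code}
// Sketch: include the encoded region name when the row-lock wait times out.
if (!rowLock.tryLock(this.rowLockWaitDuration, TimeUnit.MILLISECONDS)) {
  throw new IOException("Timed out waiting for lock for row: " + rowKey
      + " in region " + getRegionInfo().getEncodedName());
}
{code}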





[jira] [Commented] (HBASE-16739) Timed out exception message should include encoded region name

2016-10-02 Thread Jerry He (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15541470#comment-15541470
 ] 

Jerry He commented on HBASE-16739:
--

+1

> Timed out exception message should include encoded region name
> --
>
> Key: HBASE-16739
> URL: https://issues.apache.org/jira/browse/HBASE-16739
> Project: HBase
>  Issue Type: Bug
>Reporter: Ted Yu
>Priority: Minor
> Attachments: 16739.v1.txt
>





[jira] [Commented] (HBASE-15560) TinyLFU-based BlockCache

2016-10-02 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15541420#comment-15541420
 ] 

Ted Yu commented on HBASE-15560:


+1, pending QA results.

> TinyLFU-based BlockCache
> 
>
> Key: HBASE-15560
> URL: https://issues.apache.org/jira/browse/HBASE-15560
> Project: HBase
>  Issue Type: Improvement
>  Components: BlockCache
>Affects Versions: 2.0.0
>Reporter: Ben Manes
>Assignee: Ben Manes
> Attachments: HBASE-15560.patch, HBASE-15560.patch, HBASE-15560.patch, 
> HBASE-15560.patch, HBASE-15560.patch, HBASE-15560.patch, tinylfu.patch
>
>
> LruBlockCache uses the Segmented LRU (SLRU) policy to capture frequency and 
> recency of the working set. It achieves concurrency by using an O(n) 
> background thread to prioritize the entries and evict. Accessing an entry is 
> O(1) via a hash table lookup, recording its logical access time, and setting a 
> frequency flag. A write is performed in O(1) time by updating the hash table 
> and triggering an async eviction thread. This provides ideal concurrency and 
> minimizes the latencies by penalizing the thread instead of the caller. 
> However, the policy does not age the frequencies and may not be resilient to 
> various workload patterns.
> W-TinyLFU ([research paper|http://arxiv.org/pdf/1512.00727.pdf]) records the 
> frequency in a counting sketch, ages periodically by halving the counters, 
> and orders entries by SLRU. An entry is discarded by comparing the frequency 
> of the new arrival (candidate) to the SLRU's victim, and keeping the one with 
> the higher frequency. This allows the operations to be performed in O(1) 
> time and, through the use of a compact sketch, a much larger history is 
> retained beyond the current working set. In a variety of real-world traces 
> the policy had [near-optimal hit 
> rates|https://github.com/ben-manes/caffeine/wiki/Efficiency].
> Concurrency is achieved by buffering and replaying the operations, similar to 
> a write-ahead log. A read is recorded into a striped ring buffer and a write 
> into a queue. The operations are applied in batches under a try-lock by an 
> asynchronous thread, thereby tracking the usage pattern without incurring high 
> latencies 
> ([benchmarks|https://github.com/ben-manes/caffeine/wiki/Benchmarks#server-class]).
> In YCSB benchmarks the results were inconclusive. For a large cache (99% hit 
> rates) the two caches have near-identical throughput and latencies, with 
> LruBlockCache narrowly winning. At medium and small cache sizes, TinyLFU had a 
> 1-4% hit-rate improvement and therefore lower latencies. The lackluster 
> result is because a synthetic Zipfian distribution is used, on which SLRU 
> performs optimally. In a more varied, real-world workload we'd expect to see 
> improvements from being able to make smarter predictions.
> The provided patch implements BlockCache using the 
> [Caffeine|https://github.com/ben-manes/caffeine] caching library (see the 
> HighScalability 
> [article|http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html]).
> Edward Bortnikov and Eshcar Hillel have graciously provided guidance for 
> evaluating this patch ([github 
> branch|https://github.com/ben-manes/hbase/tree/tinylfu]).
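
For reference, a minimal, self-contained sketch of building a size-bounded 
W-TinyLFU cache with Caffeine (String/byte[] stand in for the patch's actual 
key/value types; the real BlockCache wiring is more involved):

{code}
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class TinyLfuSketch {
  public static void main(String[] args) {
    // Eviction order is decided by W-TinyLFU's frequency sketch + SLRU,
    // not by an O(n) background prioritization thread as in LruBlockCache.
    Cache<String, byte[]> cache = Caffeine.newBuilder()
        .maximumWeight(1024L * 1024L)                        // bound total weight (1 MB here)
        .weigher((String key, byte[] block) -> block.length) // weigh entries by block size
        .recordStats()                                       // expose hit/miss counters
        .build();

    cache.put("block-1", new byte[64 * 1024]);
    boolean hit = cache.getIfPresent("block-1") != null;     // O(1); access goes to a ring buffer
    System.out.println("hit=" + hit + ", stats=" + cache.stats());
  }
}
{code}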





[jira] [Updated] (HBASE-15560) TinyLFU-based BlockCache

2016-10-02 Thread Ben Manes (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-15560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Manes updated HBASE-15560:
--
Attachment: HBASE-15560.patch

> TinyLFU-based BlockCache
> 
>
> Key: HBASE-15560
> URL: https://issues.apache.org/jira/browse/HBASE-15560
> Project: HBase
>  Issue Type: Improvement
>  Components: BlockCache
>Affects Versions: 2.0.0
>Reporter: Ben Manes
>Assignee: Ben Manes
> Attachments: HBASE-15560.patch, HBASE-15560.patch, HBASE-15560.patch, 
> HBASE-15560.patch, HBASE-15560.patch, HBASE-15560.patch, tinylfu.patch
>





[jira] [Commented] (HBASE-15560) TinyLFU-based BlockCache

2016-10-02 Thread Ben Manes (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15541173#comment-15541173
 ] 

Ben Manes commented on HBASE-15560:
---

We can add it now if that's desired, or defer the decision on whether to add or 
remove it to the team's consensus. Neither the Lru nor TinyLfu caches use the 
field in their policies, and I don't see how it provides meaningful information 
to users.

> TinyLFU-based BlockCache
> 
>
> Key: HBASE-15560
> URL: https://issues.apache.org/jira/browse/HBASE-15560
> Project: HBase
>  Issue Type: Improvement
>  Components: BlockCache
>Affects Versions: 2.0.0
>Reporter: Ben Manes
>Assignee: Ben Manes
> Attachments: HBASE-15560.patch, HBASE-15560.patch, HBASE-15560.patch, 
> HBASE-15560.patch, HBASE-15560.patch, tinylfu.patch
>





[jira] [Commented] (HBASE-16698) Performance issue: handlers stuck waiting for CountDownLatch inside WALKey#getWriteEntry under high writing workload

2016-10-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15540760#comment-15540760
 ] 

Hadoop QA commented on HBASE-16698:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 26s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 59s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s {color} | {color:green} branch-1 passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 33s {color} | {color:green} branch-1 passed with JDK v1.7.0_111 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 53s {color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} branch-1 passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 1m 51s {color} | {color:red} hbase-server in branch-1 has 1 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s {color} | {color:green} branch-1 passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 33s {color} | {color:green} branch-1 passed with JDK v1.7.0_111 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 45s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 30s {color} | {color:green} the patch passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 30s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s {color} | {color:green} the patch passed with JDK v1.7.0_111 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 55s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 16m 4s {color} | {color:green} The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:green}+1{color} | {color:green} hbaseprotoc {color} | {color:green} 0m 14s {color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 2m 6s {color} | {color:red} hbase-server generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 23s {color} | {color:green} the patch passed with JDK v1.8.0_101 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s {color} | {color:green} the patch passed with JDK v1.7.0_111 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 82m 44s {color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s {color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 113m 20s {color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hbase-server |
|  |  org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion$BatchOperationInProgress) does not release lock on all paths  At HRegion.java:on all paths  At HRegion.java:[line 3313] |
| Timed out junit tests | org.apache.hadoop.hbase.regionserver.TestClusterId |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.11.2 Server=1.11.2 Image:yetus/hbase:date2016-10-02 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12831253/HBASE-16698.branch-1.patch |
| JIRA Issue | HBASE-16698 |
| Optional Tests |  asflicense  javac  javadoc  unit  

[jira] [Commented] (HBASE-16698) Performance issue: handlers stuck waiting for CountDownLatch inside WALKey#getWriteEntry under high writing workload

2016-10-02 Thread Yu Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15540746#comment-15540746
 ] 

Yu Li commented on HBASE-16698:
---

btw, this testing is against the latest branch-1 code, not master

> Performance issue: handlers stuck waiting for CountDownLatch inside 
> WALKey#getWriteEntry under high writing workload
> 
>
> Key: HBASE-16698
> URL: https://issues.apache.org/jira/browse/HBASE-16698
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance
>Affects Versions: 1.1.6, 1.2.3
>Reporter: Yu Li
>Assignee: Yu Li
> Attachments: HBASE-16698.branch-1.patch, HBASE-16698.patch, 
> HBASE-16698.v2.patch, hadoop0495.et2.jstack
>
>
> As titled, in our production environment we observed 98 out of 128 handlers 
> stuck waiting for the CountDownLatch {{seqNumAssignedLatch}} inside 
> {{WALKey#getWriteEntry}} under a high writing workload.
> After digging into the problem, we found that it is mainly caused by 
> advancing mvcc in the append logic. Below is some detailed analysis:
> Under the current branch-1 code logic, all batch puts call 
> {{WALKey#getWriteEntry}} after appending an edit to the WAL, and 
> {{seqNumAssignedLatch}} is only released when the corresponding append call is 
> handled by RingBufferEventHandler (see {{FSWALEntry#stampRegionSequenceId}}). 
> Because we currently use a single event handler for the ringbuffer, the 
> append calls are handled one by one (in fact, much of our current logic 
> depends on this sequential handling), and this becomes a bottleneck 
> under a high writing workload.
> The worst part is that by default we only use one WAL per RS, so appends on 
> all regions are handled sequentially, which causes contention among 
> different regions...
> To fix this, we can make use of the "sequential appends" mechanism: grab the 
> WriteEntry before publishing the append onto the ringbuffer and use it as the 
> sequence id, except that we need to add a lock to make "grab WriteEntry" and 
> "append edit" a single transaction. This still causes contention within a 
> region but avoids contention between different regions. This solution has 
> already been verified in our online environment and proved effective.
> Note that for the master (2.0) branch, since we already changed the write 
> pipeline to sync before writing the memstore (HBASE-15158), this issue only 
> exists for the ASYNC_WAL write scenario.
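
A hedged sketch of the proposed ordering (method names such as {{mvcc.begin()}} 
and {{walKey.setWriteEntry()}} approximate the branch-1 MVCC/WAL API; the actual 
patch differs in detail):

{code}
// Hypothetical sketch: pre-assign the mvcc WriteEntry under a per-region lock,
// then publish the append; handlers no longer block on seqNumAssignedLatch
// waiting for the single ring buffer event handler.
private final Object seqIdLock = new Object();

private MultiVersionConcurrencyControl.WriteEntry appendAndAssign(
    WALKey walKey, WALEdit edit) throws IOException {
  synchronized (seqIdLock) {   // scope of contention: one region, not the whole RS
    MultiVersionConcurrencyControl.WriteEntry we = mvcc.begin(); // assign write number now
    walKey.setWriteEntry(we);                     // reuse it as the WAL sequence id
    wal.append(htd, regionInfo, walKey, edit, true); // publish onto the ring buffer
    return we;                                    // no latch wait needed afterwards
  }
}
{code}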





[jira] [Commented] (HBASE-16698) Performance issue: handlers stuck waiting for CountDownLatch inside WALKey#getWriteEntry under high writing workload

2016-10-02 Thread Yu Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15540744#comment-15540744
 ] 

Yu Li commented on HBASE-16698:
---

This is a common workload, nothing special. I hope this simple result answers 
your question [~chenheng]; otherwise please wait for the YCSB result. Thanks.

> Performance issue: handlers stuck waiting for CountDownLatch inside 
> WALKey#getWriteEntry under high writing workload
> 
>
> Key: HBASE-16698
> URL: https://issues.apache.org/jira/browse/HBASE-16698
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance
>Affects Versions: 1.1.6, 1.2.3
>Reporter: Yu Li
>Assignee: Yu Li
> Attachments: HBASE-16698.branch-1.patch, HBASE-16698.patch, 
> HBASE-16698.v2.patch, hadoop0495.et2.jstack
>





[jira] [Commented] (HBASE-16698) Performance issue: handlers stuck waiting for CountDownLatch inside WALKey#getWriteEntry under high writing workload

2016-10-02 Thread Yu Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15540740#comment-15540740
 ] 

Yu Li commented on HBASE-16698:
---

OK, here are the results with a single RS and the PE tool, using the command below:
{{hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred 
--table=PERandomWrite --presplit=20 --latency randomWrite 20}}

round-1:
||Type||AverageTime(ms)||ThroughputPerClient||
|w/ patch|376150|2.74MB/s|
|w/o patch|382549|2.69MB/s|

round-2:
||Type||AverageTime(ms)||ThroughputPerClient||
|w/ patch|381925|2.70MB/s|
|w/o patch|385666|2.67MB/s|

round-3:
||Type||AverageTime(ms)||ThroughputPerClient||
|w/ patch|364555|2.83MB/s|
|w/o patch|374948|2.75MB/s|

And when testing w/o the patch we could easily see handlers waiting on the 
CountDownLatch.

This simple result shows the effect of the patch to some extent, though not 
conclusively (I'm afraid the PE output for writes is not well formatted, so we 
cannot see metrics like throughput directly...). Will upload more test results 
with a YCSB workload later.

> Performance issue: handlers stuck waiting for CountDownLatch inside 
> WALKey#getWriteEntry under high writing workload
> 
>
> Key: HBASE-16698
> URL: https://issues.apache.org/jira/browse/HBASE-16698
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance
>Affects Versions: 1.1.6, 1.2.3
>Reporter: Yu Li
>Assignee: Yu Li
> Attachments: HBASE-16698.branch-1.patch, HBASE-16698.patch, 
> HBASE-16698.v2.patch, hadoop0495.et2.jstack
>





[jira] [Updated] (HBASE-16698) Performance issue: handlers stuck waiting for CountDownLatch inside WALKey#getWriteEntry under high writing workload

2016-10-02 Thread Yu Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Li updated HBASE-16698:
--
Attachment: HBASE-16698.branch-1.patch

Uploading the patch for branch-1; will upload perf numbers later. Sorry for the lag.

> Performance issue: handlers stuck waiting for CountDownLatch inside 
> WALKey#getWriteEntry under high writing workload
> 
>
> Key: HBASE-16698
> URL: https://issues.apache.org/jira/browse/HBASE-16698
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance
>Affects Versions: 1.1.6, 1.2.3
>Reporter: Yu Li
>Assignee: Yu Li
> Attachments: HBASE-16698.branch-1.patch, HBASE-16698.patch, 
> HBASE-16698.v2.patch, hadoop0495.et2.jstack
>





[jira] [Commented] (HBASE-16373) precommit needs a dockerfile with hbase prereqs

2016-10-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15540504#comment-15540504
 ] 

Hadoop QA commented on HBASE-16373:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s {color} | {color:blue} Docker mode activated. {color} |
| {color:blue}0{color} | {color:blue} shelldocs {color} | {color:blue} 0m 7s {color} | {color:blue} Shelldocs was not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s {color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} shellcheck {color} | {color:green} 0m 3s {color} | {color:green} There were no new shellcheck issues. {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 10m 0s {color} | {color:green} The patch does not cause any errors with Hadoop 2.4.0 2.4.1 2.5.0 2.5.1 2.5.2 2.6.1 2.6.2 2.6.3 2.7.1. {color} |
| {color:red}-1{color} | {color:red} hbaseprotoc {color} | {color:red} 3m 16s {color} | {color:red} root in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 37s {color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 15m 0s {color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=1.11.2 Server=1.11.2 Image:yetus/hbase:date2016-10-02 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12831250/HBASE-16373-0.98.patch |
| JIRA Issue | HBASE-16373 |
| Optional Tests |  asflicense  shellcheck  shelldocs  |
| uname | Linux b70f2d16a187 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/hbase.sh |
| git revision | 0.98 / 744b867 |
| shellcheck | v0.4.4 |
| hbaseprotoc | https://builds.apache.org/job/PreCommit-HBASE-Build/3793/artifact/patchprocess/patch-hbaseprotoc-root.txt |
| modules | C: . U: . |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/3793/console |
| Powered by | Apache Yetus 0.3.0   http://yetus.apache.org |


This message was automatically generated.



> precommit needs a dockerfile with hbase prereqs
> ---
>
> Key: HBASE-16373
> URL: https://issues.apache.org/jira/browse/HBASE-16373
> Project: HBase
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.3.0, 1.4.0, 1.1.6, 1.2.3, 0.98.22
>Reporter: Sean Busbey
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 1.3.0, 1.4.0, 1.1.7, 0.98.23, 1.2.4
>
> Attachments: HBASE-16373-0.98.patch, HBASE-16373-branch-1.patch
>
>
> Specifically, we need protoc. Starting with the dockerfile used by default in 
> Yetus and adding protoc to it will probably suffice.





[jira] [Updated] (HBASE-16373) precommit needs a dockerfile with hbase prereqs

2016-10-02 Thread Duo Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-16373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang updated HBASE-16373:
--
Attachment: HBASE-16373-0.98.patch

Install Oracle Java 6 for 0.98.

> precommit needs a dockerfile with hbase prereqs
> ---
>
> Key: HBASE-16373
> URL: https://issues.apache.org/jira/browse/HBASE-16373
> Project: HBase
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.3.0, 1.4.0, 1.1.6, 1.2.3, 0.98.22
>Reporter: Sean Busbey
>Assignee: Duo Zhang
>Priority: Critical
> Fix For: 1.3.0, 1.4.0, 1.1.7, 0.98.23, 1.2.4
>
> Attachments: HBASE-16373-0.98.patch, HBASE-16373-branch-1.patch
>





[jira] [Commented] (HBASE-15536) Make AsyncFSWAL as our default WAL

2016-10-02 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-15536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15540424#comment-15540424
 ] 

Duo Zhang commented on HBASE-15536:
---

It seems there is currently no key-value size limitation on the server side, and 
the max key-value size only limits a single key-value. That means you can add 
100 cells into one Put object with each cell 1MB; the size of the Put is then 
100MB and breaks our {{AsyncFSWAL}}...

We could introduce a new config named 'max.wal.edit.size' to limit the size of 
one atomic operation, with a default value below 16MB. This could make it 
possible to make AsyncFSWAL our default WAL.

BTW, I think we could add 'max.wal.edit.size' as a new stop condition when 
grabbing row locks in a multi operation, since multi does not guarantee 
cross-row atomicity. Then AsyncFSWAL could also work with batch operations 
larger than 16MB, unless the size of one row is larger than 16MB or 
MultiRowMutationEndpoint is used.

What do you think? [~stack] [~anoopsamjohn]. Maybe too many configs... Thanks.
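
A sketch of what such a guard could look like ('max.wal.edit.size' is only the 
key proposed above, not an existing config, and {{CellUtil.estimatedSerializedSizeOf}} 
is assumed here as one plausible way to size an edit):

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

public final class WalEditSizeGuard {
  private WalEditSizeGuard() {}

  /** Reject a single atomic operation too large for the WAL to swallow whole. */
  public static void checkEditSize(Configuration conf, WALEdit edit) throws IOException {
    // "max.wal.edit.size" is the proposed key, not an existing HBase config.
    long max = conf.getLong("max.wal.edit.size", 16L * 1024 * 1024 - 1);
    long size = 0;
    for (Cell cell : edit.getCells()) {
      size += CellUtil.estimatedSerializedSizeOf(cell);
    }
    if (size > max) {
      throw new IOException("WAL edit size " + size
          + " exceeds max.wal.edit.size=" + max);
    }
  }
}
{code}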

> Make AsyncFSWAL as our default WAL
> --
>
> Key: HBASE-15536
> URL: https://issues.apache.org/jira/browse/HBASE-15536
> Project: HBase
>  Issue Type: Sub-task
>  Components: wal
>Affects Versions: 2.0.0
>Reporter: Duo Zhang
>Assignee: Duo Zhang
> Fix For: 2.0.0
>
> Attachments: HBASE-15536-v1.patch, HBASE-15536-v2.patch, 
> HBASE-15536-v3.patch, HBASE-15536-v4.patch, HBASE-15536-v5.patch, 
> HBASE-15536.patch
>
>
> As it should be predicated on passing basic cluster ITBLL





[jira] [Commented] (HBASE-16690) Move znode path configs to a separated class

2016-10-02 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15540362#comment-15540362
 ] 

Duo Zhang commented on HBASE-16690:
---

The timed-out tests pass locally.

> Move znode path configs to a separated class
> 
>
> Key: HBASE-16690
> URL: https://issues.apache.org/jira/browse/HBASE-16690
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Duo Zhang
>Assignee: Duo Zhang
> Fix For: 2.0.0
>
> Attachments: HBASE-16690-v1.patch, HBASE-16690-v2.patch, 
> HBASE-16690.patch
>
>
> This makes it easier to use Curator on the client side in the future.
> It will also fix the bad practice of declaring masterMaintZNode as static but 
> updating it in a non-static method.
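
For illustration, the anti-pattern in question looks roughly like this 
(simplified, hypothetical code; not the actual class):

{code}
public class ZNodePathsExample {
  // Bad practice: a static field...
  public static String masterMaintZNode;

  // ...assigned from a non-static method, so constructing a second instance
  // silently overwrites state shared by every other instance.
  public void setZNodePaths(String baseZNode) {
    masterMaintZNode = baseZNode + "/master-maintenance";
  }
}
{code}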





[jira] [Commented] (HBASE-16721) Concurrency issue in WAL unflushed seqId tracking

2016-10-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15539924#comment-15539924
 ] 

Hudson commented on HBASE-16721:


FAILURE: Integrated in Jenkins build HBase-1.1-JDK8 #1874 (See 
[https://builds.apache.org/job/HBase-1.1-JDK8/1874/])
HBASE-16721 Concurrency issue in WAL unflushed seqId tracking - ADDENDUM (enis: 
rev e8019307c58da488ad26555d104990cd1c005dab)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/wal/WAL.java


> Concurrency issue in WAL unflushed seqId tracking
> -
>
> Key: HBASE-16721
> URL: https://issues.apache.org/jira/browse/HBASE-16721
> Project: HBase
>  Issue Type: Bug
>  Components: wal
>Affects Versions: 1.0.0, 1.1.0, 1.2.0
>Reporter: Enis Soztutar
>Assignee: Enis Soztutar
>Priority: Critical
> Fix For: 2.0.0, 1.3.0, 1.4.0, 1.2.4, 1.1.8
>
> Attachments: hbase-16721_addendum.patch, 
> hbase-16721_v1.branch-1.patch, hbase-16721_v2.branch-1.patch, 
> hbase-16721_v2.master.patch
>
>
> I'm inspecting an interesting case where, in a production cluster, some 
> regionservers end up accumulating hundreds of WAL files, even with forced 
> flushes going on due to maxlogs. This happened multiple times on the 
> cluster, but not on other clusters. The cluster has the periodic memstore 
> flusher disabled; however, this still does not explain why the forced flush 
> of regions due to the max limit is not working. I think the periodic memstore 
> flusher just masks the underlying problem, which is why we do not see this in 
> other clusters. 
> The problem starts like this: 
> {code}
> 2016-09-21 17:49:18,272 INFO  [regionserver//10.2.0.55:16020.logRoller] 
> wal.FSHLog: Too many wals: logs=33, maxlogs=32; forcing flush of 1 
> regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-21 17:49:18,273 WARN  [regionserver//10.2.0.55:16020.logRoller] 
> regionserver.LogRoller: Failed to schedule flush of 
> d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> then, it continues until the RS is restarted: 
> {code}
> 2016-09-23 17:43:49,356 INFO  [regionserver//10.2.0.55:16020.logRoller] 
> wal.FSHLog: Too many wals: logs=721, maxlogs=32; forcing flush of 1 
> regions(s): d4cf39dc40ea79f5da4d0cf66d03cb1f
> 2016-09-23 17:43:49,357 WARN  [regionserver//10.2.0.55:16020.logRoller] 
> regionserver.LogRoller: Failed to schedule flush of 
> d4cf39dc40ea79f5da4d0cf66d03cb1f, region=null, requester=null
> {code}
> The problem is that region {{d4cf39dc40ea79f5da4d0cf66d03cb1f}} was already 
> split some time ago, and was able to flush its data and split without any 
> problems. However, the FSHLog still thinks that there is unflushed data 
> for this region. 
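
For intuition, a hedged and much-simplified model of the per-region accounting 
that appears to go stale here (field and method names are illustrative, not 
FSHLog's actual internals):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model: the WAL tracks the oldest unflushed sequence id per region.
public class UnflushedSeqIdTracker {
  private final Map<String, Long> oldestUnflushedSeqIds = new ConcurrentHashMap<>();

  void onAppend(String encodedRegionName, long seqId) {
    oldestUnflushedSeqIds.putIfAbsent(encodedRegionName, seqId); // first unflushed edit
  }

  void onFlushCompleted(String encodedRegionName) {
    // If this removal is lost (e.g. a race around flush/split), the stale entry
    // pins every later WAL file, and "too many wals" keeps nominating a region
    // that no longer exists; exactly the symptom in the logs above.
    oldestUnflushedSeqIds.remove(encodedRegionName);
  }

  boolean canArchive(long walMaxSeqId) {
    // A WAL file is archivable only when no region still has unflushed
    // edits at or below that file's highest sequence id.
    return oldestUnflushedSeqIds.values().stream().allMatch(s -> s > walMaxSeqId);
  }
}
{code}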



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)