[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-06-27 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874109#comment-16874109
 ] 

Hudson commented on HBASE-22422:


Results for branch branch-2
[build #2029 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2029/]: 
(x) *{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2029//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2029//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/branch-2/2029//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}


> Retain an ByteBuff with refCnt=0 when getBlock from LRUCache
> 
>
> Key: HBASE-22422
> URL: https://issues.apache.org/jira/browse/HBASE-22422
> Project: HBase
> Issue Type: Sub-task
> Components: BlockCache
> Reporter: Zheng Hu
> Assignee: Zheng Hu
> Priority: Major
> Attachments: 0001-debug2.patch, 0001-debug2.patch, 0001-debug2.patch, 
> 0001-debug3.patch, 0001-debug4.patch, 
> HBASE-22422-qps-after-fix-the-zero-retain-bug.png, 
> HBASE-22422.HBASE-21879.v01.patch, HBASE-22422.HBASE-21879.v02.patch, 
> LRUBlockCache-getBlock.png, debug.patch, 
> failed-to-check-positive-on-web-ui.png, image-2019-05-15-12-00-03-641.png
>
>
> After running the YCSB scan/get benchmark in our XiaoMi cluster, we found the get 
> QPS dropped from 25000/s to hundreds per second in a cluster with five 
> nodes.
> After enabling the debug log on the YCSB client side, I found the following 
> stack trace, see 
> https://issues.apache.org/jira/secure/attachment/12968745/image-2019-05-15-12-00-03-641.png.
>  
> After looking into the stack trace, I can confirm that the zero-refCnt block is 
> an intermediate index block, see [2] http://hbase.apache.org/images/hfilev2.png
> Need a patch to fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-06-24 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870934#comment-16870934
 ] 

Hudson commented on HBASE-22422:


Results for branch master
[build #1168 on 
builds.a.o|https://builds.apache.org/job/HBase%20Nightly/job/master/1168/]: (x) 
*{color:red}-1 overall{color}*

details (if available):

(x) {color:red}-1 general checks{color}
-- For more information [see general 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1168//General_Nightly_Build_Report/]




(x) {color:red}-1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1168//JDK8_Nightly_Build_Report_(Hadoop2)/]


(x) {color:red}-1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) 
report|https://builds.apache.org/job/HBase%20Nightly/job/master/1168//JDK8_Nightly_Build_Report_(Hadoop3)/]


(/) {color:green}+1 source release artifact{color}
-- See build output for details.


(/) {color:green}+1 client integration test{color}




[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-28 Thread ramkrishna.s.vasudevan (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849899#comment-16849899
 ] 

ramkrishna.s.vasudevan commented on HBASE-22422:


[~openinx]
I just asked a question in the PR. 



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-27 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849286#comment-16849286
 ] 

Zheng Hu commented on HBASE-22422:
--

Pushed to the HBASE-21879 branch. Thanks [~Apache9] & [~ram_krish] for reviewing.



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-26 Thread HBase QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848366#comment-16848366
 ] 

HBase QA commented on HBASE-22422:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  2m 41s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  1s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
|| || || || {color:brown} HBASE-21879 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 26s{color} | {color:green} HBASE-21879 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 51s{color} | {color:green} HBASE-21879 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 11s{color} | {color:green} HBASE-21879 passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 23s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  2m 47s{color} | {color:blue} hbase-server in HBASE-21879 has 11 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 32s{color} | {color:green} HBASE-21879 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m  2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 50s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 12s{color} | {color:red} hbase-server: The patch generated 1 new + 130 unchanged - 2 fixed = 131 total (was 132) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 24s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green}  8m 25s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 32s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}130m 17s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 29s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}170m 49s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/PreCommit-HBASE-Build/419/artifact/patchprocess/Dockerfile |
| JIRA Issue | HBASE-22422 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12969803/HBASE-22422.HBASE-21879.v02.patch |
| Optional Tests |  dupname  asflicense  javac  javadoc  unit  findbugs  shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux a8b269ea44a8 4.4.0-138-generic #164-Ubuntu SMP Tue Oct 2 17:16:02 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | HBASE-21879 / 111c95c11c |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_181 |
| findbugs | v3.1.11 |
| checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/419/artifact/patchprocess/diff-checkstyle-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/419/testReport/ |
| Max. process+thread count | 4895 (vs. ulimit of 1) |
| 

[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-24 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848024#comment-16848024
 ] 

Zheng Hu commented on HBASE-22422:
--

Uploaded a picture showing the current YCSB result (see 
https://issues.apache.org/jira/secure/attachment/12969731/HBASE-22422-qps-after-fix-the-zero-retain-bug.png).
At least the QPS no longer drops to hundreds, but the sawtooth curve still 
looks somewhat strange; anyway, I will keep digging.



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-23 Thread ramkrishna.s.vasudevan (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847259#comment-16847259
 ] 

ramkrishna.s.vasudevan commented on HBASE-22422:


bq. Understood now: it's a concurrency bug in RAMCache. Say thread1 tries to 
getBlock as follows: 
Good one. 



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-23 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847199#comment-16847199
 ] 

Zheng Hu commented on HBASE-22422:
--

A data block read failure leads to an extra index-block release. That is to 
say, there will be an index block in LruBlockCache with refCnt=0, and every 
subsequent RPC that requests this zero-refCnt index block gets an 
IllegalReferenceCountException, which makes the QPS drop from 25000/s to 
hundreds per second. 


Let me explain in detail; see the method 
HFileBlockIndex#loadDataBlockWithScanInfo: 

{code}
HFileBlock block = null;
boolean dataBlock = false;
KeyOnlyKeyValue tmpNextIndexKV = new KeyValue.KeyOnlyKeyValue();
while (true) {
  try {
    // ...
    block = cachingBlockReader.readBlock(currentOffset, currentOnDiskSize, shouldCache, pread,
      isCompaction, true, expectedBlockType, expectedDataBlockEncoding);
    // ... loop until we get a data block ...
  } finally {
    if (!dataBlock && block != null) {
      // Release the block immediately if it is not the data block
      block.release();
    }
  }
}
{code}

In the first iteration of the while loop, the block is an index block and is 
read successfully from the LruBlockCache. 
In the second iteration, we need to read a data block from 
CombinedBlockCache, but the read fails because of the RAMCache concurrency 
issue described in my previous comment, so cachingBlockReader#readBlock throws 
an exception. The block variable still references the index block, so the 
finally block releases it a second time.
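
The double release can be reproduced, and avoided, in a small self-contained sketch. This is only an illustration under stated assumptions: SketchBlock, SketchBlockReader and IndexWalk are hypothetical stand-ins and not HBase classes, and scoping the block variable to one loop iteration is just one possible way to avoid the stale reference, not necessarily what the committed patch does.

```java
import java.io.IOException;

/** Hypothetical reference-counted block; a stand-in, not the HBase HFileBlock. */
final class SketchBlock {
  final boolean isData;
  private int refCnt = 1; // the cache's own reference

  SketchBlock(boolean isData) {
    this.isData = isData;
  }

  void retain() {
    if (refCnt <= 0) {
      throw new IllegalStateException("refCnt: " + refCnt + ", increment: 1");
    }
    refCnt++;
  }

  void release() {
    if (refCnt <= 0) {
      throw new IllegalStateException("refCnt: " + refCnt + ", decrement: 1");
    }
    refCnt--;
  }

  int refCnt() {
    return refCnt;
  }
}

/** Hypothetical reader; each call returns a retained block or throws. */
interface SketchBlockReader {
  SketchBlock readBlock() throws IOException;
}

final class IndexWalk {
  /** Walks index blocks until a data block is found; the caller owns the returned block. */
  static SketchBlock loadDataBlock(SketchBlockReader reader) throws IOException {
    while (true) {
      // Declared inside the loop, so a failed read in a later iteration
      // never sees the index block that an earlier iteration already released.
      SketchBlock block = null;
      boolean dataBlock = false;
      try {
        block = reader.readBlock();
        if (block.isData) {
          dataBlock = true;
          return block;
        }
        // An index block would be searched here for the next offset to read.
      } finally {
        if (!dataBlock && block != null) {
          // Release the block immediately if it is not the data block.
          block.release();
        }
      }
    }
  }
}
```

With the variable declared outside the loop, as in the excerpt, the same failing read would release the index block a second time and take the cache's own reference with it.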



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-23 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847185#comment-16847185
 ] 

Zheng Hu commented on HBASE-22422:
--

Understood now: it's a concurrency bug in RAMCache. Say thread1 tries to 
getBlock as follows: 
Step.1: get block1 from RAMCache#delegate; 
Step.2: call block1#retain to increase its refCnt. 
But another thread2 has flushed the block into the IOEngine and starts to clear 
the block from RAMCache: 
Step.a: get block1 by RAMCache#delegate.remove;
Step.b: call block1#release to decrease its refCnt. 

If the steps above are ordered as follows: 
Step.1: get block1 from RAMCache#delegate; 
Step.a: get block1 by RAMCache#delegate.remove;
Step.b: call block1#release to decrease its refCnt; here the refCnt 
decreases from 1 to 0;
Step.2: call block1#retain to increase its refCnt; 

then the concurrency bug occurs. One way to fix this is to make 
getAndRetain/removeAndRelease atomic.
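
A minimal sketch of what an atomic getAndRetain/removeAndRelease pair could look like, assuming the delegate is a java.util.concurrent.ConcurrentHashMap. CountedBlock and AtomicRamCache are hypothetical names used only for illustration; they are not the HBase classes, and the String key stands in for the real cache key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical reference-counted block; a stand-in, not the HBase class. */
final class CountedBlock {
  private final AtomicInteger refCnt = new AtomicInteger(1);

  /** Increments the count; fails if the block was already fully released. */
  void retain() {
    int prev;
    do {
      prev = refCnt.get();
      if (prev <= 0) {
        throw new IllegalStateException("refCnt: " + prev + ", increment: 1");
      }
    } while (!refCnt.compareAndSet(prev, prev + 1));
  }

  /** Decrements the count and reports whether it reached zero. */
  boolean release() {
    return refCnt.decrementAndGet() == 0;
  }

  int refCnt() {
    return refCnt.get();
  }
}

/** Hypothetical RAM cache whose lookup and retain form one atomic step per key. */
final class AtomicRamCache {
  private final ConcurrentHashMap<String, CountedBlock> delegate = new ConcurrentHashMap<>();

  void put(String key, CountedBlock block) {
    delegate.put(key, block);
  }

  /** Returns the retained block, or null if the key is absent. */
  CountedBlock getAndRetain(String key) {
    // computeIfPresent runs the function atomically for this key, so a
    // concurrent removeAndRelease cannot run between the lookup and the retain.
    return delegate.computeIfPresent(key, (k, block) -> {
      block.retain();
      return block;
    });
  }

  /** Removes the mapping and releases the cache's reference in the same atomic step. */
  boolean removeAndRelease(String key) {
    boolean[] removed = {false};
    delegate.computeIfPresent(key, (k, block) -> {
      block.release();
      removed[0] = true;
      return null; // a null result removes the mapping
    });
    return removed[0];
  }
}
```

With both operations going through computeIfPresent on the same key, Step.a and Step.b can no longer interleave between Step.1 and Step.2: the reader either retains the block before it is removed or finds no mapping and gets null.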

 



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-23 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847172#comment-16847172
 ] 

Zheng Hu commented on HBASE-22422:
--

After running for some hours, the bug reproduced in my pressure-test cluster, 
with the following log: 
{code}
2019-05-24,03:43:10,796 INFO org.apache.hadoop.hbase.nio.RefCnt: ===> Start to dump callerSet for #641783987
2019-05-24,03:43:10,796 INFO org.apache.hadoop.hbase.nio.RefCnt:   --> #641783987 -> caller: HFileScannerImpl#returnBlocks: return curBlock, refCnt before release is: 2
2019-05-24,03:43:10,796 INFO org.apache.hadoop.hbase.nio.RefCnt:   --> #641783987 -> caller: RAMCache#remove, refCnt before release is: 1
2019-05-24,03:43:10,796 INFO org.apache.hadoop.hbase.nio.RefCnt: ===> End to dump callerSet #641783987
2019-05-24,03:43:10,801 INFO org.apache.hadoop.hbase.regionserver.HRegion: Encountered an unknown exception in RegionScannerImpl: 
org.apache.hbase.thirdparty.io.netty.util.IllegalReferenceCountException: refCnt: 0, increment: 1
    at org.apache.hbase.thirdparty.io.netty.util.AbstractReferenceCounted.retain0(AbstractReferenceCounted.java:87)
    at org.apache.hbase.thirdparty.io.netty.util.AbstractReferenceCounted.retain(AbstractReferenceCounted.java:74)
    at org.apache.hadoop.hbase.nio.RefCnt.retain(RefCnt.java:73)
    at org.apache.hadoop.hbase.nio.SingleByteBuff.retain(SingleByteBuff.java:398)
    at org.apache.hadoop.hbase.nio.SingleByteBuff.retain(SingleByteBuff.java:39)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock.retain(HFileBlock.java:457)
    at org.apache.hadoop.hbase.io.hfile.HFileBlock.retain(HFileBlock.java:115)
    at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache$RAMCache.get(BucketCache.java:1539)
    at org.apache.hadoop.hbase.io.hfile.bucket.BucketCache.getBlock(BucketCache.java:483)
    at org.apache.hadoop.hbase.io.hfile.CombinedBlockCache.getBlock(CombinedBlockCache.java:85)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.getCachedBlock(HFileReaderImpl.java:1306)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1472)
    at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$CellBasedKeyBlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:339)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:843)
    at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekTo(HFileReaderImpl.java:794)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:315)
    at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:216)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:394)
    at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:249)
    at org.apache.hadoop.hbase.regionserver.HStore.createScanner(HStore.java:2063)
    at org.apache.hadoop.hbase.regionserver.HStore.getScanner(HStore.java:2054)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.initializeScanners(HRegion.java:6493)
    at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.<init>(HRegion.java:6473)
    at org.apache.hadoop.hbase.regionserver.HRegion.instantiateRegionScanner(HRegion.java:2999)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2979)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2961)
    at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:2955)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2621)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2548)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41998)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:374)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:132)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304)
2019-05-24,03:43:10,813 INFO org.apache.hadoop.hbase.nio.RefCnt: ===> Start to dump callerSet for #312566113
2019-05-24,03:43:10,813 INFO org.apache.hadoop.hbase.nio.RefCnt:   --> #312566113 -> caller: CellBasedKeyBlockIndexReader#loadDataBlockWithScanInfo, refCnt before release is: 1
2019-05-24,03:43:10,813 INFO org.apache.hadoop.hbase.nio.RefCnt:   --> #312566113 -> caller: CellBasedKeyBlockIndexReader#loadDataBlockWithScanInfo, refCnt before release is: 2
2019-05-24,03:43:10,813 INFO org.apache.hadoop.hbase.nio.RefCnt:   --> #312566113 -> caller: CellBasedKeyBlockIndexReader#loadDataBlockWithScanInfo, refCnt 

[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-23 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846593#comment-16846593
 ] 

Zheng Hu commented on HBASE-22422:
--

Attached debug4 as mentioned above; let's see what happens in my pressure-test cluster.



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-23 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846518#comment-16846518
 ] 

Zheng Hu commented on HBASE-22422:
--

The bug has never happened since I applied debug3.patch, which is somewhat 
frustrating. After discussing with [~Apache9], it's possible that capturing the 
stack trace makes release a bit slower, so the concurrency bug disappears. 
{code}
+  @Override
+  public boolean release() {
+    callerSet.add(debugString(Thread.currentThread().getStackTrace(), this.refCnt()));
+    return super.release();
+  }
{code}
So I plan to pass the caller's string message as an argument into release, so 
that it won't cost as much time as capturing the stack trace does. Meanwhile, I 
will continue to check those code paths. 
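
The plan of passing a caller string into release can be sketched as follows. This is an assumption-laden illustration, not the contents of debug4.patch: TaggedRefCnt and its methods are hypothetical names, and the recorded message format simply imitates the "refCnt before release is" lines seen in the log output.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical ref counter that records who released it without capturing a stack trace. */
final class TaggedRefCnt {
  private final AtomicInteger refCnt = new AtomicInteger(1);
  private final Set<String> callerSet = ConcurrentHashMap.newKeySet();

  void retain() {
    refCnt.incrementAndGet();
  }

  /**
   * Releases one reference and records the caller-supplied tag.
   * Building a short string is much cheaper than Thread#getStackTrace,
   * so it should disturb thread timing far less.
   */
  boolean release(String caller) {
    int before = refCnt.getAndDecrement();
    callerSet.add("caller: " + caller + ", refCnt before release is: " + before);
    return before == 1;
  }

  /** Returns the recorded release callers, for dumping when an error occurs. */
  Set<String> callers() {
    return callerSet;
  }
}
```

Each call site then passes a fixed tag, for example release("RAMCache#remove"), and the set can be dumped when an IllegalReferenceCountException is caught.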



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-21 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845433#comment-16845433
 ] 

Zheng Hu commented on HBASE-22422:
--

After running for about 12 hours in my pressure-test cluster, still no 
IllegalReferenceCountException has happened. That's strange; I will wait some 
more time. 



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-21 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844851#comment-16844851
 ] 

Zheng Hu commented on HBASE-22422:
--

I've applied debug3.patch to my test cluster and am still waiting for the 
IllegalReferenceCountException...



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-21 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844564#comment-16844564
 ] 

Zheng Hu commented on HBASE-22422:
--

Updated the patch to debug2.patch, which only logs the release caller for 
non-data HFileBlocks. 



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-21 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844541#comment-16844541
 ] 

Zheng Hu commented on HBASE-22422:
--

After applying debug.patch, the RS now easily runs into full GC and restarts 
because of the high Get throughput. 



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-20 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16844460#comment-16844460
 ] 

Zheng Hu commented on HBASE-22422:
--

I've checked all the code and made some patches to confirm the bug, but they 
did not seem to work. So I wrote a simple patch to dump the stack traces of all 
release callers if any IllegalReferenceCountException happens on retain. 
Let's see what it says. 
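
The idea can be sketched roughly as below. This is only an illustration and not 
the attached debug.patch: TrackedBuf is a hypothetical class, and 
IllegalStateException stands in for IllegalReferenceCountException. Every 
release records its caller, and a failing retain attaches all recorded callers 
to the exception it throws.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reference-counted buffer that remembers who released it.
class TrackedBuf {
  private final AtomicInteger refCnt = new AtomicInteger(1);
  // One Throwable per release() call; its stack trace identifies the caller.
  private final List<Throwable> releaseCallers = new CopyOnWriteArrayList<>();

  void release() {
    releaseCallers.add(new Throwable("release caller"));
    refCnt.decrementAndGet();
  }

  void retain() {
    while (true) {
      int cur = refCnt.get();
      if (cur <= 0) {
        // Stand-in for IllegalReferenceCountException (assumption).
        IllegalStateException e = new IllegalStateException("retain with refCnt=" + cur);
        // Attach every recorded release caller so the log shows who dropped the count.
        for (Throwable t : releaseCallers) {
          e.addSuppressed(t);
        }
        throw e;
      }
      if (refCnt.compareAndSet(cur, cur + 1)) {
        return;
      }
    }
  }

  int refCnt() {
    return refCnt.get();
  }
}
```

Capturing a stack trace on every release is expensive, so something like this 
is only suitable for debugging.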



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-16 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841249#comment-16841249
 ] 

Zheng Hu commented on HBASE-22422:
--

Uploaded an initial patch that addresses the comments above. I will design a UT 
for each case and also run the benchmark again.



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-15 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840973#comment-16840973
 ] 

Zheng Hu commented on HBASE-22422:
--

Here we should consider releasing the unpacked block if prepareDecoding 
fails:
{code}
  HFileBlock unpack(HFileContext fileContext, FSReader reader) throws IOException {
    if (!fileContext.isCompressedOrEncrypted()) {
      // TODO: cannot use our own fileContext here because HFileBlock(ByteBuffer, boolean),
      // which is used for block serialization to L2 cache, does not preserve encoding and
      // encryption details.
      return this;
    }

    HFileBlock unpacked = new HFileBlock(this);
    unpacked.allocateBuffer(); // allocates space for the decompressed block

    HFileBlockDecodingContext ctx = blockType == BlockType.ENCODED_DATA
      ? reader.getBlockDecodingContext() : reader.getDefaultBlockDecodingContext();

    ByteBuff dup = this.buf.duplicate();
    dup.position(this.headerSize());
    dup = dup.slice();

    ctx.prepareDecoding(unpacked.getOnDiskSizeWithoutHeader(),
      unpacked.getUncompressedSizeWithoutHeader(), unpacked.getBufferWithoutHeader(true), dup);

    return unpacked;
  }
{code}
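
A minimal sketch of the release-on-failure pattern, assuming simplified 
stand-in types (UnpackBlock, Decoder and UnpackSketch are hypothetical and not 
the real HFileBlock API): the newly allocated block is released whenever 
decoding does not complete, so its buffer is not leaked.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a reference-counted block.
class UnpackBlock {
  private final AtomicInteger refCnt = new AtomicInteger(1);

  void release() {
    refCnt.decrementAndGet();
  }

  int refCnt() {
    return refCnt.get();
  }
}

// Hypothetical stand-in for the decoding context.
interface Decoder {
  void prepareDecoding(UnpackBlock src, UnpackBlock dst) throws IOException;
}

class UnpackSketch {
  static UnpackBlock unpack(UnpackBlock packed, Decoder decoder) throws IOException {
    // Stands in for new HFileBlock(this) plus allocateBuffer().
    UnpackBlock unpacked = new UnpackBlock();
    boolean succeeded = false;
    try {
      decoder.prepareDecoding(packed, unpacked);
      succeeded = true;
      return unpacked;
    } finally {
      if (!succeeded) {
        // Decoding failed: nobody else holds this block, so release it here.
        unpacked.release();
      }
    }
  }
}
```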



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-15 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840955#comment-16840955
 ] 

Zheng Hu commented on HBASE-22422:
--

Another risk is in LruBlockCache#evictBlock: we should move 
previous.getBuffer().release() to the last line before the return, because once 
the release decreases the refCnt to zero, nobody (such as the victimHandler) 
can access the buf any more.
{code}
  protected long evictBlock(LruCachedBlock block, boolean evictedByEvictionProcess) {
    LruCachedBlock previous = map.remove(block.getCacheKey());
    if (previous == null) {
      return 0;
    }
    // Decrease the block's reference count, and if refCount is 0, then it'll auto-deallocate.
    previous.getBuffer().release();
    updateSizeMetrics(block, true);
    long val = elements.decrementAndGet();
    if (LOG.isTraceEnabled()) {
      long size = map.size();
      assertCounterSanity(size, val);
    }
    if (block.getBuffer().getBlockType().isData()) {
      dataBlockElements.decrement();
    }
    if (evictedByEvictionProcess) {
      // When the eviction of the block happened because of invalidation of HFiles, no need to
      // update the stats counter.
      stats.evicted(block.getCachedTime(), block.getCacheKey().isPrimary());
      if (victimHandler != null) {
        victimHandler.cacheBlock(block.getCacheKey(), block.getBuffer());
      }
    }
    return block.heapSize();
  }
{code}
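
The reordering could look roughly like the sketch below. All types here 
(EvictableBuf, VictimHandler, EvictSketch) are hypothetical stand-ins rather 
than the real LruBlockCache classes, and metrics are omitted. The point is 
only the ordering: the victim handler sees the buffer while it is still 
alive, and the release happens last.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reference-counted buffer.
class EvictableBuf {
  private final AtomicInteger refCnt = new AtomicInteger(1);

  void release() {
    refCnt.decrementAndGet();
  }

  int refCnt() {
    return refCnt.get();
  }

  // Any access after the count reached zero is an error.
  void checkAccessible() {
    if (refCnt.get() <= 0) {
      throw new IllegalStateException("access with refCnt=0");
    }
  }
}

// Hypothetical second-level cache that receives evicted blocks.
interface VictimHandler {
  void cacheBlock(String key, EvictableBuf buf);
}

class EvictSketch {
  final ConcurrentHashMap<String, EvictableBuf> map = new ConcurrentHashMap<>();
  VictimHandler victimHandler;

  boolean evictBlock(String key, boolean evictedByEvictionProcess) {
    EvictableBuf previous = map.remove(key);
    if (previous == null) {
      return false;
    }
    try {
      if (evictedByEvictionProcess && victimHandler != null) {
        // The buffer is still retained by the cache at this point.
        victimHandler.cacheBlock(key, previous);
      }
    } finally {
      // Release last, so nothing above touches a deallocated buffer.
      previous.release();
    }
    return true;
  }
}
```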



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-15 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840167#comment-16840167
 ] 

Zheng Hu commented on HBASE-22422:
--

The following code may also have a problem (HFileReaderImpl#readBlock): 
{code}
    // Load block from filesystem.
    HFileBlock hfileBlock = fsBlockReader.readBlockData(dataBlockOffset, onDiskBlockSize, pread,
      !isCompaction, shouldUseHeap(expectedBlockType));
    validateBlockType(hfileBlock, expectedBlockType);
    HFileBlock unpacked = hfileBlock.unpack(hfileContext, fsBlockReader);
    BlockType.BlockCategory category = hfileBlock.getBlockType().getCategory();

    // Cache the block if necessary
    AtomicBoolean cachedRaw = new AtomicBoolean(false);
    cacheConf.getBlockCache().ifPresent(cache -> {
      if (cacheBlock && cacheConf.shouldCacheBlockOnRead(category)) {
        cachedRaw.set(cacheConf.shouldCacheCompressed(category));
        cache.cacheBlock(cacheKey, cachedRaw.get() ? hfileBlock : unpacked,
          cacheConf.isInMemory());
      }
    });
    if (unpacked != hfileBlock && !cachedRaw.get()) {
      // End of life here if hfileBlock is an independent block.
      hfileBlock.release();
    }
{code}



[jira] [Commented] (HBASE-22422) Retain an ByteBuff with refCnt=0 when getBlock from LRUCache

2019-05-15 Thread Zheng Hu (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-22422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840130#comment-16840130
 ] 

Zheng Hu commented on HBASE-22422:
--

A potential cause is here: 
https://issues.apache.org/jira/secure/attachment/12968762/LRUBlockCache-getBlock.png
1. Get the block from the map first;
2. Retain the block. 
If a release that drops the refCnt to zero happens between step 1 and step 2, 
then we'll get the exception which says we are retaining a block with refCnt=0.
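
The race, and one possible way to close it, can be sketched as below. This is 
an illustration under assumptions, not the actual patch: RefCountedBlock and 
SketchCache are hypothetical, and IllegalStateException stands in for 
IllegalReferenceCountException. Doing the retain inside 
ConcurrentHashMap#computeIfPresent makes the lookup and the retain atomic with 
respect to a concurrent remove of the same key.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a reference-counted cached block.
class RefCountedBlock {
  private final AtomicInteger refCnt = new AtomicInteger(1);

  void retain() {
    while (true) {
      int cur = refCnt.get();
      if (cur <= 0) {
        throw new IllegalStateException("retain with refCnt=0");
      }
      if (refCnt.compareAndSet(cur, cur + 1)) {
        return;
      }
    }
  }

  // Returns true when this call dropped the count to zero.
  boolean release() {
    return refCnt.decrementAndGet() == 0;
  }

  int refCnt() {
    return refCnt.get();
  }
}

// Hypothetical cache, not the real LruBlockCache.
class SketchCache {
  private final ConcurrentHashMap<String, RefCountedBlock> map = new ConcurrentHashMap<>();

  void put(String key, RefCountedBlock block) {
    map.put(key, block);
  }

  // Racy: an eviction may release the block to zero between step 1 and step 2.
  RefCountedBlock getBlockRacy(String key) {
    RefCountedBlock b = map.get(key); // step 1
    if (b != null) {
      b.retain();                     // step 2
    }
    return b;
  }

  // The retain runs inside the map's per-key atomic section, so evict() either
  // removes the entry before we see it or releases only after we have retained.
  RefCountedBlock getBlockAtomic(String key) {
    return map.computeIfPresent(key, (k, b) -> {
      b.retain();
      return b;
    });
  }

  boolean evict(String key) {
    RefCountedBlock b = map.remove(key);
    if (b == null) {
      return false;
    }
    b.release();
    return true;
  }
}
```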
