[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573900#comment-14573900 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

{quote}
Rather than add a new method that does what the old getEarliestMemstoreSeqNum did, I changed getEarliestMemstoreSeqNum to be how the old version worked.
{quote}

Fine, I think it will work. But I still feel a little nervous about having two methods with the same name but different behaviors...

And I remember that when implementing HBASE-10201 and HBASE-12405, I originally wanted startCacheFlush to return the flushedSeqId. But there are two problems. First, the getNextSequenceId method is in HRegion, not in FSHLog, so a simple solution is to return NO_SEQ_NUM when flushing all stores and let HRegion call getNextSequenceId. But then comes the second problem: startCacheFlush may fail, which means we cannot start a flush, so there are three types of return values: 'sequenceId', 'choose a sequenceId by yourself', and 'give up flushing!'. I thought it was ugly to use a '-2' or a null java.lang.Long to indicate 'give up flushing' at the time, so I gave up... Maybe we could consider this solution again? getEarliestMemstoreSeqNum can be used everywhere, but startCacheFlush is restricted to the flushing scope, I think. Thanks.

Splitting WALs, we are filtering out too many edits - DATALOSS
--------------------------------------------------------------

Key: HBASE-13811
URL: https://issues.apache.org/jira/browse/HBASE-13811
Project: HBase
Issue Type: Bug
Components: wal
Affects Versions: 2.0.0, 1.2.0
Reporter: stack
Assignee: stack
Priority: Critical
Fix For: 2.0.0, 1.2.0
Attachments: 13811.branch-1.txt, 13811.branch-1.txt, 13811.txt, 13811.v2.branch-1.txt, 13811.v3.branch-1.txt, 13811.v3.branch-1.txt, 13811.v4.branch-1.txt, 13811.v5.branch-1.txt, 13811.v6.branch-1.txt, 13811.v6.branch-1.txt, HBASE-13811-v1.testcase.patch, HBASE-13811.testcase.patch

I've been running ITBLLs against branch-1 around HBASE-13616 (move of ServerShutdownHandler to pv2). I have come across an instance of dataloss. My patch for HBASE-13616 was in place, so I can only think it is the cause (but cannot see how). When we split the logs, we are skipping legit edits. Digging.
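To make the three-way result concrete, here is a minimal sketch of how it could be modeled without a magic '-2' or a null java.lang.Long. All names here (FlushPrepareResult, Outcome) are illustrative, not part of the actual HBase WAL interface:

{code}
// Illustrative only: models the three outcomes described above without
// overloading a single Long with magic values.
public class StartCacheFlushSketch {

  enum Outcome { SEQUENCE_ID, CHOOSE_YOURSELF, GIVE_UP }

  static final class FlushPrepareResult {
    final Outcome outcome;
    final long sequenceId; // meaningful only when outcome == SEQUENCE_ID

    private FlushPrepareResult(Outcome outcome, long sequenceId) {
      this.outcome = outcome;
      this.sequenceId = sequenceId;
    }

    static FlushPrepareResult of(long seqId) {
      return new FlushPrepareResult(Outcome.SEQUENCE_ID, seqId);
    }

    static FlushPrepareResult chooseYourself() {
      return new FlushPrepareResult(Outcome.CHOOSE_YOURSELF, -1L);
    }

    static FlushPrepareResult giveUp() {
      return new FlushPrepareResult(Outcome.GIVE_UP, -1L);
    }
  }

  public static void main(String[] args) {
    FlushPrepareResult r = FlushPrepareResult.chooseYourself();
    switch (r.outcome) {
      case SEQUENCE_ID:
        System.out.println("use flushedSeqId=" + r.sequenceId);
        break;
      case CHOOSE_YOURSELF:
        System.out.println("caller picks a sequenceId via getNextSequenceId");
        break;
      case GIVE_UP:
        System.out.println("abort this flush attempt");
        break;
    }
  }
}
{code}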
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571917#comment-14571917 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

{code}
flushOpSeqId = getNextSequenceId(wal);
flushedSeqId = getFlushedSequenceId(encodedRegionName, flushOpSeqId);
{code}

I think the problem is here... Before the patch, getFlushedSequenceId would return flushOpSeqId, because getEarliestMemstoreSeqNum would return HConstants.NO_SEQNUM. But now that we have modified getEarliestMemstoreSeqNum, it also considers the sequenceIds recorded in flushingSequenceIds, so it will not return HConstants.NO_SEQNUM even if we decided to flush all stores. This may cause us to replay unnecessary edits, I think. The problem is that here we only want to consider the sequenceIds in lowestUnflushedSequenceIds, so maybe a new method? Thanks. [~stack]
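A toy model of the two maps named above may make the behavior change clearer. The map names follow the comment; the real bookkeeping lives in FSHLog and is more involved than this:

{code}
import java.util.HashMap;
import java.util.Map;

// Toy model of the before/after behavior of getEarliestMemstoreSeqNum.
public class SeqIdAccountingToy {
  static final long NO_SEQNUM = -1L; // stands in for HConstants.NO_SEQNUM

  final Map<String, Long> lowestUnflushedSequenceIds = new HashMap<>();
  final Map<String, Long> flushingSequenceIds = new HashMap<>();

  // Old behavior: only not-yet-flushing edits are consulted, so a region
  // whose whole memstore is mid-flush reports NO_SEQNUM.
  long getEarliestMemstoreSeqNumOld(String region) {
    Long v = lowestUnflushedSequenceIds.get(region);
    return v == null ? NO_SEQNUM : v;
  }

  // New behavior: sequence ids parked in flushingSequenceIds count too, so
  // the same region reports a real (lower) id while a full-region flush is
  // in progress, and getFlushedSequenceId picks a value that is too small,
  // leading to unnecessary replay.
  long getEarliestMemstoreSeqNumNew(String region) {
    Long a = lowestUnflushedSequenceIds.get(region);
    Long b = flushingSequenceIds.get(region);
    if (a == null) {
      return b == null ? NO_SEQNUM : b;
    }
    return b == null ? a : Math.min(a, b);
  }

  public static void main(String[] args) {
    SeqIdAccountingToy toy = new SeqIdAccountingToy();
    toy.flushingSequenceIds.put("region1", 100L); // whole memstore mid-flush
    System.out.println(toy.getEarliestMemstoreSeqNumOld("region1")); // -1 (NO_SEQNUM)
    System.out.println(toy.getEarliestMemstoreSeqNumNew("region1")); // 100
  }
}
{code}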
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574311#comment-14574311 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

{quote}
We can actually tell when we are flushing if we should do all of the region, right? If the passed in families are null or equal in number to region stores, we are doing a full region flush so we should use the flush sequence id, the result of the getNextSequenceId call. Otherwise, we want the getEarliest for the region because we are doing a column family only flush...
{quote}

If the stores that we do not flush are all empty, then we should still use the flush sequence id returned by getNextSequenceId (getEarliest should return NO_SEQ_NUM in this case).
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574227#comment-14574227 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

{quote}
Pardon me, I don't see the problem here?
{quote}

Not a big problem... The only thing is that I do not like java.lang.Long... I can prepare a patch to show how to implement it.

{quote}
What do you mean by 'startCacheFlush is restricted'.
{quote}

'startCacheFlush' is an exact name that tells us what it will do, and I think people will only call it when they want to flush a region, so changing its behavior does less harm. 'getEarliestMemstoreSeqNum' is a more general name; people can call it from anywhere when they want the value.
[jira] [Updated] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-13811:
------------------------------

Attachment: startCacheFlush.diff

Implement flushedSeqId calculation inside startCacheFlush.
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575398#comment-14575398 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

+1 on the latest version. Hope it can pass ITBLL.
[jira] [Commented] (HBASE-13853) ITBLL data loss in HBase-1.1
[ https://issues.apache.org/jira/browse/HBASE-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576573#comment-14576573 ]

Duo Zhang commented on HBASE-13853:
-----------------------------------

HBASE-13811 ?

ITBLL data loss in HBase-1.1
----------------------------

Key: HBASE-13853
URL: https://issues.apache.org/jira/browse/HBASE-13853
Project: HBase
Issue Type: Bug
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Priority: Blocker
Fix For: 2.0.0, 1.1.1

After 2 days of log mining with the ITBLL Search tool (thanks, Stack, BTW, for the tool), the sequence of events is like this: One of the mappers from Search reportedly finds missing keys in some WAL file:

{code}
2015-06-07 06:49:24,573 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-4.openstacklocal%2C16020%2C1433636320201.default.1433637872643 (-9223372036854775808:9223372036854775807) length:132679657
2015-06-07 06:49:24,639 INFO [main] org.apache.hadoop.hbase.mapreduce.WALInputFormat: Opening reader for hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-4.openstacklocal%2C16020%2C1433636320201.default.1433637872643 (-9223372036854775808:9223372036854775807) length:132679657
2015-06-07 06:50:16,383 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Loaded keys to find: count=2384870
2015-06-07 06:50:16,607 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC/meta:prev/1433637873293/Put/vlen=16/seqid=0
2015-06-07 06:50:16,663 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC/meta:count/1433637873293/Put/vlen=8/seqid=0
2015-06-07 06:50:16,664 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC/meta:client/1433637873293/Put/vlen=71/seqid=0
2015-06-07 06:50:16,671 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\xFB\xCEP\x12\x04/meta:prev/1433637873293/Put/vlen=16/seqid=0
2015-06-07 06:50:16,672 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\xFB\xCEP\x12\x04/meta:count/1433637873293/Put/vlen=8/seqid=0
2015-06-07 06:50:16,678 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\xFB\xCEP\x12\x04/meta:client/1433637873293/Put/vlen=71/seqid=0
2015-06-07 06:50:16,679 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=\x1F\x80#\x95\xAE:i=)S\x01\x08 \xD6\xFF\xD5/meta:prev/1433637873293/Put/vlen=16/seqid=0
{code}

{{hlog -p}} confirms the missing keys are there in the file:

{code}
Sequence=7276 , region=b086e29f909c9790446cd457c1ea7674 at write timestamp=Sun Jun 07 00:44:33 UTC 2015
row=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=meta:prev value: \x1B\xF5\xF3^\x8D\xB4\x85\xE3\xF4wS\x9A]\x0D\xABe
row=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=meta:count value: \x00\x00\x00\x00\x002\x87l
row=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=meta:client value: Job: job_1433466891829_0002 Task: attempt_1433466891829_0002_m_03_0
{code}

When the RS gets killed by CM, the master does WAL splitting:

{code}
./hbase-hbase-regionserver-os-enis-dal-test-jun-4-2.log:2015-06-07 00:46:12,581 INFO [RS_LOG_REPLAY_OPS-os-enis-dal-test-jun-4-2:16020-0] wal.WALSplitter: Processed 2971 edits across 4 regions; edits skipped=740; log file=hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-4.openstacklocal,16020,1433636320201-splitting/os-enis-dal-test-jun-4-4.openstacklocal%2C16020%2C1433636320201.default.1433637872643, length=132679657, corrupted=false, progress failed=false
{code}

The edits with Sequence=7276 should be in this recovered.edits file:

{code}
2015-06-07 00:46:12,574 INFO [split-log-closeStream-2] wal.WALSplitter: Closed wap hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674/recovered.edits/0007276.temp (wrote 739 edits in 3950ms)
2015-06-07 00:46:12,580 INFO [split-log-closeStream-2] wal.WALSplitter: Rename hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674/recovered.edits/0007276.temp to
{code}
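As an aside on the file name in that last log line: the recovered.edits file name appears to encode the sequence id (7276) zero-padded, with a .temp suffix while the writer is open. A sketch of that shape (the padding width is inferred from the log line itself, not taken from the WALSplitter source):

{code}
// Sketch only: reproduces the "0007276.temp" shape seen in the log above.
public class RecoveredEditsName {
  static String tempName(long firstSeqId) {
    return String.format("%07d.temp", firstSeqId);
  }

  public static void main(String[] args) {
    System.out.println(tempName(7276L)); // prints 0007276.temp
  }
}
{code}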
[jira] [Commented] (HBASE-13853) ITBLL data loss in HBase-1.1
[ https://issues.apache.org/jira/browse/HBASE-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576576#comment-14576576 ]

Duo Zhang commented on HBASE-13853:
-----------------------------------

And what are the failures with DLR? DLR doesn't get the sequence id from the master, I think; it gets the sequence id from ZooKeeper, which is uploaded by the regionserver that holds the region? Thanks.
[jira] [Commented] (HBASE-13853) ITBLL data loss in HBase-1.1
[ https://issues.apache.org/jira/browse/HBASE-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576585#comment-14576585 ]

Duo Zhang commented on HBASE-13853:
-----------------------------------

I indicated in HBASE-13811 that this issue could also cause data loss on branch-1.1 even if we turn off per-CF flush, and mentioned [~ndimiduk], but he has not replied yet...
[jira] [Commented] (HBASE-13853) ITBLL data loss in HBase-1.1
[ https://issues.apache.org/jira/browse/HBASE-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576598#comment-14576598 ]

Duo Zhang commented on HBASE-13853:
-----------------------------------

I think we could backport the patch on HBASE-13811 to branch-1.1 without the cleanup and abstraction of SequenceIdAccounting? We'd better keep the WAL interface the same if possible?
[jira] [Updated] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-13811:
------------------------------

Attachment: HBASE-13811-branch-1.1.patch

[~enis] patch for 1.1.
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570013#comment-14570013 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

My fault, I haven't been watching HBase JIRA these days... Thanks [~stack] for digging. +1 on the change in getEarliestMemstoreSeqNum, but I think we should add a unit test for it?
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572388#comment-14572388 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

{quote}
I just set it to the flush sequenceid
{quote}

I think this would fail TestPerColumnFamilyFlush.testLogReplayWithDistributedLogSplit? The logic here should be (see the sketch below):

1. If we flush all stores, or the unflushed stores are all empty, just use flushOpSeqId.
2. Otherwise, find the lowest seqId of the edits in the unflushed stores and subtract 1.

The original version of getEarliestMemstoreSeqNum fits this logic, but it causes the data loss described above. So I think we still need the original version of getEarliestMemstoreSeqNum, but we could rename it to a more suitable name? Maybe calcFlushedSeqId? Thanks.
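A compact sketch of the two cases, assuming the earliest unflushed sequence id has already been looked up (illustrative names, not the patch itself):

{code}
// NO_SEQNUM means "no unflushed edits", which also covers the case where
// the stores left unflushed are all empty.
static long computeFlushedSeqId(long earliestUnflushedSeqId, long flushOpSeqId) {
  final long NO_SEQNUM = -1L; // stands in for HConstants.NO_SEQNUM
  if (earliestUnflushedSeqId == NO_SEQNUM) {
    // Case 1: flushing everything (or the rest is empty): use flushOpSeqId.
    return flushOpSeqId;
  }
  // Case 2: edits remain in unflushed stores: report just below the lowest
  // unflushed sequence id so WAL split keeps those edits.
  return earliestUnflushedSeqId - 1;
}
{code}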
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572051#comment-14572051 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

If we flush all the data in the memstore, then using the flushId is reasonable? The flushedSeqId is calculated at the prepare stage of a flush, and is actually assigned to region.maxFlushedSeqId only after we successfully commit the flush. We will not report it to the master before that assignment happens, so I think it is safe?
[jira] [Commented] (HBASE-13877) Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
[ https://issues.apache.org/jira/browse/HBASE-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579783#comment-14579783 ]

Duo Zhang commented on HBASE-13877:
-----------------------------------

In MemStoreFlusher we catch DroppedSnapshotException and abort the regionserver, and the javadoc of Region.flush says that DroppedSnapshotException is thrown when abort is required. So I think it is the caller's duty to abort the regionserver? If we abort the server inside Region.flush, do we still need to throw DroppedSnapshotException out? Thanks.

Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
--------------------------------------------------------------------

Key: HBASE-13877
URL: https://issues.apache.org/jira/browse/HBASE-13877
Project: HBase
Issue Type: Bug
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Priority: Blocker
Fix For: 2.0.0, 1.2.0, 1.1.1
Attachments: hbase-13877_v1.patch, hbase-13877_v2-branch-1.1.patch

ITBLL with 1.25B rows failed for me (and Stack, as reported in https://issues.apache.org/jira/browse/HBASE-13811?focusedCommentId=14577834&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14577834). HBASE-13811 and HBASE-13853 fixed an issue with WAL edit filtering. The root cause this time seems to be different. It is due to the procedure-based flush interrupting the flush request in case the procedure is cancelled from an exception elsewhere. This leaves the memstore snapshot intact without aborting the server. The next flush then flushes the previous memstore with the current seqId (as opposed to the seqId from the memstore snapshot). This creates an hfile with a larger seqId than what its contents are. Previous behavior in 0.98 and 1.0 (I believe) is that an interruption/exception after flush prepare will cause an RS abort.
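The caller-side contract being described, sketched with stand-in interfaces (Region and Server here are simplified placeholders, not the exact HBase types):

{code}
// The contract argued for above: Region.flush signals "abort required" via
// DroppedSnapshotException, and the *caller* performs the abort.
class DroppedSnapshotException extends Exception {
  DroppedSnapshotException(String msg) { super(msg); }
}

interface Region {
  void flush(boolean forceAll) throws DroppedSnapshotException;
}

interface Server {
  void abort(String why, Throwable cause);
}

class FlushCaller {
  static void flushOrAbort(Region region, Server server) {
    try {
      region.flush(true);
    } catch (DroppedSnapshotException e) {
      // The memstore snapshot may no longer match the WAL; replaying the
      // WAL on a fresh server is the only safe recovery.
      server.abort("Replay of WAL required. Forcing server shutdown", e);
    }
  }
}
{code}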
[jira] [Commented] (HBASE-13877) Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
[ https://issues.apache.org/jira/browse/HBASE-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579870#comment-14579870 ]

Duo Zhang commented on HBASE-13877:
-----------------------------------

{code}
i've added extra abort() statement to safeguard against cases where the caller does not handle this exception explicitly.
{code}

Then this is the caller's fault? Why not fix it at the caller side? Or we could remove the abort-regionserver semantics of DroppedSnapshotException so the caller does not need to deal with it anymore? We'd better make the rule clear, otherwise people may get confused...
[jira] [Comment Edited] (HBASE-13877) Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
[ https://issues.apache.org/jira/browse/HBASE-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579870#comment-14579870 ]

Duo Zhang edited comment on HBASE-13877 at 6/10/15 2:16 AM:
------------------------------------------------------------

{quote}
i've added extra abort() statement to safeguard against cases where the caller does not handle this exception explicitly.
{quote}

Then this is the caller's fault? Why not fix it at the caller side? Or we could remove the abort-regionserver semantics of DroppedSnapshotException so the caller does not need to deal with it anymore? We'd better make the rule clear, otherwise people may get confused...

was (Author: apache9):
{code}
i've added extra abort() statement to safeguard against cases where the caller does not handle this exception explicitly.
{code}

Then this is the caller's fault? Why not fix it at the caller side? Or we could remove the abort-regionserver semantics of DroppedSnapshotException so the caller does not need to deal with it anymore? We'd better make the rule clear, otherwise people may get confused...
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594157#comment-14594157 ]

Duo Zhang commented on HBASE-13937:
-----------------------------------

{quote}
thus guaranteeing that the region server cannot accept any more writes.
{quote}

But still readable, right? Does HBase guarantee this level of consistency?

1. A writes a row to HBase.
2. A tells B to read the row.
3. B can read the row from HBase.

If not, then I think removing recoverLease is enough here. Otherwise, we still need to make sure that the regionserver cannot process any request. And I think this discussion should be in the parent issue, so +1 on patch v3 :)

Partially revert HBASE-13172
----------------------------

Key: HBASE-13937
URL: https://issues.apache.org/jira/browse/HBASE-13937
Project: HBase
Issue Type: Sub-task
Components: Region Assignment
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 0.98.14, 1.2.0, 1.1.1, 1.3.0
Attachments: hbase-13937_v1.patch, hbase-13937_v2.patch, hbase-13937_v3.patch

HBASE-13172 is supposed to fix a UT issue, but causes other problems that the parent jira (HBASE-13605) is attempting to fix. However, HBASE-13605 patch v4 uncovers at least 2 different issues which are, to put it mildly, major design flaws in AM / RS. Regardless of 13605, the issue with 13172 is that we catch {{ServerNotRunningYetException}} from {{isServerReachable()}} and return false, which then puts the server into the {{RegionStates.deadServers}} list. Once it is in that list, we can still assign and unassign regions to the RS after it has started (because regular assignment does not check whether the server is in {{RegionStates.deadServers}}). However, after the first assign and unassign, we cannot assign the region again, since the check for the lastServer will think that the server is dead. It turns out that a proper patch for 13605 is very hard without fixing the rest of the broken AM assumptions (see HBASE-13605, HBASE-13877 and HBASE-13895 for a colorful history). For 1.1.1, I think we should just revert parts of HBASE-13172 for now.
[jira] [Updated] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-13970:
------------------------------

Attachment: HBASE-13970.patch

A simple patch that adds an incrementing AtomicInteger as the suffix of the compaction name (see the sketch below). [~ram_krish] Could you please test whether this patch works? Thanks. And also, thanks for the great digging.

NPE during compaction in trunk
------------------------------

Key: HBASE-13970
URL: https://issues.apache.org/jira/browse/HBASE-13970
Project: HBase
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
Fix For: 2.0.0
Attachments: HBASE-13970.patch

Updated trunk. Loaded the table with the PE tool. Triggered a flush to ensure all data is flushed out to disk. When the first compaction is triggered we get an NPE, and this is very easy to reproduce:

{code}
2015-06-25 21:33:46,041 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired
2015-06-25 21:33:46,051 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Flushing 1/1 column families, memstore=76.91 MB
2015-06-25 21:33:46,159 ERROR [regionserver/stobdtserver3/10.224.54.70:16040-longCompactions-1435248183945] regionserver.CompactSplitThread: Compaction failed Request = regionName=TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4., storeName=info, fileCount=3, fileSize=343.4 M (114.5 M, 114.5 M, 114.5 M), priority=3, time=7536968291719985
java.lang.NullPointerException
	at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController$ActiveCompaction.access$700(PressureAwareCompactionThroughputController.java:79)
	at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController.finish(PressureAwareCompactionThroughputController.java:238)
	at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:306)
	at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:106)
	at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:112)
	at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1202)
	at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1792)
	at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:524)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
2015-06-25 21:33:46,745 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=1534, memsize=76.9 M, hasBloomFilter=true, into tmp file hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/.tmp/942ba0831a0047a08987439e34361a0c
2015-06-25 21:33:46,772 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HStore: Added hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/info/942ba0831a0047a08987439e34361a0c, entries=68116, sequenceid=1534, filesize=68.7 M
2015-06-25 21:33:46,773 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Finished memstore flush of ~76.91 MB/80649344, currentsize=0 B/0 for region TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4. in 723ms, sequenceid=1534, compaction requested=true
2015-06-25 21:33:46,780 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/reached/TestTable
2015-06-25 21:33:46,790 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/TestTable
2015-06-25 21:33:46,791 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort
2015-06-25 21:33:46,803 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired
2015-06-25 21:33:46,818 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort
{code}

Will check what the reason behind it is.
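A sketch of the counter idea (class and method names here are illustrative; the real patch wires the counter into where the compaction's bookkeeping name is generated):

{code}
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: region + store alone can collide when two compactions of the
// same store run concurrently; an incrementing suffix keeps names unique.
public class CompactionNameSketch {
  private static final AtomicInteger NAME_COUNTER = new AtomicInteger(0);

  static String generateCompactionName(String regionName, String storeName) {
    return regionName + "#" + storeName + "#" + NAME_COUNTER.getAndIncrement();
  }

  public static void main(String[] args) {
    // Two concurrent compactions of the same store now get distinct names.
    System.out.println(generateCompactionName("TestTable,283887", "info"));
    System.out.println(generateCompactionName("TestTable,283887", "info"));
  }
}
{code}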
[jira] [Commented] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601295#comment-14601295 ]

Duo Zhang commented on HBASE-13970:
-----------------------------------

Oh, this should be a bug on all branches that contain HBASE-8329. I assumed that compactions could not be executed in parallel on the same store, so the compactionName only contains the regionName and storeName. I think we could add a counter to the name to avoid the conflict.
[jira] [Updated] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-13970:
------------------------------

Fix Version/s: 1.1.2
               1.2.0
               0.98.14
Affects Version/s: 1.1.1
                   1.2.0
                   0.98.13
Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13970: -- Attachment: HBASE-13970-v1.patch Overflow is OK, as [~anoopsamjohn] said: we only need unique Strings. But yes, a negative counter is a bit ugly. In the new patch I rotate the value back to 0 once we reach Integer.MAX_VALUE. Thanks. NPE during compaction in trunk -- Key: HBASE-13970 URL: https://issues.apache.org/jira/browse/HBASE-13970 Project: HBase Issue Type: Bug Affects Versions: 2.0.0, 0.98.13, 1.2.0, 1.1.1 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 2.0.0, 0.98.14, 1.2.0, 1.1.2 Attachments: HBASE-13970-v1.patch, HBASE-13970.patch Updated the trunk.. Loaded the table with PE tool. Trigger a flush to ensure all data is flushed out to disk. When the first compaction is triggered we get an NPE and this is very easy to reproduce {code} 015-06-25 21:33:46,041 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,051 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Flushing 1/1 column families, memstore=76.91 MB 2015-06-25 21:33:46,159 ERROR [regionserver/stobdtserver3/10.224.54.70:16040-longCompactions-1435248183945] regionserver.CompactSplitThread: Compaction failed Request = regionName=TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4., storeName=info, fileCount=3, fileSize=343.4 M (114.5 M, 114.5 M, 114.5 M), priority=3, time=7536968291719985 java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController$ActiveCompaction.access$700(PressureAwareCompactionThroughputController.java:79) at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController.finish(PressureAwareCompactionThroughputController.java:238) at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:306) at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:106) at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:112) at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1202) at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1792) at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:524) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-25 21:33:46,745 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=1534, memsize=76.9 M, hasBloomFilter=true, into tmp file hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/.tmp/942ba0831a0047a08987439e34361a0c 2015-06-25 21:33:46,772 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HStore: Added hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/info/942ba0831a0047a08987439e34361a0c, entries=68116, sequenceid=1534, filesize=68.7 M 2015-06-25 21:33:46,773 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Finished memstore flush of ~76.91
MB/80649344, currentsize=0 B/0 for region TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4. in 723ms, sequenceid=1534, compaction requested=true 2015-06-25 21:33:46,780 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/reached/TestTable 2015-06-25 21:33:46,790 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/TestTable 2015-06-25 21:33:46,791 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort 2015-06-25 21:33:46,803 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,818 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort {code} Will check this on what is the reason behind it. -- This message was
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592928#comment-14592928 ] Duo Zhang commented on HBASE-13937: --- {quote} In theory we should not have isServerReachable() at all. It is a parallel cluster membership mechanism to the already existing zk based one {quote} I'm not familiar with the code of AM and SM, but I think zk alone is not enough to keep things consistent. The disappearance of the EPHEMERAL node on zookeeper does not mean the server is really dead, so we still need a method like isServerReachable to decide whether it is. One example is HDFS HA, where fencing is needed before transitioning the standby namenode to active. {quote} If we get a connection exception or smt, we cannot assume the server is dead. {quote} Agree (and excuse me, what is smt?). In general, only when the server itself tells us it is dead can we be sure it is dead. And after googling I think connection refused is also not that reliable (a firewall or a full backlog queue can also cause connection refused). So I think we need fencing like HDFS HA does? Yeah, you may challenge that if the machine has crashed, how can we be sure the server is dead... Honestly I do not have a perfect solution. Maybe we could introduce a DeadServerManager with several levels of consistency: the lowest level does no fencing at all, the medium level does fencing with a timeout, and the highest level fences forever (until a person stops it, maybe). Thanks. Partially revert HBASE-13172 - Key: HBASE-13937 URL: https://issues.apache.org/jira/browse/HBASE-13937 Project: HBase Issue Type: Sub-task Components: Region Assignment Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 0.98.14, 1.2.0, 1.1.1, 1.3.0 Attachments: hbase-13937_v1.patch, hbase-13937_v2.patch HBASE-13172 is supposed to fix a UT issue, but causes other problems that parent jira (HBASE-13605) is attempting to fix. However, HBASE-13605 patch v4 uncovers at least 2 different issues which are, to put it mildly, major design flaws in AM / RS. Regardless of 13605, the issue with 13172 is that we catch {{ServerNotRunningYetException}} from {{isServerReachable()}} and return false, which then puts the Server to the {{RegionStates.deadServers}} list. Once it is in that list, we can still assign and unassign regions to the RS after it has started (because regular assignment does not check whether the server is in {{RegionStates.deadServers}}. However, after the first assign and unassign, we cannot assign the region again since then the check for the lastServer will think that the server is dead. It turns out that a proper patch for 13605 is very hard without fixing rest of broken AM assumptions (see HBASE-13605, HBASE-13877 and HBASE-13895 for a colorful history). For 1.1.1, I think we should just revert parts of HBASE-13172 for now. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592814#comment-14592814 ] Duo Zhang commented on HBASE-13937: --- For isServerReachable, I think we can say a server is 'dead' if one of the following conditions is satisfied: 1. The server tells us it is dead (I'm not sure whether a RegionServerStoppedException is enough; maybe it is thrown before the regionserver completely shuts down?) 2. It is a server with another start code. 3. We get a connection refused (not a connect timeout). So I think removing the code is reasonable, but then we should catch a connection-refused exception (does our rpc framework throw this exception out? And we do not need to retry here...). Otherwise, if we do not restart a regionserver, we will be stuck in the loop for a long time... And I also do not think it is safe to return 'not reachable' on a timeout... Thanks. Partially revert HBASE-13172 - Key: HBASE-13937 URL: https://issues.apache.org/jira/browse/HBASE-13937 Project: HBase Issue Type: Sub-task Components: Region Assignment Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 0.98.14, 1.2.0, 1.1.1, 1.3.0 Attachments: hbase-13937_v1.patch HBASE-13172 is supposed to fix a UT issue, but causes other problems that parent jira (HBASE-13605) is attempting to fix. However, HBASE-13605 patch v4 uncovers at least 2 different issues which are, to put it mildly, major design flaws in AM / RS. Regardless of 13605, the issue with 13172 is that we catch {{ServerNotRunningYetException}} from {{isServerReachable()}} and return false, which then puts the Server to the {{RegionStates.deadServers}} list. Once it is in that list, we can still assign and unassign regions to the RS after it has started (because regular assignment does not check whether the server is in {{RegionStates.deadServers}}. However, after the first assign and unassign, we cannot assign the region again since then the check for the lastServer will think that the server is dead. It turns out that a proper patch for 13605 is very hard without fixing rest of broken AM assumptions (see HBASE-13605, HBASE-13877 and HBASE-13895 for a colorful history). For 1.1.1, I think we should just revert parts of HBASE-13172 for now. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
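For illustration only, the three conditions above might be classified roughly as follows. The class, method, and parameters are hypothetical, not actual AssignmentManager code, and condition 3 carries the firewall/backlog caveats discussed in the later comment:
{code}
import java.net.ConnectException;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.regionserver.RegionServerStoppedException;

// Hypothetical sketch of the "definitely dead" checks listed above.
final class DeadServerCheck {
  static boolean isDefinitelyDead(Throwable error, ServerName expected, ServerName actual) {
    if (error instanceof RegionServerStoppedException) {
      return true; // 1. the server itself told us it is stopped
    }
    if (actual != null && actual.getStartcode() != expected.getStartcode()) {
      return true; // 2. a server with another start code answered
    }
    if (error instanceof ConnectException) {
      return true; // 3. connection refused (not a connect timeout)
    }
    return false; // anything else, e.g. a timeout, proves nothing
  }
}
{code}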
[jira] [Commented] (HBASE-13899) Jacoco instrumentation fails under jdk8
[ https://issues.apache.org/jira/browse/HBASE-13899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585042#comment-14585042 ] Duo Zhang commented on HBASE-13899: --- Seems our jacoco version is really old {code} <jacoco.version>0.6.2.201302030002</jacoco.version> {code} It was released in Feb 2013; the newest version is now 0.7.5. I changed the mvn command to {noformat} install -PrunAllTests -Dmaven.test.redirectTestOutputToFile=true -Dit.test=noItTest -Dhbase.skip-jacoco=false -Djacoco.version=0.7.5.201505241946 {noformat} and it seems to have worked. Jacoco instrumentation fails under jdk8 --- Key: HBASE-13899 URL: https://issues.apache.org/jira/browse/HBASE-13899 Project: HBase Issue Type: Sub-task Components: build, test Affects Versions: 1.2.0 Reporter: Sean Busbey Labels: coverage Fix For: 2.0.0, 1.2.0 Moving the post-commit build for master to also cover jdk8 shows failures when attempting to instrument test for jacoco coverage. example: {code} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:386) at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:401) Caused by: java.lang.RuntimeException: Class java/util/UUID could not be instrumented. at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:138) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:99) at org.jacoco.agent.rt.internal_5d10cad.PreMain.createRuntime(PreMain.java:51) at org.jacoco.agent.rt.internal_5d10cad.PreMain.premain(PreMain.java:43) ... 6 more Caused by: java.lang.NoSuchFieldException: $jacocoAccess at java.lang.Class.getField(Class.java:1695) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:136) ... 9 more FATAL ERROR in native method: processing of -javaagent failed Aborted {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13899) Jacoco instrumentation fails under jdk8
[ https://issues.apache.org/jira/browse/HBASE-13899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13899: -- Assignee: Duo Zhang Affects Version/s: 2.0.0 Status: Patch Available (was: Open) Jacoco instrumentation fails under jdk8 --- Key: HBASE-13899 URL: https://issues.apache.org/jira/browse/HBASE-13899 Project: HBase Issue Type: Sub-task Components: build, test Affects Versions: 2.0.0, 1.2.0 Reporter: Sean Busbey Assignee: Duo Zhang Labels: coverage Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13899.patch Moving the post-commit build for master to also cover jdk8 shows failures when attempting to instrument test for jacoco coverage. example: {code} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:386) at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:401) Caused by: java.lang.RuntimeException: Class java/util/UUID could not be instrumented. at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:138) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:99) at org.jacoco.agent.rt.internal_5d10cad.PreMain.createRuntime(PreMain.java:51) at org.jacoco.agent.rt.internal_5d10cad.PreMain.premain(PreMain.java:43) ... 6 more Caused by: java.lang.NoSuchFieldException: $jacocoAccess at java.lang.Class.getField(Class.java:1695) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:136) ... 9 more FATAL ERROR in native method: processing of -javaagent failed Aborted {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13899) Jacoco instrumentation fails under jdk8
[ https://issues.apache.org/jira/browse/HBASE-13899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13899: -- Attachment: HBASE-13899.patch One line patch. Jacoco instrumentation fails under jdk8 --- Key: HBASE-13899 URL: https://issues.apache.org/jira/browse/HBASE-13899 Project: HBase Issue Type: Sub-task Components: build, test Affects Versions: 1.2.0 Reporter: Sean Busbey Labels: coverage Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13899.patch Moving the post-commit build for master to also cover jdk8 shows failures when attempting to instrument test for jacoco coverage. example: {code} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:386) at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:401) Caused by: java.lang.RuntimeException: Class java/util/UUID could not be instrumented. at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:138) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:99) at org.jacoco.agent.rt.internal_5d10cad.PreMain.createRuntime(PreMain.java:51) at org.jacoco.agent.rt.internal_5d10cad.PreMain.premain(PreMain.java:43) ... 6 more Caused by: java.lang.NoSuchFieldException: $jacocoAccess at java.lang.Class.getField(Class.java:1695) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:136) ... 9 more FATAL ERROR in native method: processing of -javaagent failed Aborted {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
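Presumably the one-liner just bumps the {{jacoco.version}} property shown in the earlier comment to the version that worked with jdk8; the exact location of the property in the pom is an assumption:
{code}
<!-- hypothetical sketch of the one-line change: bump the jacoco property -->
<jacoco.version>0.7.5.201505241946</jacoco.version>
{code}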
[jira] [Updated] (HBASE-13899) Jacoco instrumentation fails under jdk8
[ https://issues.apache.org/jira/browse/HBASE-13899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13899: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Pushed to master and branch-1. Thanks [~busbey]. Jacoco instrumentation fails under jdk8 --- Key: HBASE-13899 URL: https://issues.apache.org/jira/browse/HBASE-13899 Project: HBase Issue Type: Sub-task Components: build, test Affects Versions: 2.0.0, 1.2.0 Reporter: Sean Busbey Assignee: Duo Zhang Labels: coverage Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13899.patch Moving the post-commit build for master to also cover jdk8 shows failures when attempting to instrument test for jacoco coverage. example: {code} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:386) at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:401) Caused by: java.lang.RuntimeException: Class java/util/UUID could not be instrumented. at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:138) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:99) at org.jacoco.agent.rt.internal_5d10cad.PreMain.createRuntime(PreMain.java:51) at org.jacoco.agent.rt.internal_5d10cad.PreMain.premain(PreMain.java:43) ... 6 more Caused by: java.lang.NoSuchFieldException: $jacocoAccess at java.lang.Class.getField(Class.java:1695) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:136) ... 9 more FATAL ERROR in native method: processing of -javaagent failed Aborted {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13903) Speedup IdLock
[ https://issues.apache.org/jira/browse/HBASE-13903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587266#comment-14587266 ] Duo Zhang commented on HBASE-13903: --- What about just using a StripedLock and giving each stripe an individual HashMap? ConcurrentHashMap is fast because it does not require locking when getting an entry from the map most of the time, but here we use putIfAbsent, which always locks a segment. And for the profiling result, is it lots of 'time' or lots of 'cpu' spent in IdLock? This is important. If we spend lots of time waiting on a lock, then the problem is contention, not the implementation of IdLock. Thanks. Speedup IdLock -- Key: HBASE-13903 URL: https://issues.apache.org/jira/browse/HBASE-13903 Project: HBase Issue Type: Improvement Components: regionserver Affects Versions: 2.0.0, 1.0.1, 1.1.0, 0.98.13 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Minor Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13903-v0.patch, IdLockPerf.java while testing the read path, I ended up with the profiler showing a lot of time spent in IdLock. The IdLock is used by the HFileReader and the BucketCache, so you'll see a lot of it when you have an hotspot on a hfile. we end up locked by that synchronized() and with too many calls to map.putIfAbsent() {code} public Entry getLockEntry(long id) throws IOException { while ((existing = map.putIfAbsent(entry.id, entry)) != null) { synchronized (existing) { ... } // If the entry is not locked, it might already be deleted from the // map, so we cannot return it. We need to get our entry into the map // or get someone else's locked entry. } } public void releaseLockEntry(Entry entry) { synchronized (entry) { ... } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
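A minimal sketch of the striped-lock idea, assuming a power-of-two stripe count; names are illustrative and this is not a drop-in IdLock replacement:
{code}
import java.util.concurrent.locks.ReentrantLock;

// Sketch: one plain lock per stripe. Ids hashing to different stripes never
// contend, and no ConcurrentHashMap.putIfAbsent is involved at all.
class StripedIdLock {
  private final ReentrantLock[] stripes;

  StripedIdLock(int stripeCount) { // must be a power of two for the mask below
    stripes = new ReentrantLock[stripeCount];
    for (int i = 0; i < stripeCount; i++) {
      stripes[i] = new ReentrantLock();
    }
  }

  private ReentrantLock stripeFor(long id) {
    int h = (int) (id ^ (id >>> 32)); // fold the long id into an int hash
    return stripes[h & (stripes.length - 1)];
  }

  void runLocked(long id, Runnable action) {
    ReentrantLock lock = stripeFor(id);
    lock.lock();
    try {
      action.run(); // only ids mapped to the same stripe wait here
    } finally {
      lock.unlock();
    }
  }
}
{code}
Each stripe could also own its own plain HashMap, as suggested, since access to that map would already be serialized by the stripe's lock.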
[jira] [Commented] (HBASE-13877) Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
[ https://issues.apache.org/jira/browse/HBASE-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582742#comment-14582742 ] Duo Zhang commented on HBASE-13877: --- +1 on patch v3. Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL Key: HBASE-13877 URL: https://issues.apache.org/jira/browse/HBASE-13877 Project: HBase Issue Type: Bug Reporter: Enis Soztutar Assignee: Enis Soztutar Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: hbase-13877_v1.patch, hbase-13877_v2-branch-1.1.patch, hbase-13877_v3-branch-1.1.patch ITBLL with 1.25B rows failed for me (and Stack as reported in https://issues.apache.org/jira/browse/HBASE-13811?focusedCommentId=14577834page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14577834) HBASE-13811 and HBASE-13853 fixed an issue with WAL edit filtering. The root cause this time seems to be different. It is due to procedure based flush interrupting the flush request in case the procedure is cancelled from an exception elsewhere. This leaves the memstore snapshot intact without aborting the server. The next flush, then flushes the previous memstore with the current seqId (as opposed to seqId from the memstore snapshot). This creates an hfile with larger seqId than what its contents are. Previous behavior in 0.98 and 1.0 (I believe) is that after flush prepare and interruption / exception will cause RS abort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570198#comment-14570198 ] Duo Zhang edited comment on HBASE-13811 at 6/3/15 3:02 AM: --- Write a testcase. It fails for me with {noformat} java.lang.AssertionError: actual array was null at org.apache.hadoop.hbase.regionserver.TestSplitWalDataLoss.test(TestSplitWalDataLoss.java:153) {noformat} I think the fix should also be applied to branch-1.1 since it will cause data loss even if we turn off flush per CF(but no issues if you turn on DLR...). [~ndimiduk] was (Author: apache9): Write a testcase. It fails for me with {noformat} java.lang.AssertionError: actual array was null at org.apache.hadoop.hbase.regionserver.TestSplitWalDataLoss.test(TestSplitWalDataLoss.java:153) {noformat} I think the fix should also be applied to branch-1.1 since it will cause data loss even if we turn off flush per CF. [~ndimiduk] Splitting WALs, we are filtering out too many edits - DATALOSS --- Key: HBASE-13811 URL: https://issues.apache.org/jira/browse/HBASE-13811 Project: HBase Issue Type: Bug Components: wal Affects Versions: 2.0.0, 1.2.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 1.2.0 Attachments: 13811.branch-1.txt, 13811.txt, HBASE-13811.testcase.patch I've been running ITBLLs against branch-1 around HBASE-13616 (move of ServerShutdownHandler to pv2). I have come across an instance of dataloss. My patch for HBASE-13616 was in place so can only think it the cause (but cannot see how). When we split the logs, we are skipping legit edits. Digging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13811: -- Attachment: HBASE-13811.testcase.patch Write a testcase. It fails for me with {noformat} java.lang.AssertionError: actual array was null at org.apache.hadoop.hbase.regionserver.TestSplitWalDataLoss.test(TestSplitWalDataLoss.java:153) {noformat} I think the fix should also be applied to branch-1.1 since it will cause data loss even if we turn off flush per CF. [~ndimiduk] Splitting WALs, we are filtering out too many edits - DATALOSS --- Key: HBASE-13811 URL: https://issues.apache.org/jira/browse/HBASE-13811 Project: HBase Issue Type: Bug Components: wal Affects Versions: 2.0.0, 1.2.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 1.2.0 Attachments: 13811.branch-1.txt, 13811.txt, HBASE-13811.testcase.patch I've been running ITBLLs against branch-1 around HBASE-13616 (move of ServerShutdownHandler to pv2). I have come across an instance of dataloss. My patch for HBASE-13616 was in place so can only think it the cause (but cannot see how). When we split the logs, we are skipping legit edits. Digging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570273#comment-14570273 ] Duo Zhang commented on HBASE-13811: --- Yeah, with the patch in place {code} for (;;) { RegionStoreSequenceIds ids = testUtil.getMiniHBaseCluster().getMaster().getServerManager() .getLastFlushedSequenceId(region.getRegionInfo().getEncodedNameAsBytes()); if (ids.getStoreSequenceId(0).getSequenceId() > oldestSeqIdOfStore) { break; } Thread.sleep(100); } {code} This will not happen so the testcase does not quit... Let me see... Splitting WALs, we are filtering out too many edits - DATALOSS --- Key: HBASE-13811 URL: https://issues.apache.org/jira/browse/HBASE-13811 Project: HBase Issue Type: Bug Components: wal Affects Versions: 2.0.0, 1.2.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 1.2.0 Attachments: 13811.branch-1.txt, 13811.txt, HBASE-13811.testcase.patch I've been running ITBLLs against branch-1 around HBASE-13616 (move of ServerShutdownHandler to pv2). I have come across an instance of dataloss. My patch for HBASE-13616 was in place so can only think it the cause (but cannot see how). When we split the logs, we are skipping legit edits. Digging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13811: -- Attachment: HBASE-13811-v1.testcase.patch [~stack] Try this one? It just makes a RegionServerReport manually instead of waiting for the regionserver thread to do it for us. I modified getEarliestMemstoreSeqNum and it passed locally. Splitting WALs, we are filtering out too many edits - DATALOSS --- Key: HBASE-13811 URL: https://issues.apache.org/jira/browse/HBASE-13811 Project: HBase Issue Type: Bug Components: wal Affects Versions: 2.0.0, 1.2.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 1.2.0 Attachments: 13811.branch-1.txt, 13811.txt, HBASE-13811-v1.testcase.patch, HBASE-13811.testcase.patch I've been running ITBLLs against branch-1 around HBASE-13616 (move of ServerShutdownHandler to pv2). I have come across an instance of dataloss. My patch for HBASE-13616 was in place so can only think it the cause (but cannot see how). When we split the logs, we are skipping legit edits. Digging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609386#comment-14609386 ] Duo Zhang commented on HBASE-13970: --- [~ram_krish] Any progress? Thanks. NPE during compaction in trunk -- Key: HBASE-13970 URL: https://issues.apache.org/jira/browse/HBASE-13970 Project: HBase Issue Type: Bug Affects Versions: 2.0.0, 0.98.13, 1.2.0, 1.1.1 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 2.0.0, 0.98.14, 1.2.0, 1.1.2 Attachments: HBASE-13970-v1.patch, HBASE-13970.patch Updated the trunk.. Loaded the table with PE tool. Trigger a flush to ensure all data is flushed out to disk. When the first compaction is triggered we get an NPE and this is very easy to reproduce {code} 015-06-25 21:33:46,041 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,051 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Flushing 1/1 column families, memstore=76.91 MB 2015-06-25 21:33:46,159 ERROR [regionserver/stobdtserver3/10.224.54.70:16040-longCompactions-1435248183945] regionserver.CompactSplitThread: Compaction failed Request = regionName=TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4., storeName=info, fileCount=3, fileSize=343.4 M (114.5 M, 114.5 M, 114.5 M), priority=3, time=7536968291719985 java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController$ActiveCompaction.access$700(PressureAwareCompactionThroughputController.java:79) at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController.finish(PressureAwareCompactionThroughputController.java:238) at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:306) at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:106) at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:112) at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1202) at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1792) at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:524) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-25 21:33:46,745 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=1534, memsize=76.9 M, hasBloomFilter=true, into tmp file hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/.tmp/942ba0831a0047a08987439e34361a0c 2015-06-25 21:33:46,772 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HStore: Added hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/info/942ba0831a0047a08987439e34361a0c, entries=68116, sequenceid=1534, filesize=68.7 M 2015-06-25 21:33:46,773 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Finished memstore flush of ~76.91 MB/80649344, currentsize=0 B/0 for region TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4. 
in 723ms, sequenceid=1534, compaction requested=true 2015-06-25 21:33:46,780 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/reached/TestTable 2015-06-25 21:33:46,790 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/TestTable 2015-06-25 21:33:46,791 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort 2015-06-25 21:33:46,803 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,818 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort {code} Will check this on what is the reason behind it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13970: -- Resolution: Fixed Assignee: Duo Zhang (was: ramkrishna.s.vasudevan) Hadoop Flags: Reviewed Fix Version/s: (was: 1.0.2) Status: Resolved (was: Patch Available) Pushed to all branches except branch-1.0(HBASE-8329 has not applied to branch-1.0). Thanks [~anoopsamjohn] and [~ram_krish]. NPE during compaction in trunk -- Key: HBASE-13970 URL: https://issues.apache.org/jira/browse/HBASE-13970 Project: HBase Issue Type: Bug Affects Versions: 2.0.0, 0.98.13, 1.2.0, 1.1.1 Reporter: ramkrishna.s.vasudevan Assignee: Duo Zhang Fix For: 2.0.0, 0.98.14, 1.1.2, 1.3.0, 1.2.1 Attachments: HBASE-13970-v1.patch, HBASE-13970.patch Updated the trunk.. Loaded the table with PE tool. Trigger a flush to ensure all data is flushed out to disk. When the first compaction is triggered we get an NPE and this is very easy to reproduce {code} 015-06-25 21:33:46,041 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,051 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Flushing 1/1 column families, memstore=76.91 MB 2015-06-25 21:33:46,159 ERROR [regionserver/stobdtserver3/10.224.54.70:16040-longCompactions-1435248183945] regionserver.CompactSplitThread: Compaction failed Request = regionName=TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4., storeName=info, fileCount=3, fileSize=343.4 M (114.5 M, 114.5 M, 114.5 M), priority=3, time=7536968291719985 java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController$ActiveCompaction.access$700(PressureAwareCompactionThroughputController.java:79) at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController.finish(PressureAwareCompactionThroughputController.java:238) at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:306) at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:106) at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:112) at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1202) at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1792) at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:524) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-25 21:33:46,745 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=1534, memsize=76.9 M, hasBloomFilter=true, into tmp file hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/.tmp/942ba0831a0047a08987439e34361a0c 2015-06-25 21:33:46,772 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HStore: Added hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/info/942ba0831a0047a08987439e34361a0c, entries=68116, sequenceid=1534, filesize=68.7 M 2015-06-25 21:33:46,773 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] 
regionserver.HRegion: Finished memstore flush of ~76.91 MB/80649344, currentsize=0 B/0 for region TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4. in 723ms, sequenceid=1534, compaction requested=true 2015-06-25 21:33:46,780 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/reached/TestTable 2015-06-25 21:33:46,790 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/TestTable 2015-06-25 21:33:46,791 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort 2015-06-25 21:33:46,803 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,818 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event:
[jira] [Commented] (HBASE-14262) Big Trunk unit tests failing with OutOfMemoryError: unable to create new native thread
[ https://issues.apache.org/jira/browse/HBASE-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704233#comment-14704233 ] Duo Zhang commented on HBASE-14262: --- {quote} If I revert HBASE-13065, TestAcidGuarantees passes on my local mac... {quote} This could happen, since reducing the heap size also increases the native memory available to us, so we can create more threads... And I dealt with TestAcidGuarantees when the new AsyncRpcClient was introduced; I remember that in the end I created a static EventLoopGroup and made all AsyncRpcClient instances share it to reduce the thread count... So I suggest we verify the jstack result first to see if we really create too many threads? Maybe we are right at the precipice, so sometimes it fails and sometimes it does not... Thanks. Big Trunk unit tests failing with OutOfMemoryError: unable to create new native thread Key: HBASE-14262 URL: https://issues.apache.org/jira/browse/HBASE-14262 Project: HBase Issue Type: Bug Components: test Reporter: stack Assignee: stack The bit unit tests are coming in with OOME, can't create native threads. I was also getting the OOME locally running on MBP. git bisect got me to HBASE-13065, where we upped the test heap for TestDistributedLogSplitting back in feb. Around the time that this went in, we had similar OOME issues but then it was because we were doing 32bit JVMs. It does not seem to be the case here. A recent run failed all the below and most are OOME: {code} {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplicationEndpoint org.apache.hadoop.hbase.replication.TestPerTableCFReplication org.apache.hadoop.hbase.wal.TestBoundedRegionGroupingProvider org.apache.hadoop.hbase.replication.TestReplicationKillMasterRSCompressed org.apache.hadoop.hbase.replication.regionserver.TestRegionReplicaReplicationEndpointNoMaster org.apache.hadoop.hbase.replication.TestReplicationKillSlaveRS org.apache.hadoop.hbase.replication.regionserver.TestRegionReplicaReplicationEndpoint org.apache.hadoop.hbase.replication.TestMasterReplication org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.regionserver.TestRegionMergeTransactionOnCluster org.apache.hadoop.hbase.regionserver.TestRegionFavoredNodes org.apache.hadoop.hbase.replication.TestMultiSlaveReplication org.apache.hadoop.hbase.zookeeper.TestZKLeaderManager org.apache.hadoop.hbase.master.TestMasterRestartAfterDisablingTable org.apache.hadoop.hbase.TestGlobalMemStoreSize org.apache.hadoop.hbase.wal.TestWALFiltering org.apache.hadoop.hbase.replication.TestReplicationSmallTests org.apache.hadoop.hbase.replication.TestReplicationSyncUpTool org.apache.hadoop.hbase.replication.TestReplicationWithTags org.apache.hadoop.hbase.master.procedure.TestTruncateTableProcedure org.apache.hadoop.hbase.replication.TestReplicationChangingPeerRegionservers org.apache.hadoop.hbase.wal.TestDefaultWALProviderWithHLogKey org.apache.hadoop.hbase.regionserver.TestPerColumnFamilyFlush org.apache.hadoop.hbase.snapshot.TestMobRestoreFlushSnapshotFromClient org.apache.hadoop.hbase.master.procedure.TestCreateTableProcedure org.apache.hadoop.hbase.wal.TestWALFactory org.apache.hadoop.hbase.master.procedure.TestServerCrashProcedure org.apache.hadoop.hbase.replication.TestReplicationDisableInactivePeer org.apache.hadoop.hbase.master.procedure.TestAddColumnFamilyProcedure org.apache.hadoop.hbase.mapred.TestMultiTableSnapshotInputFormat org.apache.hadoop.hbase.master.procedure.TestEnableTableProcedure
org.apache.hadoop.hbase.master.TestMasterFailoverBalancerPersistence org.apache.hadoop.hbase.TestStochasticBalancerJmxMetrics {color:red}-1 core zombie tests{color}. There are 16 zombie test(s): at org.apache.hadoop.hbase.security.visibility.TestVisibilityLabels.testVisibilityLabelsWithComplexLabels(TestVisibilityLabels.java:216) at org.apache.hadoop.hbase.mapred.TestTableInputFormat.testTableRecordReaderScannerFail(TestTableInputFormat.java:281) at
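For reference, the thread-sharing trick mentioned in the comment above looks roughly like this; class and field names are illustrative, not the actual AsyncRpcClient code:
{code}
import io.netty.channel.nio.NioEventLoopGroup;

// Sketch: one static, shared event loop group with a small fixed IO thread
// count, instead of a fresh group (and fresh native threads) per client.
class SharedEventLoops {
  static final NioEventLoopGroup GROUP = new NioEventLoopGroup(2);
}
{code}
Every client instance then reuses {{SharedEventLoops.GROUP}} instead of constructing its own group, which keeps the native thread count flat no matter how many clients a test creates.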
[jira] [Commented] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602704#comment-14602704 ] Duo Zhang commented on HBASE-13970: --- No, the counter value cannot be the same for 2 threads. For a given value of the counter, only one CAS operation can succeed; the failed thread will try another round, by which time the value of the counter will have changed. This is the getAndIncrement method in AtomicInteger. The only difference is how the next value is calculated. {code} /** * Atomically increments by one the current value. * * @return the previous value */ public final int getAndIncrement() { for (;;) { int current = get(); int next = current + 1; if (compareAndSet(current, next)) return current; } } {code} NPE during compaction in trunk -- Key: HBASE-13970 URL: https://issues.apache.org/jira/browse/HBASE-13970 Project: HBase Issue Type: Bug Affects Versions: 2.0.0, 0.98.13, 1.2.0, 1.1.1 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 2.0.0, 0.98.14, 1.2.0, 1.1.2 Attachments: HBASE-13970-v1.patch, HBASE-13970.patch Updated the trunk.. Loaded the table with PE tool. Trigger a flush to ensure all data is flushed out to disk. When the first compaction is triggered we get an NPE and this is very easy to reproduce {code} 015-06-25 21:33:46,041 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,051 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Flushing 1/1 column families, memstore=76.91 MB 2015-06-25 21:33:46,159 ERROR [regionserver/stobdtserver3/10.224.54.70:16040-longCompactions-1435248183945] regionserver.CompactSplitThread: Compaction failed Request = regionName=TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4., storeName=info, fileCount=3, fileSize=343.4 M (114.5 M, 114.5 M, 114.5 M), priority=3, time=7536968291719985 java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController$ActiveCompaction.access$700(PressureAwareCompactionThroughputController.java:79) at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController.finish(PressureAwareCompactionThroughputController.java:238) at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:306) at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:106) at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:112) at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1202) at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1792) at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:524) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-25 21:33:46,745 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=1534, memsize=76.9 M, hasBloomFilter=true, into tmp file hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/.tmp/942ba0831a0047a08987439e34361a0c 2015-06-25 21:33:46,772 INFO
[rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HStore: Added hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/info/942ba0831a0047a08987439e34361a0c, entries=68116, sequenceid=1534, filesize=68.7 M 2015-06-25 21:33:46,773 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Finished memstore flush of ~76.91 MB/80649344, currentsize=0 B/0 for region TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4. in 723ms, sequenceid=1534, compaction requested=true 2015-06-25 21:33:46,780 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/reached/TestTable 2015-06-25 21:33:46,790 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/TestTable 2015-06-25 21:33:46,791 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure
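The wrap-around variant from the v1 patch would then presumably differ only in that next-value computation, roughly like this (a sketch of the idea, not the committed code):
{code}
// Same CAS loop as getAndIncrement above, but the next value wraps to 0 at
// Integer.MAX_VALUE so the counter never goes negative.
public final int getAndIncrementWithWrap() {
  for (;;) {
    int current = get();
    int next = (current == Integer.MAX_VALUE) ? 0 : current + 1;
    if (compareAndSet(current, next))
      return current;
  }
}
{code}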
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14650729#comment-14650729 ] Duo Zhang commented on HBASE-14178: --- I think the problem here is that we only allow one thread to read an HFileBlock at a time, even if the BlockCache is disabled. If the BlockCache is enabled this is useful, since other threads can get the block directly from the BlockCache after one thread has successfully read it and put it there. But it is needless overhead when the BlockCache is disabled (although I think one should always turn on the BlockCache if there are many requests for the same HFileBlock); in that case we should just read the HFileBlock from HDFS. [~chenheng], could you please prepare a patch for master first? Thanks. regionserver blocks because of waiting for offsetLock - Key: HBASE-14178 URL: https://issues.apache.org/jira/browse/HBASE-14178 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.98.6 Reporter: Heng Chen Priority: Critical Fix For: 0.98.6 Attachments: HBASE-14178-0.98.patch, jstack My regionserver blocks, and all client rpc timeout. I print the regionserver's jstack, it seems a lot of thread was blocked for waiting offsetLock, detail infomation belows: PS: my table's block cache is off {code} B.DefaultRpcServer.handler=2,queue=2,port=60020 #82 daemon prio=5 os_prio=0 tid=0x01827000 nid=0x2cdc in Object.wait() [0x7f3831b72000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79) - locked 0x000773af7c18 (a org.apache.hadoop.hbase.util.IdLock$Entry) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:352) at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:524) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:572) at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257) at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173) at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:313) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269) at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695) at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683) at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3889) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3969) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3847) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3820) - locked 0x0005e5c55ad0 (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3807) at
org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4779) at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4753) at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2916) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29583) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - 0x0005e5c55c08 (a java.util.concurrent.locks.ReentrantLock$NonfairSync) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653201#comment-14653201 ] Duo Zhang commented on HBASE-14178: --- [~anoopsamjohn] {{CacheConfig}} is a bit confusing, I think. {{family.isBlockCacheEnabled}} only maps to {{cacheDataOnRead}}, and we still have a chance to put data into the {{BlockCache}} if we set {{cacheDataOnWrite}} or {{prefetchOnOpen}} to {{true}}, even if {{cacheDataOnRead}} is {{false}}. So I suggest we add a new method called {{shouldReadBlockFromCache}} that checks every way a block may end up in the {{BlockCache}}. Thanks. regionserver blocks because of waiting for offsetLock - Key: HBASE-14178 URL: https://issues.apache.org/jira/browse/HBASE-14178 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.98.6 Reporter: Heng Chen Priority: Critical Fix For: 0.98.6 Attachments: HBASE-14178-0.98.patch, HBASE-14178.patch, HBASE-14178_v1.patch, HBASE-14178_v2.patch, HBASE-14178_v3.patch, HBASE-14178_v4.patch, jstack My regionserver blocks, and all client rpc timeout. I print the regionserver's jstack, it seems a lot of threads were blocked for waiting offsetLock, detail infomation belows: PS: my table's block cache is off {code} B.DefaultRpcServer.handler=2,queue=2,port=60020 #82 daemon prio=5 os_prio=0 tid=0x01827000 nid=0x2cdc in Object.wait() [0x7f3831b72000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79) - locked 0x000773af7c18 (a org.apache.hadoop.hbase.util.IdLock$Entry) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:352) at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:524) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:572) at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257) at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173) at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:313) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269) at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695) at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683) at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3889) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3969) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3847) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3820) - locked 0x0005e5c55ad0 (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3807) at
org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4779) at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4753) at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2916) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29583) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - 0x0005e5c55c08 (a java.util.concurrent.locks.ReentrantLock$NonfairSync) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
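A minimal sketch of what such a method might look like, using the existing {{CacheConfig}} flags; the exact set of conditions in the final patch may differ:
{code}
import org.apache.hadoop.hbase.io.hfile.CacheConfig;

// Sketch: only take the per-block IdLock / cache lookup path when some
// CacheConfig setting could actually have placed the block in the cache.
final class CacheReadCheck {
  static boolean shouldReadBlockFromCache(CacheConfig cacheConf) {
    return cacheConf.shouldCacheDataOnRead()
        || cacheConf.shouldCacheDataOnWrite()
        || cacheConf.shouldPrefetchOnOpen();
  }
}
{code}
When this returns false, the reader could skip the offsetLock entirely and stream the block straight from HDFS, which is the contended path in the jstack above.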
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653415#comment-14653415 ] Duo Zhang commented on HBASE-14178: --- Yes, the problem here is the lock, not when to read from the cache... So if we can make sure the block will not be put into the cache after we fetch it from HDFS, then we can bypass the locking step.
regionserver blocks because of waiting for offsetLock
Key: HBASE-14178
URL: https://issues.apache.org/jira/browse/HBASE-14178
Project: HBase
Issue Type: Bug
Components: regionserver
Affects Versions: 0.98.6
Reporter: Heng Chen
Priority: Critical
Fix For: 0.98.6
Attachments: HBASE-14178-0.98.patch, HBASE-14178.patch, HBASE-14178_v1.patch, HBASE-14178_v2.patch, HBASE-14178_v3.patch, HBASE-14178_v4.patch, jstack
My regionserver blocks and all client RPCs time out. I printed the regionserver's jstack; it seems a lot of threads were blocked waiting for the offsetLock. Detailed information below (PS: my table's block cache is off):
{code}
B.DefaultRpcServer.handler=2,queue=2,port=60020 #82 daemon prio=5 os_prio=0 tid=0x01827000 nid=0x2cdc in Object.wait() [0x7f3831b72000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79)
        - locked 0x000773af7c18 (a org.apache.hadoop.hbase.util.IdLock$Entry)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:352)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:524)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:572)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
        at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:313)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3889)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3969)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3847)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3820)
        - locked 0x0005e5c55ad0 (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3807)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4779)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4753)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2916)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29583)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
        at java.lang.Thread.run(Thread.java:745)
   Locked ownable synchronizers:
        - 0x0005e5c55c08 (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
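To make the shortcut concrete, here is a minimal, self-contained model of the per-offset locking being discussed (hypothetical names, not the committed patch; the real HFileReaderV2.readBlock is more involved). One lock per block offset serializes readers so that only one of them goes to HDFS and the rest pick the block up from the cache; when the family can never populate the cache, the lock and the second lookup are skipped entirely:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: models the locking scheme, not HBase's actual implementation.
class BlockReaderModel {
  private final Map<Long, byte[]> blockCache = new ConcurrentHashMap<>();
  private final Map<Long, ReentrantLock> offsetLocks = new ConcurrentHashMap<>();
  private final boolean cacheOnRead; // stands in for the family-level setting

  BlockReaderModel(boolean cacheOnRead) {
    this.cacheOnRead = cacheOnRead;
  }

  byte[] readBlock(long offset) {
    if (!cacheOnRead) {
      // The block will never be put into the cache, so no other thread can
      // do the read for us: go straight to the filesystem, no lock needed.
      return readFromHdfs(offset);
    }
    byte[] cached = blockCache.get(offset);
    if (cached != null) {
      return cached;
    }
    ReentrantLock lock = offsetLocks.computeIfAbsent(offset, k -> new ReentrantLock());
    lock.lock();
    try {
      // Re-check under the lock: another thread may have cached it meanwhile.
      cached = blockCache.get(offset);
      if (cached != null) {
        return cached;
      }
      byte[] block = readFromHdfs(offset);
      blockCache.put(offset, block);
      return block;
    } finally {
      lock.unlock();
    }
  }

  private byte[] readFromHdfs(long offset) {
    return new byte[0]; // placeholder for the actual HDFS read
  }
}
{code}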
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Attachment: (was: HBASE-14178_v6.patch)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Attachment: HBASE-14178_v6.patch A flaky test (TestFastFail) failed in pre-commit. Retry.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Attachment: (was: HBASE-14178_v6.patch)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Attachment: HBASE-14178_v6.patch Retry.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654613#comment-14654613 ] Duo Zhang commented on HBASE-14178: --- Oh, I looked at the wrong line number... It is the same reason: if {{blockType}} is {{null}}, the safe way is to read it from the {{BlockCache}} first. Maybe it is an {{INDEX}} or {{BLOOM}} block. Thanks.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Attachment: HBASE-14178_v6.patch Retry.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Attachment: (was: HBASE-14178_v6.patch)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654607#comment-14654607 ] Duo Zhang commented on HBASE-14178: --- [~tedyu] If {{blockType}} is {{null}} then we cannot determine whether we could cache the block until we actually read it from HDFS. So I think it is appropriate to return false here? We should always acquire the lock unless we can make sure the block will not be cached. Thanks.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
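A hedged, self-contained sketch of that rule (the helper name is taken from the discussion, but its exact shape and polarity in the patch under review may differ): treat an unknown block type as potentially cacheable, so the lock-and-check path stays in use unless the block surely cannot end up in the cache:
{code}
// Hypothetical sketch of the conservative rule above, not the actual patch:
// always take the lock unless we can make sure the block will not be cached.
enum BlockType { DATA, INDEX, BLOOM }

final class CacheMissPolicy {
  static boolean shouldLockOnCacheMiss(BlockType blockType, boolean cacheDataOnRead) {
    if (blockType == null) {
      // Unknown type: it may turn out to be an INDEX or BLOOM block that is
      // cached even when data-block caching is off, so stay on the safe path.
      return true;
    }
    if (blockType == BlockType.DATA) {
      return cacheDataOnRead; // data blocks reach the cache only when enabled
    }
    return true; // index/bloom blocks are generally cacheable
  }
}
{code}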
[jira] [Commented] (HBASE-13408) HBase In-Memory Memstore Compaction
[ https://issues.apache.org/jira/browse/HBASE-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647348#comment-14647348 ] Duo Zhang commented on HBASE-13408: --- OK, I get your point. After a memstore compaction we may drop some old cells, so setting a new value for the {{oldestUnflushedSeqId}} in the WAL is reasonable. And yes, this can keep the WAL from triggering a flush for log truncation in your case. But I still think you can find a way to set it without changing the semantics of flush... Flush is a very critical operation in HBase, so you should keep away from it as much as possible unless you have to... Or, a more difficult way: remove the old flush operation and introduce some new operations such as 'reduce your memory usage' and 'persist old cells' and so on. You can put your compaction logic in the 'reduce your memory usage' operation. Thanks.
HBase In-Memory Memstore Compaction
Key: HBASE-13408
URL: https://issues.apache.org/jira/browse/HBASE-13408
Project: HBase
Issue Type: New Feature
Reporter: Eshcar Hillel
Attachments: HBaseIn-MemoryMemstoreCompactionDesignDocument-ver02.pdf, HBaseIn-MemoryMemstoreCompactionDesignDocument.pdf, InMemoryMemstoreCompactionEvaluationResults.pdf
A store unit holds a column family in a region, where the memstore is its in-memory component. The memstore absorbs all updates to the store; from time to time these updates are flushed to a file on disk, where they are compacted. Unlike disk components, the memstore is not compacted until it is written to the filesystem and optionally to block-cache. This may result in underutilization of the memory due to duplicate entries per row, for example, when hot data is continuously updated. Generally, the faster the data is accumulated in memory, the more flushes are triggered, and the data sinks to disk more frequently, slowing down retrieval of data, even if very recent. In high-churn workloads, compacting the memstore can help maintain the data in memory, and thereby speed up data retrieval. We suggest a new compacted memstore with the following principles:
1. The data is kept in memory for as long as possible
2. Memstore data is either compacted or in process of being compacted
3. Allow a panic mode, which may interrupt an in-progress compaction and force a flush of part of the memstore.
We suggest applying this optimization only to in-memory column families. A design document is attached. This feature was previously discussed in HBASE-5311.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13408) HBase In-Memory Memstore Compaction
[ https://issues.apache.org/jira/browse/HBASE-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646101#comment-14646101 ] Duo Zhang commented on HBASE-13408: --- The things we are talking about here are all 'In Memory', so I do not think we need to modify the WAL... I think all the logic could be done in a special memstore implementation? For example, you can set the flush-size to 128M, and introduce a compact-size of 32M which only considers the active set size. When the active set reaches 32M, you put it into the pipeline and try to compact the segments in the pipeline to reduce memory usage. The upper layer does not care about how many segments you have; it only cares about the total memstore size. If that reaches 128M then a flush request comes, and you should flush all data to disk. If there are many redundant cells, the total memstore may never reach 128M; I think this is exactly what we want here? And this way you do not change the semantics of flush, so log truncation should work as well. And I think you could use some more compact data structures instead of a skip list, since the segments in the pipeline are read only? This may bring some benefits even if we do not have many redundant cells. What do you think? [~eshcar]. Sorry a bit late. Thanks.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
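An illustrative sketch of the thresholds described above (hypothetical names and numbers; the real design is in the attached documents): in-memory compaction is driven by the active set size alone, while the flush decision still looks only at the total memstore size, leaving flush semantics untouched:
{code}
// Illustrative model of the proposal above (hypothetical names and sizes).
class CompactingMemstoreModel {
  static final long FLUSH_SIZE = 128L * 1024 * 1024;  // assumed 128M flush-size
  static final long COMPACT_SIZE = 32L * 1024 * 1024; // assumed 32M compact-size

  private long activeSize;   // mutable active set, e.g. backed by a skip list
  private long pipelineSize; // read-only segments, candidates for compaction

  void onWrite(long cellSize) {
    activeSize += cellSize;
    if (activeSize >= COMPACT_SIZE) {
      // Move the active set into the read-only pipeline and compact there;
      // the upper layer never sees the individual segments.
      pipelineSize += compactInMemory(activeSize);
      activeSize = 0;
    }
    if (activeSize + pipelineSize >= FLUSH_SIZE) {
      // Unchanged flush semantics: when the total memstore size is reached,
      // everything goes to disk, so WAL truncation keeps working as before.
      flushAllToDisk();
    }
  }

  private long compactInMemory(long size) {
    return size / 2; // placeholder: redundant cells dropped by compaction
  }

  private void flushAllToDisk() {
    activeSize = 0;
    pipelineSize = 0;
  }
}
{code}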
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659524#comment-14659524 ] Duo Zhang commented on HBASE-14178: --- Yes, we do one more round of checking with the lock held because another thread may have already cached the block for us. What happens here is that we disable BC for the given family, so it is impossible for another thread to do the work for us; we can just read from HDFS and bypass the second round of checking the BC. And as I mentioned above, there are lots of configurations for BC, and {{family.isBlockCacheEnabled()}} is treated as {{cacheDataOnRead}} (you can see the code pasted by [~chenheng]; maybe it is a mistake, but it is not important for this issue I think, we could open another issue for it). So the safe way to determine whether we need the second 'read BC with lock' round is to check whether we will put the block back into the BC after we read it from HDFS. This is why we introduce a {{shouldLockOnCacheMiss}} method here. Maybe we could change the name to {{shouldReadAgainWithLockOnCacheMiss}}? Thanks.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659574#comment-14659574 ] Duo Zhang edited comment on HBASE-14178 at 8/6/15 6:46 AM: --- [~ram_krish] A bit strange, but now {{family.isBlockCacheEnabled}} only means {{cacheDataOnRead}}... {{cacheDataOnWrite}} and some other configurations such as {{prefetchOnOpen}} are separate, which means you can set them to {{true}} and they will take effect even if you explicitly call {{family.setBlockCacheEnabled(false)}}. So theoretically, even if you disable BC at the family level, it is still possible that we could find the block in the BC...
was (Author: apache9): [~ram_krish] A bit strange but now, {{family.isBlockCacheEnabled}} only means {{cacheDataOnRead }}... {{cacheDataOnWrite}} and some other configurations such as {{prefetchOnOpen}} are separated which means you can set them to {{true}} and will take effect even if you explicitly call {{family.setBlockCacheEnabled(false)}}. So theoretically even if you disable BC at family level, it is still possible that we could find the block in BC...
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659574#comment-14659574 ] Duo Zhang commented on HBASE-14178: --- [~ram_krish] A bit strange, but now {{family.isBlockCacheEnabled}} only means {{cacheDataOnRead}}... {{cacheDataOnWrite}} and some other configurations such as {{prefetchOnOpen}} are separate, which means you can set them to {{true}} and they will take effect even if you explicitly call {{family.setBlockCacheEnabled(false)}}. So theoretically, even if you disable BC at the family level, it is still possible that we could find the block in the BC...
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
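A simplified model of the cache-configuration behavior described above (not the real HBase CacheConfig; names are for illustration): the family-level switch maps only to cache-on-read, while cache-on-write and prefetch-on-open are independent settings, so a block can still land in the cache even with the family-level cache "disabled":
{code}
// Sketch only: models the configuration semantics discussed in the comment.
class CacheConfigModel {
  final boolean cacheDataOnRead;  // what family.isBlockCacheEnabled() maps to
  final boolean cacheDataOnWrite; // independent setting
  final boolean prefetchOnOpen;   // independent prefetch setting

  CacheConfigModel(boolean cacheDataOnRead, boolean cacheDataOnWrite, boolean prefetchOnOpen) {
    this.cacheDataOnRead = cacheDataOnRead;
    this.cacheDataOnWrite = cacheDataOnWrite;
    this.prefetchOnOpen = prefetchOnOpen;
  }

  // A cache hit is possible whenever any path can populate the cache, which
  // is why disabling BC at the family level alone is not enough to skip it.
  boolean blockMayBeInCache() {
    return cacheDataOnRead || cacheDataOnWrite || prefetchOnOpen;
  }
}
{code}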
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659614#comment-14659614 ] Duo Zhang commented on HBASE-14178: --- [~anoopsamjohn] [~ram_krish] Any concerns on patch v7 then? Thanks.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Assignee: Heng Chen Priority: Major (was: Critical) Fix Version/s: (was: 0.98.6) 1.1.2 1.2.0 1.0.2 0.98.14 2.0.0 regionserver blocks because of waiting for offsetLock - Key: HBASE-14178 URL: https://issues.apache.org/jira/browse/HBASE-14178 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.98.6 Reporter: Heng Chen Assignee: Heng Chen Fix For: 2.0.0, 0.98.14, 1.0.2, 1.2.0, 1.1.2 Attachments: HBASE-14178-0.98.patch, HBASE-14178.patch, HBASE-14178_v1.patch, HBASE-14178_v2.patch, HBASE-14178_v3.patch, HBASE-14178_v4.patch, HBASE-14178_v5.patch, HBASE-14178_v6.patch, HBASE-14178_v7.patch, jstack My regionserver blocks, and all client RPCs time out. I printed the regionserver's jstack; it seems a lot of threads were blocked waiting for the offsetLock. Detailed information below. PS: my table's block cache is off.
{code}
"B.DefaultRpcServer.handler=2,queue=2,port=60020" #82 daemon prio=5 os_prio=0 tid=0x01827000 nid=0x2cdc in Object.wait() [0x7f3831b72000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79)
        - locked 0x000773af7c18 (a org.apache.hadoop.hbase.util.IdLock$Entry)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:352)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:524)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:572)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
        at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:313)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3889)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3969)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3847)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3820)
        - locked 0x0005e5c55ad0 (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3807)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4779)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4753)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2916)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29583)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - 0x0005e5c55c08 (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
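To make the blockage concrete: every handler above is parked in IdLock.getLockEntry, because HFileReaderV2.readBlock serializes all readers of the same block offset through a per-offset IdLock. With the block cache off there is never a cached copy to short-circuit the wait, so every Get of a hot block queues behind a single HDFS read. A minimal sketch of the pattern (getLockEntry/releaseLockEntry are the real IdLock methods from the trace; everything else is an illustrative stand-in, not the actual HFileReaderV2 code):
{code:title=OffsetLockSketch.java}
import java.io.IOException;
import org.apache.hadoop.hbase.util.IdLock;

public class OffsetLockSketch {
  private final IdLock offsetLock = new IdLock();

  byte[] readBlock(long offset) throws IOException {
    // One thread per offset proceeds; all others park here, which is the
    // IdLock.getLockEntry frame that dominates the jstack above.
    IdLock.Entry lockEntry = offsetLock.getLockEntry(offset);
    try {
      // With the block cache disabled there is no cached block to return,
      // so every caller of a hot offset serializes on this slow read.
      return readBlockFromHdfs(offset);
    } finally {
      offsetLock.releaseLockEntry(lockEntry);
    }
  }

  private byte[] readBlockFromHdfs(long offset) throws IOException {
    return new byte[0]; // stand-in for the real HDFS read
  }
}
{code}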
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Thanks [~chenheng] for the patch (and for the quick turnaround at the end, even though I had to make some revert commits...). Thanks to everyone who helped review the patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659728#comment-14659728 ] Duo Zhang edited comment on HBASE-14178 at 8/6/15 9:23 AM: --- Pushed to all branches. Thanks [~chenheng] for the patch (and for the quick turnaround at the end, even though I had to make some revert commits...). Thanks to everyone who helped review the patch. was (Author: apache9): Thanks [~chenheng] for the patch (and for the quick turnaround at the end, even though I had to make some revert commits...). Thanks to everyone who helped review the patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659681#comment-14659681 ] Duo Zhang commented on HBASE-14178: --- [~chenheng] Could you please also prepare patches for the branches other than master and 0.98? Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659274#comment-14659274 ] Duo Zhang commented on HBASE-14178: --- [~anoopsamjohn] Any concerns about patch v6? Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629735#comment-14629735 ] Duo Zhang commented on HBASE-14100: --- [~ram_krish] Fine. Is HBASE-12213 almost done? If not, I think we could remove it first in this issue. Fix high priority findbugs warnings --- Key: HBASE-14100 URL: https://issues.apache.org/jira/browse/HBASE-14100 Project: HBase Issue Type: Bug Reporter: Duo Zhang Assignee: Duo Zhang See here: https://builds.apache.org/job/HBase-TRUNK/6654/findbugsResult/HIGH/ We have 6 high priority findbugs warnings. A high priority findbugs warning is usually a bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629695#comment-14629695 ] Duo Zhang commented on HBASE-14100: ---
{code:title=WALSplitter.java}
for (WriterThread t : writerThreads) {
  t.finish();
}
if (interrupt) {
  for (WriterThread t : writerThreads) {
    t.interrupt(); // interrupt the writer threads. We are stopping now.
  }
}

for (WriterThread t : writerThreads) {
  if (!progress_failed && reporter != null && !reporter.progress()) {
    progress_failed = true;
  }
  try {
    t.join();
  } catch (InterruptedException ie) {
    IOException iie = new InterruptedIOException();
    iie.initCause(ie);
    throw iie;
  }
}
controller.checkForErrors();
LOG.info((this.writerThreads == null ? 0 : this.writerThreads.size()) + // here we do a null check, but the code above uses writerThreads directly without a null check
    " split writers finished; closing...");
{code}
And here is the definition of writerThreads:
{code:title=WALSplitter.java}
protected final List<WriterThread> writerThreads = Lists.newArrayList();
{code}
It is initialized at construction and marked final, so I think we could just remove the null check? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629672#comment-14629672 ] Duo Zhang commented on HBASE-14100: --- {{DMI_INVOKING_TOSTRING_ON_ARRAY}}
{code:title=SplitLogManager.java}
FileStatus[] files = fs.listStatus(logDir);
if (files != null && files.length > 0) {
  LOG.warn("Returning success without actually splitting and "
      + "deleting all the log files in path " + logDir + ": " + files, ioe); // files is an array, should we use Arrays.toString here? Is it too noisy?
} else {
  LOG.warn("Unable to delete log src dir. Ignoring. " + logDir, ioe);
}
{code}
{code:title=SimpleRegionNormalizer.java}
LOG.debug("Table " + table + ", largest region "
    + largestRegion.getFirst().getRegionName() + " has size " // Should be getRegionNameAsString()?
    + largestRegion.getSecond() + ", more than 2 times than avg size, splitting");
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
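For reference, a self-contained illustration of what {{DMI_INVOKING_TOSTRING_ON_ARRAY}} flags and the {{Arrays.toString}} fix suggested above (plain Java, not the committed patch):
{code:title=ArrayLoggingSketch.java}
import java.util.Arrays;

public class ArrayLoggingSketch {
  public static void main(String[] args) {
    String[] files = { "wal.0001", "wal.0002" };
    // The default toString() of an array prints a type/hash marker such as
    // "[Ljava.lang.String;@6d06d69c", which is what findbugs complains about.
    System.out.println("files: " + files);
    // Arrays.toString renders the contents: "files: [wal.0001, wal.0002]".
    System.out.println("files: " + Arrays.toString(files));
  }
}
{code}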
[jira] [Updated] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14100: -- Status: Patch Available (was: Open) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14100: -- Attachment: HBASE-14100.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14100) Fix high priority findbugs warnings
Duo Zhang created HBASE-14100: - Summary: Fix high priority findbugs warnings Key: HBASE-14100 URL: https://issues.apache.org/jira/browse/HBASE-14100 Project: HBase Issue Type: Bug Reporter: Duo Zhang Assignee: Duo Zhang See here: https://builds.apache.org/job/HBase-TRUNK/6654/findbugsResult/HIGH/ We have 6 high priority findbugs warnings. A high priority findbugs warning is usually a bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629690#comment-14629690 ] Duo Zhang commented on HBASE-14100: --- {{IL_INFINITE_RECURSIVE_LOOP}}
{code:title=BloomFilterUtil.java}
/** Should only be used in tests */
public static boolean contains(byte[] buf, int offset, int length, ByteBuffer bloom) { // no reference to this method, so just remove it?
  return contains(buf, offset, length, bloom);
}
{code}
{code:title=RateLimiter.java}
// This method is for strictly testing purpose only
@VisibleForTesting
public void setNextRefillTime(long nextRefillTime) {
  this.setNextRefillTime(nextRefillTime);
}

public long getNextRefillTime() {
  return this.getNextRefillTime();
}
{code}
These two methods are both overridden by subclasses, so there is no actual problem right now. But I do not think it is a good idea to leave such confusing code here. Just mark the methods as abstract? Or leave the implementations empty? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
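A self-contained illustration of why {{IL_INFINITE_RECURSIVE_LOOP}} fires on such methods, with the abstract-method fix suggested above (the class names are simplified stand-ins, not the real RateLimiter):
{code:title=RecursionSketch.java}
public class RecursionSketch {
  abstract static class BrokenLimiter {
    public void setNextRefillTime(long nextRefillTime) {
      // Calls itself instead of doing any work: any direct invocation
      // recurses until StackOverflowError, which is exactly the warning.
      this.setNextRefillTime(nextRefillTime);
    }
  }

  abstract static class FixedLimiter {
    // The fix proposed in the comment above: no self-calling default body,
    // subclasses are forced to provide the real implementation.
    public abstract void setNextRefillTime(long nextRefillTime);
  }
}
{code}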
[jira] [Commented] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630679#comment-14630679 ] Duo Zhang commented on HBASE-14100: --- {{BloomFilterUtil}} is master-only. {{RateLimiter}} was introduced after branch-1.1. And the logging changes were introduced after branch-1.2 or branch-1. So branch-1.0 and 0.98 are not patched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HBASE-14113) Compile failed on master
[ https://issues.apache.org/jira/browse/HBASE-14113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang resolved HBASE-14113. --- Resolution: Cannot Reproduce Fixed by running {{mvn clean}} first. Compile failed on master Key: HBASE-14113 URL: https://issues.apache.org/jira/browse/HBASE-14113 Project: HBase Issue Type: Bug Components: build Reporter: Heng Chen {{mvn package -DskipTests}} Compile failed!
{code}
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.2:compile (default-compile) on project hbase-client: Compilation failure: Compilation failure:
[ERROR] /Users/chenheng/apache/hbase/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/BinaryComparator.java:[54,3] method does not override or implement a method from a supertype
[ERROR] /Users/chenheng/apache/hbase/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/NullComparator.java:[64,3] method does not override or implement a method from a supertype
[ERROR] /Users/chenheng/apache/hbase/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/LongComparator.java:[51,5] method does not override or implement a method from a supertype
[ERROR] /Users/chenheng/apache/hbase/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/BitComparator.java:[137,3] method does not override or implement a method from a supertype
[ERROR] /Users/chenheng/apache/hbase/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/BinaryPrefixComparator.java:[56,3] method does not override or implement a method from a supertype
{code}
My java version:
{code}
java version "1.8.0_40"
Java(TM) SE Runtime Environment (build 1.8.0_40-b25)
Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14100: -- Resolution: Fixed Fix Version/s: 1.3.0 1.1.2 1.2.0 2.0.0 Status: Resolved (was: Patch Available) Pushed to branch-1.1+. Thanks [~apurtell] and [~ram_krish]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14059) We should add a RS to the dead servers list if admin calls fail more than a threshold
[ https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633048#comment-14633048 ] Duo Zhang commented on HBASE-14059: --- {quote} In the issue I ran into it was a bad region causing the RS to be blocked for a long time. {quote} More details? What does 'bad' mean? And you said the region came back to normal when you killed the RS, so I think there may be another bug? In general, I agree with you: we should offline a RS if admin calls always fail. But it should be used to fix a 'bad' RS, not a 'bad' region. If there is a 'bad' region that cannot be fixed by reassignment, then as [~chenheng] said, the 'bad' region will kill all the regionservers in your cluster... Thanks. We should add a RS to the dead servers list if admin calls fail more than a threshold - Key: HBASE-14059 URL: https://issues.apache.org/jira/browse/HBASE-14059 Project: HBase Issue Type: Bug Components: master, regionserver, rpc Affects Versions: 0.98.13 Reporter: Esteban Gutierrez Assignee: Esteban Gutierrez Priority: Critical I ran into this problem twice this week: calls from the HBase master to a RS can time out since the RS call queue size has been maxed out; however, since the RS is not dead (ephemeral znode still present), the master keeps attempting to perform admin tasks like trying to open or close a region, but those operations eventually fail after we run out of retries or the assignment manager attempts to re-assign to other RSs. As side effects of this, I've noticed master operations being fully blocked, or RITs, since we cannot close the region and open it in a new location while the RS is not dead. A potential solution for this is to add the RS to the list of dead RSs after a certain number of calls from the master to the RS fail. I've only noticed the problem in 0.98.x, but it should be present in all versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14126) I'm using play framework, creating a sample Hbase project. And I keep getting this error: tried to access method com.google.common.base.Stopwatch.<init>()V from class
[ https://issues.apache.org/jira/browse/HBASE-14126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14634301#comment-14634301 ] Duo Zhang commented on HBASE-14126: --- HBase uses an early version of guava. See this: https://github.com/google/guava/commit/fd0cbc2c5c90e85fb22c8e86ea19630032090943 The constructors were changed to package-private in version 17 or 18. The shaded hbase-client may be a solution: http://mvnrepository.com/artifact/org.apache.hbase/hbase-shaded-client But it seems it is 1.1-only. So if you still want to use hbase 1.0, you could try explicitly adding a guava dependency with a version lower than 17 in your pom.xml (see the sketch after this message). And finally, this is an hbase usage question. Next time please send email to the hbase-user or hbase-dev mailing list first. If we confirm something could be done in hbase then we could open an issue here. Thanks. I'm using play framework, creating a sample Hbase project. And I keep getting this error: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.hbase.zookeeper.MetaTableLocator - Key: HBASE-14126 URL: https://issues.apache.org/jira/browse/HBASE-14126 Project: HBase Issue Type: Task Environment: Ubuntu; Play Framework 2.4.2; Hbase 1.0 Reporter: Lesley Cheung Assignee: Lesley Cheung The simple Hbase project works in Maven. But when I use play framework, it keeps showing this error. I modified the lib dependency many times, but I just can't eliminate this error. Someone please help me!
java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.hbase.zookeeper.MetaTableLocator
        at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:434)
        at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:60)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1123)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1110)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1262)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1126)
        at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:369)
        at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:320)
        at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
        at org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:183)
        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1496)
        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:1119)
        at utils.BigQueue.run(BigQueue.java:68)
        at akka.actor.LightArrayRevolverScheduler$$anon$2$$anon$1.run(Scheduler.scala:242)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
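For example, an explicit pin along these lines might work (the exact version shown is an assumption, not a tested recommendation; any guava release below 17 still has the public Stopwatch constructor that MetaTableLocator calls):
{code:title=pom.xml}
<!-- Illustrative only: pin guava below 17 so that
     com.google.common.base.Stopwatch.<init>() is still public. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>16.0.1</version>
</dependency>
{code}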
[jira] [Commented] (HBASE-14004) [Replication] Inconsistency between Memstore and WAL may result in data in remote cluster that is not in the origin
[ https://issues.apache.org/jira/browse/HBASE-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977769#comment-14977769 ] Duo Zhang commented on HBASE-14004: --- The problem here does not only affect replication. {quote} As a result, the handler will rollback the Memstore and the later flushed HFile will also skip this record. {quote} What if the regionserver crashes before flushing the HFile? I think the record will come back, since it has already been persisted in the WAL. Adding a marker may be a solution, but you need to check the marker everywhere when replaying the WAL, and you still need to deal with failures when placing the marker... I do not think it is easy to do... The basic problem here is that we may have an inconsistency between the memstore and the WAL when we fail to sync the WAL. A simple solution is killing the regionserver when we fail to sync the WAL, which means we never roll back the memstore but instead reconstruct it from the WAL (a sketch follows this message). That way we can make sure there is no difference between the memstore and the WAL. If we want to keep the regionserver alive when syncing fails, then I think we need to find out the real result of the sync operation. Maybe we could close the WAL file and check its length? Of course, if we have lost the connection to the namenode, I think there is no simple solution other than killing the regionserver... Thanks.
> [Replication] Inconsistency between Memstore and WAL may result in data in remote cluster that is not in the origin
> ---
>
> Key: HBASE-14004
> URL: https://issues.apache.org/jira/browse/HBASE-14004
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Reporter: He Liangliang
> Priority: Critical
> Labels: replication, wal
>
> Looks like the current write path can cause inconsistency between memstore/hfile and WAL which causes the slave cluster to have more data than the master cluster.
> The simplified write path looks like:
> 1. insert record into Memstore
> 2. write record to WAL
> 3. sync WAL
> 4. rollback Memstore if 3 fails
> It's possible that the HDFS sync RPC call fails, but the data is already (maybe partially) transported to the DNs, which finally gets persisted. As a result, the handler will roll back the Memstore and the later flushed HFile will also skip this record.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
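A minimal sketch of the 'kill the regionserver' option described above (all names here are hypothetical stand-ins, not the actual HBase write path):
{code:title=SyncOrAbortSketch.java}
import java.io.IOException;

public class SyncOrAbortSketch {
  interface Wal {
    void sync(long txid) throws IOException;
  }

  private final Wal wal;

  SyncOrAbortSketch(Wal wal) {
    this.wal = wal;
  }

  void syncOrDie(long txid) {
    try {
      wal.sync(txid);
    } catch (IOException e) {
      // Do NOT roll back the memstore: the hflush may have succeeded on the
      // datanodes even though the RPC failed at the client, so the edit can
      // already be durable in the WAL. Abort and rebuild the memstore from
      // the WAL instead, so the two can never diverge.
      abort("WAL sync failed; memstore and WAL may diverge", e);
    }
  }

  private void abort(String why, Throwable cause) {
    // Stand-in for HRegionServer.abort(String, Throwable).
    throw new RuntimeException(why, cause);
  }
}
{code}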
[jira] [Commented] (HBASE-14004) [Replication] Inconsistency between Memstore and WAL may result in data in remote cluster that is not in the origin
[ https://issues.apache.org/jira/browse/HBASE-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978108#comment-14978108 ] Duo Zhang commented on HBASE-14004: --- [~carp84] This is an inherent problem of RPC-based systems due to temporary network failures. HBase uses {{hflush}} to sync the WAL. I do not know the details of whether hflush will ask the namenode to update the length, but in any case, the last RPC call could fail at the client side but succeed at the server side (a network failure while writing the return value back). And sure, this should be a bug in HBase. I checked the code: if an exception is thrown from hflush, {{FSHLog.SyncRunner}} simply passes it to the upper layer. So it could happen that the hflush succeeded in HDFS, but HBase thinks it failed, which causes the inconsistency. I think we need to find a way to make sure whether the WAL is actually persisted in HDFS. And if DFSClient already retries, then I think killing the regionserver is enough? Any suggestions [~carp84]? Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12959) Compact never end when table's dataBlockEncoding using PREFIX_TREE
[ https://issues.apache.org/jira/browse/HBASE-12959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997903#comment-14997903 ] Duo Zhang commented on HBASE-12959: --- I sent an email to [~bdifn] a long time ago to confirm whether HBASE-12817 solves this problem, but I have not received a reply yet... And PREFIX_TREE encoding has been used widely in our project for a long time; I haven't found any other problems since HBASE-12817. Hope this could help. [~wang_yongqiang]
> Compact never end when table's dataBlockEncoding using PREFIX_TREE
> ---
>
> Key: HBASE-12959
> URL: https://issues.apache.org/jira/browse/HBASE-12959
> Project: HBase
> Issue Type: Bug
> Components: hbase
> Affects Versions: 0.98.7
> Environment: hbase 0.98.7
> hadoop 2.5.1
> Reporter: wuchengzhi
> Priority: Critical
> Attachments: PrefixTreeCompact.java, txtfile-part1.txt.gz, txtfile-part2.txt.gz, txtfile-part4.txt.gz, txtfile-part5.txt.gz, txtfile-part6.txt.gz, txtfile-part7.txt.gz
>
> I upgraded the hbase from 0.96.1.1 to 0.98.7 and hadoop from 2.2.0 to 2.5.1, some table encoding using prefix-tree was abnormal for compacting, the gui shows the table's Compaction status is MAJOR_AND_MINOR(MAJOR) all the time.
> in the regionserver dump, there are some logs as below:
> Tasks:
> ===
> Task: Compacting info in PREFIX_NOT_COMPACT,,1421954285670.41ef60e2c221772626e141d5080296c5.
> Status: RUNNING:Compacting store info
> Running for 1097s (on the site running more than 3 days)
>
> Thread 197 (regionserver60020-smallCompactions-1421954341530):
> State: RUNNABLE
> Blocked count: 7
> Waited count: 3
> Stack:
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.followFan(PrefixTreeArrayScanner.java:329)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArraySearcher.positionAtOrAfter(PrefixTreeArraySearcher.java:149)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArraySearcher.seekForwardToOrAfter(PrefixTreeArraySearcher.java:183)
> org.apache.hadoop.hbase.codec.prefixtree.PrefixTreeSeeker.seekToOrBeforeUsingPositionAtOrAfter(PrefixTreeSeeker.java:199)
> org.apache.hadoop.hbase.codec.prefixtree.PrefixTreeSeeker.seekToKeyInBlock(PrefixTreeSeeker.java:162)
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.loadBlockAndSeekToKey(HFileReaderV2.java:1172)
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:573)
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:313)
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(KeyValueHeap.java:257)
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:697)
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683)
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533)
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:222)
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:77)
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:110)
> org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1099)
> org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1482)
> Thread 177 (regionserver60020-smallCompactions-1421954314809):
> State: RUNNABLE
> Blocked count: 40
> Waited count: 60
> Stack:
> org.apache.hadoop.hbase.codec.prefixtree.decode.column.ColumnReader.populateBuffer(ColumnReader.java:81)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.populateQualifier(PrefixTreeArrayScanner.java:471)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.populateNonRowFields(PrefixTreeArrayScanner.java:452)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.nextRow(PrefixTreeArrayScanner.java:226)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.advance(PrefixTreeArrayScanner.java:208)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArraySearcher.positionAtQualifierTimestamp(PrefixTreeArraySearcher.java:244)
[jira] [Commented] (HBASE-12959) Compact never end when table's dataBlockEncoding using PREFIX_TREE
[ https://issues.apache.org/jira/browse/HBASE-12959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998232#comment-14998232 ] Duo Zhang commented on HBASE-12959: --- No, I haven't met this problem yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only
[ https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003595#comment-15003595 ] Duo Zhang commented on HBASE-14790: --- HTTP/2 has its own problems: we haven't finished the read path yet, and the write protocol is much more complex than the read protocol... And also, if we plan to do it in HDFS, I think we should make it more general, and it is better to design a good event-driven FileSystem interface at the beginning. I do not think either of them is easy... So my plan is to implement a simple version in HBase first, compatible only with hadoop 2.x, make sure it has some benefits, and actually ship it with HBase. Then we could start implementing a more general and more powerful event-driven FileSystem in HDFS. When the new FileSystem is out, we could move HBase to it and drop the old simple version. What do you think? [~wheat9] Thanks.
> Implement a new DFSOutputStream for logging WAL only
> ---
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
> Issue Type: Improvement
> Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all purposes. But in fact, we do not need most of the features if we only want to log WAL. For example, we do not need pipeline recovery since we could just close the old logger and open a new one. And also, we do not need to write multiple blocks since we could also open a new logger if the old file is too large.
> And the most important thing is that it is hard to handle all the corner cases to avoid data loss or data inconsistency (such as HBASE-14004) when using the original DFSOutputStream, due to its complicated logic. And the complicated logic also forces us to use some magical tricks to increase performance. For example, we need to use multiple threads to call {{hflush}} when logging, and now we use 5 threads. But why 5, not 10 or 100?
> So here, I propose we should implement our own {{DFSOutputStream}} when logging WAL. For correctness, and also for performance.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14806) Missing sources.jar for several modules when building HBase
[ https://issues.apache.org/jira/browse/HBASE-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14806: -- Attachment: HBASE-14806.patch It seems only hbase-common and hbase-external-blockcache are missing sources.jar. I modified the pom.xml, but I'm not sure whether the output sources.jar contains the correct LICENSE files. [~busbey] Could you please help verify the sources.jar generated with this patch? Thanks.
> Missing sources.jar for several modules when building HBase
> ---
>
> Key: HBASE-14806
> URL: https://issues.apache.org/jira/browse/HBASE-14806
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Duo Zhang
> Attachments: HBASE-14806.patch
>
> Introduced by HBASE-14085. The problem is, for example, in hbase-common/pom.xml, we have
> {code:title=pom.xml}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-source-plugin</artifactId>
>   <configuration>
>     <excludeResources>true</excludeResources>
>     <includes>
>       <include>src/main/java</include>
>       <include>${project.build.outputDirectory}/META-INF</include>
>     </includes>
>   </configuration>
> </plugin>
> {code}
> But in fact, the path inside the {{include}} tag is relative to the source directories, not the project directory. So the maven-source-plugin always ends with
> {noformat}
> No sources in project. Archive not created.
> {noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
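To make the failure mode concrete: maven-source-plugin evaluates {{include}} patterns relative to each source root, not the project directory, so a configuration whose pattern can actually match something looks roughly like this (an illustration of the semantics, not the committed HBASE-14806 patch):
{code:title=pom.xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-source-plugin</artifactId>
  <configuration>
    <includes>
      <!-- Matched against paths under each source directory (e.g.
           src/main/java), so the literal pattern "src/main/java" never
           matches anything, while a source-relative glob like this does. -->
      <include>**/*.java</include>
    </includes>
  </configuration>
</plugin>
{code}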
[jira] [Updated] (HBASE-14806) Missing sources.jar for several modules when building HBase
[ https://issues.apache.org/jira/browse/HBASE-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14806: -- Assignee: Duo Zhang Affects Version/s: 2.0.0 Status: Patch Available (was: Open) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14806) Missing sources.jar for several modules when building HBase
Duo Zhang created HBASE-14806: - Summary: Missing sources.jar for several modules when building HBase Key: HBASE-14806 URL: https://issues.apache.org/jira/browse/HBASE-14806 Project: HBase Issue Type: Bug Reporter: Duo Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only
[ https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003469#comment-15003469 ] Duo Zhang commented on HBASE-14790: --- The new implementation will be event-driven, which means we can use a callback on sync, so we do not need multiple sync threads any more (a sketch follows this message). So I think it is a big project if we want to implement this in HDFS, since it introduces a new style of API. Even if we only implement a new writer first, I think it is still very important to design a good general interface up front. So I think we could implement this in HBase first and collect some perf results. Then we could start talking about moving it into HDFS. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
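A sketch of what the event-driven, callback-based sync could look like from the caller's side (pure illustration; none of these interfaces exist in HBase or HDFS):
{code:title=AsyncWalSketch.java}
import java.util.concurrent.CompletableFuture;

public class AsyncWalSketch {
  interface AsyncWalWriter {
    CompletableFuture<Long> append(byte[] entry); // completes with a sequence id
    CompletableFuture<Void> sync();               // completed by the I/O event loop
  }

  void logAndAck(AsyncWalWriter wal, byte[] entry, Runnable ackClient) {
    wal.append(entry)
        .thenCompose(seqId -> wal.sync())
        // Runs when the event loop reports the sync complete; no pool of
        // sync threads is parked inside hflush in the meantime.
        .thenRun(ackClient);
  }
}
{code}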
[jira] [Commented] (HBASE-14806) Missing sources.jar for several modules when building HBase
[ https://issues.apache.org/jira/browse/HBASE-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005116#comment-15005116 ] Duo Zhang commented on HBASE-14806: --- The META-INF directory in the generated sources.jar contains the following files. Enough? [~busbey]
{noformat}
" zip.vim version v27
" Browsing zipfile /home/zhangduo/hbase/code/hbase-common/target/hbase-common-2.0.0-SNAPSHOT-sources.jar
" Select a file with cursor and press ENTER

META-INF/
META-INF/MANIFEST.MF
META-INF/LICENSE
META-INF/NOTICE
META-INF/DEPENDENCIES

" zip.vim version v27
" Browsing zipfile /home/zhangduo/hbase/code/hbase-external-blockcache/target/hbase-external-blockcache-2.0.0-SNAPSHOT-sources.jar
" Select a file with cursor and press ENTER

META-INF/
META-INF/MANIFEST.MF
META-INF/LICENSE
META-INF/NOTICE
META-INF/DEPENDENCIES
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14806) Missing sources.jar for several modules when building HBase
[ https://issues.apache.org/jira/browse/HBASE-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14806: -- Affects Version/s: 1.1.3 1.3.0 1.2.0 1.0.3 0.98.16 Fix Version/s: 1.0.4 0.98.17 1.1.4 1.3.0 1.2.0 2.0.0 > Missing sources.jar for several modules when building HBase > --- > > Key: HBASE-14806 > URL: https://issues.apache.org/jira/browse/HBASE-14806 > Project: HBase > Issue Type: Bug >Affects Versions: 2.0.0, 1.2.0, 1.3.0, 1.0.3, 1.1.3, 0.98.16 >Reporter: Duo Zhang >Assignee: Duo Zhang > Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.4, 0.98.17, 1.0.4 > > Attachments: HBASE-14806.patch > > > Introduced by HBASE-14085. The problem is, for example, in > hbase-common/pom.xml, we have > {code:title=pom.xml} > <plugin> > <groupId>org.apache.maven.plugins</groupId> > <artifactId>maven-source-plugin</artifactId> > <configuration> > <includePom>true</includePom> > <includes> > <include>src/main/java</include> > <include>${project.build.outputDirectory}/META-INF</include> > </includes> > </configuration> > </plugin> > {code} > But in fact, the path inside the {{<include>}} tag is relative to the source > directories, not the project directory. So the maven-source-plugin always ends > with > {noformat} > No sources in project. Archive not created. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14806) Missing sources.jar for several modules when building HBase
[ https://issues.apache.org/jira/browse/HBASE-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14806: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Pushed to all branches. Thanks [~busbey] for reviewing. > Missing sources.jar for several modules when building HBase > --- > > Key: HBASE-14806 > URL: https://issues.apache.org/jira/browse/HBASE-14806 > Project: HBase > Issue Type: Bug >Affects Versions: 2.0.0, 1.2.0, 1.3.0, 1.0.3, 1.1.3, 0.98.16 >Reporter: Duo Zhang >Assignee: Duo Zhang > Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.4, 0.98.17, 1.0.4 > > Attachments: HBASE-14806.patch > > > Introduced by HBASE-14085. The problem is, for example, in > hbase-common/pom.xml, we have > {code:title=pom.xml} > <plugin> > <groupId>org.apache.maven.plugins</groupId> > <artifactId>maven-source-plugin</artifactId> > <configuration> > <includePom>true</includePom> > <includes> > <include>src/main/java</include> > <include>${project.build.outputDirectory}/META-INF</include> > </includes> > </configuration> > </plugin> > {code} > But in fact, the path inside the {{<include>}} tag is relative to the source > directories, not the project directory. So the maven-source-plugin always ends > with > {noformat} > No sources in project. Archive not created. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only
[ https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998397#comment-14998397 ] Duo Zhang commented on HBASE-14790: --- The root cause of HBASE-14004 is that HBase and HDFS may not reach an agreement on the length of the WAL file. {{DFSOutputStream}} itself does a lot of work to repair the inconsistency when an error occurs, but if all of those approaches fail, we have no idea what to do next... In fact, the simplest way to reach an agreement on file length is to close the file, and I think that is enough for logging the WAL. So a simple solution is: when logging fails, try closing the file (just make a call to the namenode with the confirmed length); if that still fails, retry until it succeeds, and set a flag to tell the upper layer the WAL is broken so we can reject write requests immediately to prevent OOM. Thanks. > Implement a new DFSOutputStream for logging WAL only > > > Key: HBASE-14790 > URL: https://issues.apache.org/jira/browse/HBASE-14790 > Project: HBase > Issue Type: Improvement >Reporter: Duo Zhang > > The original {{DFSOutputStream}} is very powerful and aims to serve all > purposes. But in fact, we do not need most of the features if we only want to > log the WAL. For example, we do not need pipeline recovery since we could just > close the old logger and open a new one. And also, we do not need to write > multiple blocks since we could also open a new logger if the old file is too > large. > And the most important thing is that it is hard to handle all the corner > cases to avoid data loss or data inconsistency (such as HBASE-14004) when > using the original DFSOutputStream due to its complicated logic. And the > complicated logic also forces us to use some magical tricks to increase > performance. For example, we need to use multiple threads to call {{hflush}} > when logging, and now we use 5 threads. But why 5, not 10 or 100? > So here, I propose we should implement our own {{DFSOutputStream}} for > logging the WAL. For correctness, and also for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
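A minimal sketch of that close-on-error rule (an editor's illustration with hypothetical names, not the eventual implementation): mark the WAL broken so the upper layer rejects writes immediately, then keep retrying the close call until HBase and HDFS agree on the file length:
{code:title=BrokenWALRecovery.java (illustration only)}
import java.io.IOException;

// All names here are hypothetical stand-ins, not HBase or HDFS APIs.
class BrokenWALRecovery {

  /** Stand-in for completing the file at the namenode with a known length. */
  interface FileCompleter {
    void completeFile(long confirmedLength) throws IOException;
  }

  private volatile boolean walBroken; // upper layer fails writes fast on this
  private final long retryIntervalMs = 1000;

  boolean isWalBroken() {
    return walBroken;
  }

  /** On a logging error: flag the WAL broken, then retry close until it succeeds. */
  void recoverOnError(FileCompleter completer, long confirmedLength)
      throws InterruptedException {
    walBroken = true; // reject incoming writes immediately to prevent OOM
    while (true) {
      try {
        completer.completeFile(confirmedLength); // agree on length by closing
        return;
      } catch (IOException e) {
        Thread.sleep(retryIntervalMs); // retry forever, as described above
      }
    }
  }
}
{code}
Whether to retry forever here or give up after a deadline and abort the regionserver is exactly the trade-off discussed in the next comment.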
[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only
[ https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998407#comment-14998407 ] Duo Zhang commented on HBASE-14790: --- {quote} As for this, IMO we can set a limit, if exceed the limits, we can think HDFS is broken and close the RS directly. wdyt? {quote} Agree. But the timeout value should be carefully chosen. > Implement a new DFSOutputStream for logging WAL only > > > Key: HBASE-14790 > URL: https://issues.apache.org/jira/browse/HBASE-14790 > Project: HBase > Issue Type: Improvement >Reporter: Duo Zhang > > The original {{DFSOutputStream}} is very powerful and aims to serve all > purposes. But in fact, we do not need most of the features if we only want to > log WAL. For example, we do not need pipeline recovery since we could just > close the old logger and open a new one. And also, we do not need to write > multiple blocks since we could also open a new logger if the old file is too > large. > And the most important thing is that, it is hard to handle all the corner > cases to avoid data loss or data inconsistency(such as HBASE-14004) when > using original DFSOutputStream due to its complicated logic. And the > complicated logic also force us to use some magical tricks to increase > performance. For example, we need to use multiple threads to call {{hflush}} > when logging, and now we use 5 threads. But why 5 not 10 or 100? > So here, I propose we should implement our own {{DFSOutputStream}} when > logging WAL. For correctness, and also for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only
[ https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1417#comment-1417 ] Duo Zhang commented on HBASE-14790: --- {{DFSOutputStream}} has pipeline recovery, so it is hard to say what the actual state is if the recovery fails, and its {{close}} method also does a lot more than just call {{completeFile}}. So it is hard to control the behavior exactly. Thanks. > Implement a new DFSOutputStream for logging WAL only > > > Key: HBASE-14790 > URL: https://issues.apache.org/jira/browse/HBASE-14790 > Project: HBase > Issue Type: Improvement >Reporter: Duo Zhang > > The original {{DFSOutputStream}} is very powerful and aims to serve all > purposes. But in fact, we do not need most of the features if we only want to > log the WAL. For example, we do not need pipeline recovery since we could just > close the old logger and open a new one. And also, we do not need to write > multiple blocks since we could also open a new logger if the old file is too > large. > And the most important thing is that it is hard to handle all the corner > cases to avoid data loss or data inconsistency (such as HBASE-14004) when > using the original DFSOutputStream due to its complicated logic. And the > complicated logic also forces us to use some magical tricks to increase > performance. For example, we need to use multiple threads to call {{hflush}} > when logging, and now we use 5 threads. But why 5, not 10 or 100? > So here, I propose we should implement our own {{DFSOutputStream}} for > logging the WAL. For correctness, and also for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only
Duo Zhang created HBASE-14790: - Summary: Implement a new DFSOutputStream for logging WAL only Key: HBASE-14790 URL: https://issues.apache.org/jira/browse/HBASE-14790 Project: HBase Issue Type: Improvement Reporter: Duo Zhang The original {{DFSOutputStream}} is very powerful and aims to serve all purposes. But in fact, we do not need most of the features if we only want to log the WAL. For example, we do not need pipeline recovery since we could just close the old logger and open a new one. And also, we do not need to write multiple blocks since we could also open a new logger if the old file is too large. And the most important thing is that it is hard to handle all the corner cases to avoid data loss or data inconsistency (such as HBASE-14004) when using the original DFSOutputStream due to its complicated logic. And the complicated logic also forces us to use some magical tricks to increase performance. For example, we need to use multiple threads to call {{hflush}} when logging, and now we use 5 threads. But why 5, not 10 or 100? So here, I propose we should implement our own {{DFSOutputStream}} for logging the WAL. For correctness, and also for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990744#comment-14990744 ] Duo Zhang commented on HBASE-14759: --- {code} this.syncRunnerIndex = (this.syncRunnerIndex + 1) % this.syncRunners.length; {code} I think this will update syncRunnerIndex? [~enis] > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Reporter: Duo Zhang >Assignee: Duo Zhang > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass Integer.MIN_VALUE in since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running for enough time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
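To make the failure mode concrete, a small self-contained demo (an editor's illustration, not HBase code) of why the {{Math.abs}}-based selection can produce a negative index, and how the bounded increment quoted above avoids overflow entirely:
{code:title=SyncRunnerIndexDemo.java (illustration only)}
public class SyncRunnerIndexDemo {
  public static void main(String[] args) {
    int numRunners = 5;

    // Buggy pattern: the counter eventually wraps past Integer.MAX_VALUE,
    // and Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE, so the
    // modulo can go negative -> ArrayIndexOutOfBoundsException.
    int counter = Integer.MAX_VALUE;
    counter++; // overflows to Integer.MIN_VALUE
    System.out.println(Math.abs(counter) % numRunners); // prints -3

    // Fixed pattern, matching the line quoted above: the stored index stays
    // inside [0, numRunners), so it can never overflow or go negative.
    int syncRunnerIndex = 0;
    syncRunnerIndex = (syncRunnerIndex + 1) % numRunners;
    System.out.println(syncRunnerIndex); // prints 1
    // Math.floorMod(counter, numRunners) would also stay non-negative.
  }
}
{code}
The -3 printed here matches the ArrayIndexOutOfBoundsException index shown in the testcase output further down.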
[jira] [Updated] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14759: -- Affects Version/s: 1.2.0 2.0.0 1.0.2 1.1.2 Fix Version/s: 1.1.3 1.0.3 1.2.0 2.0.0 Set fix versions; 0.98 does not have this problem. > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Affects Versions: 2.0.0, 1.0.2, 1.2.0, 1.1.2 >Reporter: Duo Zhang >Assignee: Duo Zhang > Fix For: 2.0.0, 1.2.0, 1.0.3, 1.1.3 > > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass in Integer.MIN_VALUE, since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running long enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994772#comment-14994772 ] Duo Zhang commented on HBASE-14759: --- OK, Thanks. Let me commit this. > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Affects Versions: 2.0.0, 1.0.2, 1.2.0, 1.1.2, 1.3.0 >Reporter: Duo Zhang >Assignee: Duo Zhang > Fix For: 2.0.0, 1.2.0, 1.3.0, 1.0.3, 1.1.3 > > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass Integer.MIN_VALUE in since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running for enough time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14759: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Pushed to 1.0+. Thanks [~yuzhih...@gmail.com] and [~enis] for reviewing. > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Affects Versions: 2.0.0, 1.0.2, 1.2.0, 1.1.2, 1.3.0 >Reporter: Duo Zhang >Assignee: Duo Zhang > Fix For: 2.0.0, 1.2.0, 1.3.0, 1.0.3, 1.1.3 > > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass Integer.MIN_VALUE in since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running for enough time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14759: -- Affects Version/s: 1.3.0 Fix Version/s: 1.3.0 > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Affects Versions: 2.0.0, 1.0.2, 1.2.0, 1.1.2, 1.3.0 >Reporter: Duo Zhang >Assignee: Duo Zhang > Fix For: 2.0.0, 1.2.0, 1.3.0, 1.0.3, 1.1.3 > > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass Integer.MIN_VALUE in since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running for enough time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
Duo Zhang created HBASE-14759: - Summary: Avoid using Math.abs when selecting SyncRunner in FSHLog Key: HBASE-14759 URL: https://issues.apache.org/jira/browse/HBASE-14759 Project: HBase Issue Type: Bug Components: wal Reporter: Duo Zhang {code:title=FSHLog.java} int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; try { this.syncRunners[index].offer(sequence, this.syncFutures, this.syncFuturesCount); } catch (Exception e) { // Should NEVER get here. requestLogRoll(); this.exception = new DamagedWALException("Failed offering sync", e); } {code} Math.abs will return Integer.MIN_VALUE if you pass in Integer.MIN_VALUE, since the actual absolute value of Integer.MIN_VALUE is out of range. I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the regionserver running long enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989229#comment-14989229 ] Duo Zhang commented on HBASE-14759: --- Added a simple testcase in TestFSHLog: {code} @Test public void testSyncRunnerIndexOverflow() throws IOException, NoSuchFieldException, SecurityException, IllegalArgumentException, IllegalAccessException { final String name = "testSyncRunnerIndexOverflow"; FSHLog log = new FSHLog(fs, FSUtils.getRootDir(conf), name, HConstants.HREGION_OLDLOGDIR_NAME, conf, null, true, null, null); Field ringBufferEventHandlerField = FSHLog.class.getDeclaredField("ringBufferEventHandler"); ringBufferEventHandlerField.setAccessible(true); FSHLog.RingBufferEventHandler ringBufferEventHandler = (FSHLog.RingBufferEventHandler) ringBufferEventHandlerField.get(log); Field syncRunnerIndexField = FSHLog.RingBufferEventHandler.class.getDeclaredField("syncRunnerIndex"); syncRunnerIndexField.setAccessible(true); syncRunnerIndexField.set(ringBufferEventHandler, Integer.MAX_VALUE); HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("t1")).addFamily(new HColumnDescriptor("row")); HRegionInfo hri = new HRegionInfo(htd.getTableName(), HConstants.EMPTY_START_ROW, HConstants.EMPTY_END_ROW); MultiVersionConcurrencyControl mvcc = new MultiVersionConcurrencyControl(); addEdits(log, hri, htd, 1, mvcc); addEdits(log, hri, htd, 1, mvcc); } {code} It fails with: {noformat} org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: On sync at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1782) at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1) at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Failed offering sync at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1776) ... 5 more Caused by: java.lang.ArrayIndexOutOfBoundsException: -3 at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1772) ... 5 more {noformat} > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Reporter: Duo Zhang > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass in Integer.MIN_VALUE, since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running long enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14759: -- Attachment: HBASE-14759.patch > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Reporter: Duo Zhang > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass Integer.MIN_VALUE in since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running for enough time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14759: -- Assignee: Duo Zhang Status: Patch Available (was: Open) > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Reporter: Duo Zhang >Assignee: Duo Zhang > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass Integer.MIN_VALUE in since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running for enough time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14586) Use a maven profile to run Jacoco analysis
[ https://issues.apache.org/jira/browse/HBASE-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956135#comment-14956135 ] Duo Zhang commented on HBASE-14586: --- +1 > Use a maven profile to run Jacoco analysis > -- > > Key: HBASE-14586 > URL: https://issues.apache.org/jira/browse/HBASE-14586 > Project: HBase > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Andrew Wang >Assignee: Andrew Wang >Priority: Minor > Attachments: hbase-14586.001.patch, hbase-14586.001.patch, > hbase-14586.002.patch > > > The pom.xml has a line like this for the Surefire argLine, which has an extra > ${argLine} reference. Recommend changes like this: > {noformat} > -${hbase-surefire.argLine} ${argLine} > +${hbase-surefire.argLine} > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
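For context, a hedged sketch of the profile approach named in the title (an editor's illustration, not the attached patch): Jacoco is typically gated behind a Maven profile whose {{prepare-agent}} goal contributes the agent through the argLine property, so the main Surefire argLine no longer needs a hard-coded ${argLine} reference:
{code:title=pom.xml (illustration)}
<profile>
  <id>runJacoco</id> <!-- hypothetical profile id -->
  <build>
    <plugins>
      <plugin>
        <groupId>org.jacoco</groupId>
        <artifactId>jacoco-maven-plugin</artifactId>
        <executions>
          <!-- prepare-agent sets the argLine property so the coverage
               agent is injected into the forked test JVMs -->
          <execution>
            <id>prepare-agent</id>
            <goals><goal>prepare-agent</goal></goals>
          </execution>
          <!-- write the coverage report after the tests have run -->
          <execution>
            <id>report</id>
            <phase>verify</phase>
            <goals><goal>report</goal></goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
{code}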
[jira] [Commented] (HBASE-13082) Coarsen StoreScanner locks to RegionScanner
[ https://issues.apache.org/jira/browse/HBASE-13082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964555#comment-14964555 ] Duo Zhang commented on HBASE-13082: --- I mean maybe we could have a fast path without locking for normal reads, since for most scan operations there is no flush involved. And if a flush happens, we change back to using a lock. I think this is something like a biased lock? > Coarsen StoreScanner locks to RegionScanner > --- > > Key: HBASE-13082 > URL: https://issues.apache.org/jira/browse/HBASE-13082 > Project: HBase > Issue Type: Bug >Reporter: Lars Hofhansl >Assignee: ramkrishna.s.vasudevan > Attachments: 13082-test.txt, 13082-v2.txt, 13082-v3.txt, > 13082-v4.txt, 13082.txt, 13082.txt, HBASE-13082.pdf, HBASE-13082_1_WIP.patch, > HBASE-13082_2_WIP.patch, HBASE-13082_3.patch, HBASE-13082_4.patch, gc.png, > gc.png, gc.png, hits.png, next.png, next.png > > > Continuing where HBASE-10015 left off. > We can avoid locking (and memory fencing) inside StoreScanner by deferring to > the lock already held by the RegionScanner. > In tests this shows quite a scan improvement and reduced CPU (the fences make > the cores wait for memory fetches). > There are some drawbacks too: > * All calls to RegionScanner need to remain synchronized > * Implementors of coprocessors need to be diligent in following the locking > contract. For example Phoenix does not lock RegionScanner.nextRaw(), as > required in the documentation (not picking on Phoenix, this one is my fault > as I told them it's OK) > * possible starving of flushes and compactions with heavy read load. > RegionScanner operations would keep getting the locks and the > flushes/compactions would not be able to finalize the set of files. > I'll have a patch soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
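A rough sketch of that fast-path idea using the JDK's {{StampedLock}} (an editor's illustration of the general technique, not HBase code or the eventual patch): readers try an optimistic, lock-free read first and only fall back to a real read lock when a concurrent flush invalidates their stamp:
{code:title=FastPathScan.java (illustration only)}
import java.util.concurrent.locks.StampedLock;

// Illustration of "fast path unless a flush intervenes"; the field below is
// a hypothetical stand-in for whatever state a flush swaps out.
class FastPathScan {
  private final StampedLock lock = new StampedLock();
  private long memstoreSnapshot; // stand-in for scanner-visible state

  long read() {
    long stamp = lock.tryOptimisticRead(); // lock-free fast path
    long value = memstoreSnapshot;
    if (!lock.validate(stamp)) { // a flush raced with this read
      stamp = lock.readLock();   // fall back to real locking
      try {
        value = memstoreSnapshot;
      } finally {
        lock.unlockRead(stamp);
      }
    }
    return value;
  }

  void flush(long newSnapshot) {
    long stamp = lock.writeLock(); // flush takes the exclusive slow path
    try {
      memstoreSnapshot = newSnapshot;
    } finally {
      lock.unlockWrite(stamp);
    }
  }
}
{code}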
[jira] [Commented] (HBASE-13082) Coarsen StoreScanner locks to RegionScanner
[ https://issues.apache.org/jira/browse/HBASE-13082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964481#comment-14964481 ] Duo Zhang commented on HBASE-13082: --- {quote} On flush, what if we let all Scans complete before letting go the snapshot? More memory pressure. Simpler implementation. {quote} How long a {{RegionScanner}} keeps running sometimes depends on the user of HBase. Think of a {{Filter}} that matches nothing: a {{RegionScanner}} could scan the whole region if that {{Filter}} is applied. So I do not think holding a ref count on the snapshot is a good idea. So I think resetting the scanner heap is necessary when there is a flush, and the problem is how to do it without expensive overhead. Maybe something like biased locking? Thanks. > Coarsen StoreScanner locks to RegionScanner > --- > > Key: HBASE-13082 > URL: https://issues.apache.org/jira/browse/HBASE-13082 > Project: HBase > Issue Type: Bug >Reporter: Lars Hofhansl >Assignee: ramkrishna.s.vasudevan > Attachments: 13082-test.txt, 13082-v2.txt, 13082-v3.txt, > 13082-v4.txt, 13082.txt, 13082.txt, HBASE-13082.pdf, HBASE-13082_1_WIP.patch, > HBASE-13082_2_WIP.patch, HBASE-13082_3.patch, HBASE-13082_4.patch, gc.png, > gc.png, gc.png, hits.png, next.png, next.png > > > Continuing where HBASE-10015 left off. > We can avoid locking (and memory fencing) inside StoreScanner by deferring to > the lock already held by the RegionScanner. > In tests this shows quite a scan improvement and reduced CPU (the fences make > the cores wait for memory fetches). > There are some drawbacks too: > * All calls to RegionScanner need to remain synchronized > * Implementors of coprocessors need to be diligent in following the locking > contract. For example Phoenix does not lock RegionScanner.nextRaw(), as > required in the documentation (not picking on Phoenix, this one is my fault > as I told them it's OK) > * possible starving of flushes and compactions with heavy read load. > RegionScanner operations would keep getting the locks and the > flushes/compactions would not be able to finalize the set of files. > I'll have a patch soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)