[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573900#comment-14573900 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

{quote}
Rather than add a new method that does what the old getEarliestMemstoreSeqNum did, I changed getEarliestMemstoreSeqNum to be how the old version worked.
{quote}

Fine, I think it will work. But I still feel a little nervous about having two methods with the same name but different behaviors...

And I remember that when implementing HBASE-10201 and HBASE-12405, I originally wanted startCacheFlush to return the flushedSeqId. But there are two problems. First, the getNextSequenceId method is in HRegion, not in FSHLog, so a simple solution is to return NO_SEQ_NUM when flushing all stores and let HRegion call getNextSequenceId. But then comes the second problem: startCacheFlush may fail, which means we cannot start a flush, so there are three types of return values: 'sequenceId', 'choose a sequenceId by yourself', and 'give up flushing!'. I thought it was ugly to use a '-2' or a null java.lang.Long to indicate 'give up flushing' at the time, so I gave up... Maybe we could consider this solution again? getEarliestMemstoreSeqNum can be used everywhere, but startCacheFlush is restricted to the flushing scope, I think. Thanks.

Splitting WALs, we are filtering out too many edits - DATALOSS
--------------------------------------------------------------

Key: HBASE-13811
URL: https://issues.apache.org/jira/browse/HBASE-13811
Project: HBase
Issue Type: Bug
Components: wal
Affects Versions: 2.0.0, 1.2.0
Reporter: stack
Assignee: stack
Priority: Critical
Fix For: 2.0.0, 1.2.0
Attachments: 13811.branch-1.txt, 13811.branch-1.txt, 13811.txt, 13811.v2.branch-1.txt, 13811.v3.branch-1.txt, 13811.v3.branch-1.txt, 13811.v4.branch-1.txt, 13811.v5.branch-1.txt, 13811.v6.branch-1.txt, 13811.v6.branch-1.txt, HBASE-13811-v1.testcase.patch, HBASE-13811.testcase.patch

I've been running ITBLLs against branch-1 around HBASE-13616 (move of ServerShutdownHandler to pv2). I have come across an instance of dataloss. My patch for HBASE-13616 was in place, so I can only think it is the cause (but cannot see how). When we split the logs, we are skipping legit edits. Digging.
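To make the three-way result concrete, here is a minimal sketch of how it could be modeled without a magic '-2' or a null java.lang.Long. All names here (FlushPrepareResult, Outcome) are illustrative, not part of the actual HBase WAL interface:

{code}
// Illustrative only: models the three outcomes described above without
// overloading a single Long with magic values.
public class StartCacheFlushSketch {

  enum Outcome { SEQUENCE_ID, CHOOSE_YOURSELF, GIVE_UP }

  static final class FlushPrepareResult {
    final Outcome outcome;
    final long sequenceId; // meaningful only when outcome == SEQUENCE_ID

    private FlushPrepareResult(Outcome outcome, long sequenceId) {
      this.outcome = outcome;
      this.sequenceId = sequenceId;
    }

    static FlushPrepareResult of(long seqId) {
      return new FlushPrepareResult(Outcome.SEQUENCE_ID, seqId);
    }

    static FlushPrepareResult chooseYourself() {
      return new FlushPrepareResult(Outcome.CHOOSE_YOURSELF, -1L);
    }

    static FlushPrepareResult giveUp() {
      return new FlushPrepareResult(Outcome.GIVE_UP, -1L);
    }
  }

  public static void main(String[] args) {
    FlushPrepareResult r = FlushPrepareResult.chooseYourself();
    switch (r.outcome) {
      case SEQUENCE_ID:
        System.out.println("use flushedSeqId=" + r.sequenceId);
        break;
      case CHOOSE_YOURSELF:
        System.out.println("caller picks a sequenceId via getNextSequenceId");
        break;
      case GIVE_UP:
        System.out.println("abort this flush attempt");
        break;
    }
  }
}
{code}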
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571917#comment-14571917 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

{code}
flushOpSeqId = getNextSequenceId(wal);
flushedSeqId = getFlushedSequenceId(encodedRegionName, flushOpSeqId);
{code}

I think the problem is here... Before the patch, getFlushedSequenceId would return flushOpSeqId, because getEarliestMemstoreSeqNum would return HConstants.NO_SEQNUM. But now that we have modified getEarliestMemstoreSeqNum, it also considers the sequenceIds recorded in flushingSequenceIds, so it will not return HConstants.NO_SEQNUM even if we decided to flush all stores. This may cause us to replay unnecessary edits, I think. The problem is that here we only want to consider the sequenceIds in lowestUnflushedSequenceIds, so maybe a new method? Thanks. [~stack]
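A toy model of the two maps named above may make the behavior change clearer. The map names follow the comment; the real bookkeeping lives in FSHLog and is more involved than this:

{code}
import java.util.HashMap;
import java.util.Map;

// Toy model of the before/after behavior of getEarliestMemstoreSeqNum.
public class SeqIdAccountingToy {
  static final long NO_SEQNUM = -1L; // stands in for HConstants.NO_SEQNUM

  final Map<String, Long> lowestUnflushedSequenceIds = new HashMap<>();
  final Map<String, Long> flushingSequenceIds = new HashMap<>();

  // Old behavior: only not-yet-flushing edits are consulted, so a region
  // whose whole memstore is mid-flush reports NO_SEQNUM.
  long getEarliestMemstoreSeqNumOld(String region) {
    Long v = lowestUnflushedSequenceIds.get(region);
    return v == null ? NO_SEQNUM : v;
  }

  // New behavior: sequence ids parked in flushingSequenceIds count too, so
  // the same region reports a real (lower) id while a full-region flush is
  // in progress, and getFlushedSequenceId picks a value that is too small,
  // leading to unnecessary replay.
  long getEarliestMemstoreSeqNumNew(String region) {
    Long a = lowestUnflushedSequenceIds.get(region);
    Long b = flushingSequenceIds.get(region);
    if (a == null) {
      return b == null ? NO_SEQNUM : b;
    }
    return b == null ? a : Math.min(a, b);
  }

  public static void main(String[] args) {
    SeqIdAccountingToy toy = new SeqIdAccountingToy();
    toy.flushingSequenceIds.put("region1", 100L); // whole memstore mid-flush
    System.out.println(toy.getEarliestMemstoreSeqNumOld("region1")); // -1 (NO_SEQNUM)
    System.out.println(toy.getEarliestMemstoreSeqNumNew("region1")); // 100
  }
}
{code}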
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574311#comment-14574311 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

{quote}
We can actually tell when we are flushing if we should do all of the region, right? If the passed in families are null or equal in number to region stores, we are doing a full region flush so we should use the flush sequence id, the result of the getNextSequenceId call. Otherwise, we want the getEarliest for the region because we are doing a column family only flush...
{quote}

If the stores that we do not flush are all empty, then we should still use the flush sequence id returned by getNextSequenceId (getEarliest should return NO_SEQ_NUM in this case).
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574227#comment-14574227 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

{quote}
Pardon me, I don't see the problem here?
{quote}

Not a big problem... The only thing is that I do not like java.lang.Long... I can prepare a patch to show how to implement it.

{quote}
What do you mean by 'startCacheFlush is restricted'.
{quote}

'startCacheFlush' is an exact name that tells us what it will do, and I think people will only call it when they want to flush a region, so changing its behavior does less harm. 'getEarliestMemstoreSeqNum' is a more general name; people can call it from anywhere when they want the value.
[jira] [Updated] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-13811:
------------------------------

Attachment: startCacheFlush.diff

Implement flushedSeqId calculation inside startCacheFlush.
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575398#comment-14575398 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

+1 on the latest version. Hope it can pass ITBLL.
[jira] [Commented] (HBASE-13853) ITBLL data loss in HBase-1.1
[ https://issues.apache.org/jira/browse/HBASE-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576573#comment-14576573 ]

Duo Zhang commented on HBASE-13853:
-----------------------------------

HBASE-13811 ?

ITBLL data loss in HBase-1.1
----------------------------

Key: HBASE-13853
URL: https://issues.apache.org/jira/browse/HBASE-13853
Project: HBase
Issue Type: Bug
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Priority: Blocker
Fix For: 2.0.0, 1.1.1

After 2 days of log mining with the ITBLL Search tool (thanks, Stack, BTW, for the tool), the sequence of events is like this: One of the mappers from Search reportedly finds missing keys in some WAL file:

{code}
2015-06-07 06:49:24,573 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-4.openstacklocal%2C16020%2C1433636320201.default.1433637872643 (-9223372036854775808:9223372036854775807) length:132679657
2015-06-07 06:49:24,639 INFO [main] org.apache.hadoop.hbase.mapreduce.WALInputFormat: Opening reader for hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-4.openstacklocal%2C16020%2C1433636320201.default.1433637872643 (-9223372036854775808:9223372036854775807) length:132679657
2015-06-07 06:50:16,383 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Loaded keys to find: count=2384870
2015-06-07 06:50:16,607 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC/meta:prev/1433637873293/Put/vlen=16/seqid=0
2015-06-07 06:50:16,663 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC/meta:count/1433637873293/Put/vlen=8/seqid=0
2015-06-07 06:50:16,664 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC/meta:client/1433637873293/Put/vlen=71/seqid=0
2015-06-07 06:50:16,671 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\xFB\xCEP\x12\x04/meta:prev/1433637873293/Put/vlen=16/seqid=0
2015-06-07 06:50:16,672 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\xFB\xCEP\x12\x04/meta:count/1433637873293/Put/vlen=8/seqid=0
2015-06-07 06:50:16,678 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\xFB\xCEP\x12\x04/meta:client/1433637873293/Put/vlen=71/seqid=0
2015-06-07 06:50:16,679 INFO [main] org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found cell=\x1F\x80#\x95\xAE:i=)S\x01\x08 \xD6\xFF\xD5/meta:prev/1433637873293/Put/vlen=16/seqid=0
{code}

{{hlog -p}} confirms the missing keys are there in the file:

{code}
Sequence=7276 , region=b086e29f909c9790446cd457c1ea7674 at write timestamp=Sun Jun 07 00:44:33 UTC 2015
row=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=meta:prev value: \x1B\xF5\xF3^\x8D\xB4\x85\xE3\xF4wS\x9A]\x0D\xABe
row=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=meta:count value: \x00\x00\x00\x00\x002\x87l
row=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=meta:client value: Job: job_1433466891829_0002 Task: attempt_1433466891829_0002_m_03_0
{code}

When the RS gets killed by CM, the master does WAL splitting:

{code}
./hbase-hbase-regionserver-os-enis-dal-test-jun-4-2.log:2015-06-07 00:46:12,581 INFO [RS_LOG_REPLAY_OPS-os-enis-dal-test-jun-4-2:16020-0] wal.WALSplitter: Processed 2971 edits across 4 regions; edits skipped=740; log file=hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-4.openstacklocal,16020,1433636320201-splitting/os-enis-dal-test-jun-4-4.openstacklocal%2C16020%2C1433636320201.default.1433637872643, length=132679657, corrupted=false, progress failed=false
{code}

The edits with Sequence=7276 should be in this recovered.edits file:

{code}
2015-06-07 00:46:12,574 INFO [split-log-closeStream-2] wal.WALSplitter: Closed wap hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674/recovered.edits/0007276.temp (wrote 739 edits in 3950ms)
2015-06-07 00:46:12,580 INFO [split-log-closeStream-2] wal.WALSplitter: Rename hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674/recovered.edits/0007276.temp to
{code}
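As an aside on the file name in that last log line: the recovered.edits file name appears to encode the sequence id (7276) zero-padded, with a .temp suffix while the writer is open. A sketch of that shape (the padding width is inferred from the log line itself, not taken from the WALSplitter source):

{code}
// Sketch only: reproduces the "0007276.temp" shape seen in the log above.
public class RecoveredEditsName {
  static String tempName(long firstSeqId) {
    return String.format("%07d.temp", firstSeqId);
  }

  public static void main(String[] args) {
    System.out.println(tempName(7276L)); // prints 0007276.temp
  }
}
{code}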
[jira] [Commented] (HBASE-13853) ITBLL data loss in HBase-1.1
[ https://issues.apache.org/jira/browse/HBASE-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576576#comment-14576576 ]

Duo Zhang commented on HBASE-13853:
-----------------------------------

And what are the failures with DLR? DLR doesn't get the sequence id from the master, I think; it gets the sequence id from ZooKeeper, which is uploaded by the regionserver that holds the region? Thanks.
[jira] [Commented] (HBASE-13853) ITBLL data loss in HBase-1.1
[ https://issues.apache.org/jira/browse/HBASE-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576585#comment-14576585 ]

Duo Zhang commented on HBASE-13853:
-----------------------------------

I indicated in HBASE-13811 that this issue could also cause data loss on branch-1.1 even if we turn off per-CF flush, and mentioned [~ndimiduk], but he has not replied yet...
[jira] [Commented] (HBASE-13853) ITBLL data loss in HBase-1.1
[ https://issues.apache.org/jira/browse/HBASE-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576598#comment-14576598 ]

Duo Zhang commented on HBASE-13853:
-----------------------------------

I think we could backport the patch on HBASE-13811 to branch-1.1 without the cleanup and abstraction of SequenceIdAccounting? We'd better keep the WAL interface the same if possible?
[jira] [Updated] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-13811:
------------------------------

Attachment: HBASE-13811-branch-1.1.patch

[~enis] patch for 1.1.
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570013#comment-14570013 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

My fault, I haven't been watching HBase JIRA these days... Thanks [~stack] for digging. +1 on the change in getEarliestMemstoreSeqNum, but I think we should add a unit test for it?
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572388#comment-14572388 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

{quote}
I just set it to the flush sequenceid
{quote}

I think this would fail TestPerColumnFamilyFlush.testLogReplayWithDistributedLogSplit? The logic here should be (see the sketch below):

1. If we flush all stores, or the unflushed stores are all empty, just use flushOpSeqId.
2. Otherwise, find the lowest seqId of the edits in the unflushed stores and subtract 1.

The original version of getEarliestMemstoreSeqNum fits this logic, but it causes the data loss described above. So I think we still need the original version of getEarliestMemstoreSeqNum, but we could rename it to a more suitable name? Maybe calcFlushedSeqId? Thanks.
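A compact sketch of the two cases, assuming the earliest unflushed sequence id has already been looked up (illustrative names, not the patch itself):

{code}
// NO_SEQNUM means "no unflushed edits", which also covers the case where
// the stores left unflushed are all empty.
static long computeFlushedSeqId(long earliestUnflushedSeqId, long flushOpSeqId) {
  final long NO_SEQNUM = -1L; // stands in for HConstants.NO_SEQNUM
  if (earliestUnflushedSeqId == NO_SEQNUM) {
    // Case 1: flushing everything (or the rest is empty): use flushOpSeqId.
    return flushOpSeqId;
  }
  // Case 2: edits remain in unflushed stores: report just below the lowest
  // unflushed sequence id so WAL split keeps those edits.
  return earliestUnflushedSeqId - 1;
}
{code}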
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572051#comment-14572051 ]

Duo Zhang commented on HBASE-13811:
-----------------------------------

If we flush all the data in the memstore, then using the flushId is reasonable? The flushedSeqId is calculated at the prepare stage of a flush, and is actually assigned to region.maxFlushedSeqId only after we successfully commit the flush. We will not report it to the master before that assignment happens, so I think it is safe?
[jira] [Commented] (HBASE-13877) Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
[ https://issues.apache.org/jira/browse/HBASE-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579783#comment-14579783 ]

Duo Zhang commented on HBASE-13877:
-----------------------------------

In MemStoreFlusher we catch DroppedSnapshotException and abort the regionserver, and the javadoc of Region.flush says that DroppedSnapshotException is thrown when abort is required. So I think it is the caller's duty to abort the regionserver? If we abort the server inside Region.flush, do we still need to throw DroppedSnapshotException out? Thanks.

Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
--------------------------------------------------------------------

Key: HBASE-13877
URL: https://issues.apache.org/jira/browse/HBASE-13877
Project: HBase
Issue Type: Bug
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Priority: Blocker
Fix For: 2.0.0, 1.2.0, 1.1.1
Attachments: hbase-13877_v1.patch, hbase-13877_v2-branch-1.1.patch

ITBLL with 1.25B rows failed for me (and Stack, as reported in https://issues.apache.org/jira/browse/HBASE-13811?focusedCommentId=14577834&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14577834). HBASE-13811 and HBASE-13853 fixed an issue with WAL edit filtering. The root cause this time seems to be different. It is due to the procedure-based flush interrupting the flush request in case the procedure is cancelled from an exception elsewhere. This leaves the memstore snapshot intact without aborting the server. The next flush then flushes the previous memstore with the current seqId (as opposed to the seqId from the memstore snapshot). This creates an hfile with a larger seqId than what its contents are. Previous behavior in 0.98 and 1.0 (I believe) is that an interruption/exception after flush prepare will cause an RS abort.
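The caller-side contract being described, sketched with stand-in interfaces (Region and Server here are simplified placeholders, not the exact HBase types):

{code}
// The contract argued for above: Region.flush signals "abort required" via
// DroppedSnapshotException, and the *caller* performs the abort.
class DroppedSnapshotException extends Exception {
  DroppedSnapshotException(String msg) { super(msg); }
}

interface Region {
  void flush(boolean forceAll) throws DroppedSnapshotException;
}

interface Server {
  void abort(String why, Throwable cause);
}

class FlushCaller {
  static void flushOrAbort(Region region, Server server) {
    try {
      region.flush(true);
    } catch (DroppedSnapshotException e) {
      // The memstore snapshot may no longer match the WAL; replaying the
      // WAL on a fresh server is the only safe recovery.
      server.abort("Replay of WAL required. Forcing server shutdown", e);
    }
  }
}
{code}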
[jira] [Commented] (HBASE-13877) Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
[ https://issues.apache.org/jira/browse/HBASE-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579870#comment-14579870 ]

Duo Zhang commented on HBASE-13877:
-----------------------------------

{code}
i've added extra abort() statement to safeguard against cases where the caller does not handle this exception explicitly.
{code}

Then this is the caller's fault? Why not fix it at the caller side? Or we could remove the abort-regionserver semantics of DroppedSnapshotException so the caller does not need to deal with it anymore? We'd better make the rule clear, otherwise people may get confused...
[jira] [Comment Edited] (HBASE-13877) Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
[ https://issues.apache.org/jira/browse/HBASE-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579870#comment-14579870 ]

Duo Zhang edited comment on HBASE-13877 at 6/10/15 2:16 AM:
------------------------------------------------------------

{quote}
i've added extra abort() statement to safeguard against cases where the caller does not handle this exception explicitly.
{quote}

Then this is the caller's fault? Why not fix it at the caller side? Or we could remove the abort-regionserver semantics of DroppedSnapshotException so the caller does not need to deal with it anymore? We'd better make the rule clear, otherwise people may get confused...

was (Author: apache9):
{code}
i've added extra abort() statement to safeguard against cases where the caller does not handle this exception explicitly.
{code}

Then this is the caller's fault? Why not fix it at the caller side? Or we could remove the abort-regionserver semantics of DroppedSnapshotException so the caller does not need to deal with it anymore? We'd better make the rule clear, otherwise people may get confused...
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594157#comment-14594157 ]

Duo Zhang commented on HBASE-13937:
-----------------------------------

{quote}
thus guaranteeing that the region server cannot accept any more writes.
{quote}

But still readable, right? Does HBase guarantee this level of consistency?

1. A writes a row to HBase.
2. A tells B to read the row.
3. B can read the row from HBase.

If not, then I think removing recoverLease is enough here. Otherwise, we still need to make sure that the regionserver cannot process any request. And I think this discussion should be in the parent issue, so +1 on patch v3 :)

Partially revert HBASE-13172
----------------------------

Key: HBASE-13937
URL: https://issues.apache.org/jira/browse/HBASE-13937
Project: HBase
Issue Type: Sub-task
Components: Region Assignment
Reporter: Enis Soztutar
Assignee: Enis Soztutar
Fix For: 0.98.14, 1.2.0, 1.1.1, 1.3.0
Attachments: hbase-13937_v1.patch, hbase-13937_v2.patch, hbase-13937_v3.patch

HBASE-13172 is supposed to fix a UT issue, but causes other problems that the parent jira (HBASE-13605) is attempting to fix. However, HBASE-13605 patch v4 uncovers at least 2 different issues which are, to put it mildly, major design flaws in AM / RS. Regardless of 13605, the issue with 13172 is that we catch {{ServerNotRunningYetException}} from {{isServerReachable()}} and return false, which then puts the server into the {{RegionStates.deadServers}} list. Once it is in that list, we can still assign and unassign regions to the RS after it has started (because regular assignment does not check whether the server is in {{RegionStates.deadServers}}). However, after the first assign and unassign, we cannot assign the region again, since the check for the lastServer will think that the server is dead. It turns out that a proper patch for 13605 is very hard without fixing the rest of the broken AM assumptions (see HBASE-13605, HBASE-13877 and HBASE-13895 for a colorful history). For 1.1.1, I think we should just revert parts of HBASE-13172 for now.
[jira] [Updated] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-13970:
------------------------------

Attachment: HBASE-13970.patch

A simple patch that adds an incrementing AtomicInteger as the suffix of the compaction name (see the sketch below). [~ram_krish] Could you please test whether this patch works? Thanks. And also, thanks for the great digging.

NPE during compaction in trunk
------------------------------

Key: HBASE-13970
URL: https://issues.apache.org/jira/browse/HBASE-13970
Project: HBase
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
Fix For: 2.0.0
Attachments: HBASE-13970.patch

Updated trunk. Loaded the table with the PE tool. Triggered a flush to ensure all data is flushed out to disk. When the first compaction is triggered we get an NPE, and this is very easy to reproduce:

{code}
2015-06-25 21:33:46,041 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired
2015-06-25 21:33:46,051 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Flushing 1/1 column families, memstore=76.91 MB
2015-06-25 21:33:46,159 ERROR [regionserver/stobdtserver3/10.224.54.70:16040-longCompactions-1435248183945] regionserver.CompactSplitThread: Compaction failed Request = regionName=TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4., storeName=info, fileCount=3, fileSize=343.4 M (114.5 M, 114.5 M, 114.5 M), priority=3, time=7536968291719985
java.lang.NullPointerException
	at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController$ActiveCompaction.access$700(PressureAwareCompactionThroughputController.java:79)
	at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController.finish(PressureAwareCompactionThroughputController.java:238)
	at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:306)
	at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:106)
	at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:112)
	at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1202)
	at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1792)
	at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:524)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
2015-06-25 21:33:46,745 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=1534, memsize=76.9 M, hasBloomFilter=true, into tmp file hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/.tmp/942ba0831a0047a08987439e34361a0c
2015-06-25 21:33:46,772 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HStore: Added hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/info/942ba0831a0047a08987439e34361a0c, entries=68116, sequenceid=1534, filesize=68.7 M
2015-06-25 21:33:46,773 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Finished memstore flush of ~76.91 MB/80649344, currentsize=0 B/0 for region TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4. in 723ms, sequenceid=1534, compaction requested=true
2015-06-25 21:33:46,780 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/reached/TestTable
2015-06-25 21:33:46,790 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/TestTable
2015-06-25 21:33:46,791 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort
2015-06-25 21:33:46,803 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired
2015-06-25 21:33:46,818 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort
{code}

Will check what the reason behind it is.
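A sketch of the counter idea (class and method names here are illustrative; the real patch wires the counter into where the compaction's bookkeeping name is generated):

{code}
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: region + store alone can collide when two compactions of the
// same store run concurrently; an incrementing suffix keeps names unique.
public class CompactionNameSketch {
  private static final AtomicInteger NAME_COUNTER = new AtomicInteger(0);

  static String generateCompactionName(String regionName, String storeName) {
    return regionName + "#" + storeName + "#" + NAME_COUNTER.getAndIncrement();
  }

  public static void main(String[] args) {
    // Two concurrent compactions of the same store now get distinct names.
    System.out.println(generateCompactionName("TestTable,283887", "info"));
    System.out.println(generateCompactionName("TestTable,283887", "info"));
  }
}
{code}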
[jira] [Commented] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601295#comment-14601295 ]

Duo Zhang commented on HBASE-13970:
-----------------------------------

Oh, this should be a bug on all branches that contain HBASE-8329. I assumed that compactions could not be executed in parallel on the same store, so the compactionName only contains the regionName and storeName. I think we could add a counter to the name to avoid the conflict.
[jira] [Updated] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Duo Zhang updated HBASE-13970:
------------------------------

Fix Version/s: 1.1.2
               1.2.0
               0.98.14
Affects Version/s: 1.1.1
                   1.2.0
                   0.98.13
Status: Patch Available (was: Open)
[jira] [Updated] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13970: -- Attachment: HBASE-13970-v1.patch Overflow is OK, as [~anoopsamjohn] said: we only need unique Strings. But yes, a negative counter is a bit ugly. In the new patch I rotate the value back to 0 once we reach Integer.MAX_VALUE. Thanks. NPE during compaction in trunk -- Key: HBASE-13970 URL: https://issues.apache.org/jira/browse/HBASE-13970 Project: HBase Issue Type: Bug Affects Versions: 2.0.0, 0.98.13, 1.2.0, 1.1.1 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 2.0.0, 0.98.14, 1.2.0, 1.1.2 Attachments: HBASE-13970-v1.patch, HBASE-13970.patch Updated the trunk.. Loaded the table with PE tool. Trigger a flush to ensure all data is flushed out to disk. When the first compaction is triggered we get an NPE and this is very easy to reproduce {code} 015-06-25 21:33:46,041 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,051 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Flushing 1/1 column families, memstore=76.91 MB 2015-06-25 21:33:46,159 ERROR [regionserver/stobdtserver3/10.224.54.70:16040-longCompactions-1435248183945] regionserver.CompactSplitThread: Compaction failed Request = regionName=TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4., storeName=info, fileCount=3, fileSize=343.4 M (114.5 M, 114.5 M, 114.5 M), priority=3, time=7536968291719985 java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController$ActiveCompaction.access$700(PressureAwareCompactionThroughputController.java:79) at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController.finish(PressureAwareCompactionThroughputController.java:238) at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:306) at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:106) at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:112) at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1202) at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1792) at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:524) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-25 21:33:46,745 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=1534, memsize=76.9 M, hasBloomFilter=true, into tmp file hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/.tmp/942ba0831a0047a08987439e34361a0c 2015-06-25 21:33:46,772 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HStore: Added hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/info/942ba0831a0047a08987439e34361a0c, entries=68116, sequenceid=1534, filesize=68.7 M 2015-06-25 21:33:46,773 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Finished memstore flush of ~76.91
MB/80649344, currentsize=0 B/0 for region TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4. in 723ms, sequenceid=1534, compaction requested=true 2015-06-25 21:33:46,780 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/reached/TestTable 2015-06-25 21:33:46,790 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/TestTable 2015-06-25 21:33:46,791 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort 2015-06-25 21:33:46,803 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,818 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort {code} Will check this on what is the reason behind it. -- This message was
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592928#comment-14592928 ] Duo Zhang commented on HBASE-13937: --- {quote} In theory we should not have isServerReachable() at all. It is a parallel cluster membership mechanism to the already existing zk based one {quote} I'm not familiar with the code of AM and SM, but I think zk alone is not enough to keep things consistent. The disappearance of the EPHEMERAL node on zookeeper does not mean the server is really dead, so we still need a method like isServerReachable to decide whether it is. One example is HDFS HA, where fencing is needed before transitioning the standby namenode to active. {quote} If we get a connection exception or smt, we cannot assume the server is dead. {quote} Agree (and excuse me, what is smt?). In general, only when the server itself tells us it is dead can we be sure it is dead. And after googling I think connection refused is also not that reliable (a firewall or a full backlog queue can also cause connection refused). So I think we need fencing like HDFS HA does? Yeah, you may challenge that if the machine has crashed, how can we be sure the server is dead... Honestly I do not have a perfect solution. Maybe we could introduce a DeadServerManager with several levels of consistency: the lowest level does no fencing at all, the medium level does fencing with a timeout, and the highest level fences forever (until a person stops it, maybe). Thanks. Partially revert HBASE-13172 - Key: HBASE-13937 URL: https://issues.apache.org/jira/browse/HBASE-13937 Project: HBase Issue Type: Sub-task Components: Region Assignment Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 0.98.14, 1.2.0, 1.1.1, 1.3.0 Attachments: hbase-13937_v1.patch, hbase-13937_v2.patch HBASE-13172 is supposed to fix a UT issue, but causes other problems that parent jira (HBASE-13605) is attempting to fix. However, HBASE-13605 patch v4 uncovers at least 2 different issues which are, to put it mildly, major design flaws in AM / RS. Regardless of 13605, the issue with 13172 is that we catch {{ServerNotRunningYetException}} from {{isServerReachable()}} and return false, which then puts the Server to the {{RegionStates.deadServers}} list. Once it is in that list, we can still assign and unassign regions to the RS after it has started (because regular assignment does not check whether the server is in {{RegionStates.deadServers}}. However, after the first assign and unassign, we cannot assign the region again since then the check for the lastServer will think that the server is dead. It turns out that a proper patch for 13605 is very hard without fixing rest of broken AM assumptions (see HBASE-13605, HBASE-13877 and HBASE-13895 for a colorful history). For 1.1.1, I think we should just revert parts of HBASE-13172 for now. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13937) Partially revert HBASE-13172
[ https://issues.apache.org/jira/browse/HBASE-13937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592814#comment-14592814 ] Duo Zhang commented on HBASE-13937: --- For isServerReachable, I think we can say a server is 'dead' if one of the following conditions is satisfied: 1. The server tells us it is dead (I'm not sure whether a RegionServerStoppedException is enough; maybe it is thrown before the regionserver completely shuts down?) 2. It is a server with another start code. 3. We get a connection refused (not a connect timeout). So I think removing the code is reasonable, but then we should catch a connection-refused exception (does our rpc framework throw this exception out? And we do not need to retry here...). Otherwise, if we do not restart a regionserver, we will be stuck in the loop for a long time... And I also do not think it is safe to return 'not reachable' on a timeout... Thanks. Partially revert HBASE-13172 - Key: HBASE-13937 URL: https://issues.apache.org/jira/browse/HBASE-13937 Project: HBase Issue Type: Sub-task Components: Region Assignment Reporter: Enis Soztutar Assignee: Enis Soztutar Fix For: 0.98.14, 1.2.0, 1.1.1, 1.3.0 Attachments: hbase-13937_v1.patch HBASE-13172 is supposed to fix a UT issue, but causes other problems that parent jira (HBASE-13605) is attempting to fix. However, HBASE-13605 patch v4 uncovers at least 2 different issues which are, to put it mildly, major design flaws in AM / RS. Regardless of 13605, the issue with 13172 is that we catch {{ServerNotRunningYetException}} from {{isServerReachable()}} and return false, which then puts the Server to the {{RegionStates.deadServers}} list. Once it is in that list, we can still assign and unassign regions to the RS after it has started (because regular assignment does not check whether the server is in {{RegionStates.deadServers}}. However, after the first assign and unassign, we cannot assign the region again since then the check for the lastServer will think that the server is dead. It turns out that a proper patch for 13605 is very hard without fixing rest of broken AM assumptions (see HBASE-13605, HBASE-13877 and HBASE-13895 for a colorful history). For 1.1.1, I think we should just revert parts of HBASE-13172 for now. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
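For illustration only, the three conditions above might be classified roughly as follows. The class, method, and parameters are hypothetical, not actual AssignmentManager code, and condition 3 carries the firewall/backlog caveats discussed in the later comment:
{code}
import java.net.ConnectException;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.regionserver.RegionServerStoppedException;

// Hypothetical sketch of the "definitely dead" checks listed above.
final class DeadServerCheck {
  static boolean isDefinitelyDead(Throwable error, ServerName expected, ServerName actual) {
    if (error instanceof RegionServerStoppedException) {
      return true; // 1. the server itself told us it is stopped
    }
    if (actual != null && actual.getStartcode() != expected.getStartcode()) {
      return true; // 2. a server with another start code answered
    }
    if (error instanceof ConnectException) {
      return true; // 3. connection refused (not a connect timeout)
    }
    return false; // anything else, e.g. a timeout, proves nothing
  }
}
{code}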
[jira] [Commented] (HBASE-13899) Jacoco instrumentation fails under jdk8
[ https://issues.apache.org/jira/browse/HBASE-13899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585042#comment-14585042 ] Duo Zhang commented on HBASE-13899: --- Seems our jacoco version is really old {code} <jacoco.version>0.6.2.201302030002</jacoco.version> {code} It was released in Feb 2013; the newest version is now 0.7.5. I changed the mvn command to {noformat} install -PrunAllTests -Dmaven.test.redirectTestOutputToFile=true -Dit.test=noItTest -Dhbase.skip-jacoco=false -Djacoco.version=0.7.5.201505241946 {noformat} and it seems to have worked. Jacoco instrumentation fails under jdk8 --- Key: HBASE-13899 URL: https://issues.apache.org/jira/browse/HBASE-13899 Project: HBase Issue Type: Sub-task Components: build, test Affects Versions: 1.2.0 Reporter: Sean Busbey Labels: coverage Fix For: 2.0.0, 1.2.0 Moving the post-commit build for master to also cover jdk8 shows failures when attempting to instrument test for jacoco coverage. example: {code} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:386) at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:401) Caused by: java.lang.RuntimeException: Class java/util/UUID could not be instrumented. at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:138) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:99) at org.jacoco.agent.rt.internal_5d10cad.PreMain.createRuntime(PreMain.java:51) at org.jacoco.agent.rt.internal_5d10cad.PreMain.premain(PreMain.java:43) ... 6 more Caused by: java.lang.NoSuchFieldException: $jacocoAccess at java.lang.Class.getField(Class.java:1695) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:136) ... 9 more FATAL ERROR in native method: processing of -javaagent failed Aborted {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13899) Jacoco instrumentation fails under jdk8
[ https://issues.apache.org/jira/browse/HBASE-13899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13899: -- Assignee: Duo Zhang Affects Version/s: 2.0.0 Status: Patch Available (was: Open) Jacoco instrumentation fails under jdk8 --- Key: HBASE-13899 URL: https://issues.apache.org/jira/browse/HBASE-13899 Project: HBase Issue Type: Sub-task Components: build, test Affects Versions: 2.0.0, 1.2.0 Reporter: Sean Busbey Assignee: Duo Zhang Labels: coverage Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13899.patch Moving the post-commit build for master to also cover jdk8 shows failures when attempting to instrument test for jacoco coverage. example: {code} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:386) at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:401) Caused by: java.lang.RuntimeException: Class java/util/UUID could not be instrumented. at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:138) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:99) at org.jacoco.agent.rt.internal_5d10cad.PreMain.createRuntime(PreMain.java:51) at org.jacoco.agent.rt.internal_5d10cad.PreMain.premain(PreMain.java:43) ... 6 more Caused by: java.lang.NoSuchFieldException: $jacocoAccess at java.lang.Class.getField(Class.java:1695) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:136) ... 9 more FATAL ERROR in native method: processing of -javaagent failed Aborted {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13899) Jacoco instrumentation fails under jdk8
[ https://issues.apache.org/jira/browse/HBASE-13899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13899: -- Attachment: HBASE-13899.patch One line patch. Jacoco instrumentation fails under jdk8 --- Key: HBASE-13899 URL: https://issues.apache.org/jira/browse/HBASE-13899 Project: HBase Issue Type: Sub-task Components: build, test Affects Versions: 1.2.0 Reporter: Sean Busbey Labels: coverage Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13899.patch Moving the post-commit build for master to also cover jdk8 shows failures when attempting to instrument test for jacoco coverage. example: {code} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:386) at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:401) Caused by: java.lang.RuntimeException: Class java/util/UUID could not be instrumented. at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:138) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:99) at org.jacoco.agent.rt.internal_5d10cad.PreMain.createRuntime(PreMain.java:51) at org.jacoco.agent.rt.internal_5d10cad.PreMain.premain(PreMain.java:43) ... 6 more Caused by: java.lang.NoSuchFieldException: $jacocoAccess at java.lang.Class.getField(Class.java:1695) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:136) ... 9 more FATAL ERROR in native method: processing of -javaagent failed Aborted {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
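Presumably the one-liner just bumps the {{jacoco.version}} property shown in the earlier comment to the version that worked with jdk8; the exact location of the property in the pom is an assumption:
{code}
<!-- hypothetical sketch of the one-line change: bump the jacoco property -->
<jacoco.version>0.7.5.201505241946</jacoco.version>
{code}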
[jira] [Updated] (HBASE-13899) Jacoco instrumentation fails under jdk8
[ https://issues.apache.org/jira/browse/HBASE-13899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13899: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Pushed to master and branch-1. Thanks [~busbey]. Jacoco instrumentation fails under jdk8 --- Key: HBASE-13899 URL: https://issues.apache.org/jira/browse/HBASE-13899 Project: HBase Issue Type: Sub-task Components: build, test Affects Versions: 2.0.0, 1.2.0 Reporter: Sean Busbey Assignee: Duo Zhang Labels: coverage Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13899.patch Moving the post-commit build for master to also cover jdk8 shows failures when attempting to instrument test for jacoco coverage. example: {code} Exception in thread main java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:386) at sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:401) Caused by: java.lang.RuntimeException: Class java/util/UUID could not be instrumented. at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:138) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:99) at org.jacoco.agent.rt.internal_5d10cad.PreMain.createRuntime(PreMain.java:51) at org.jacoco.agent.rt.internal_5d10cad.PreMain.premain(PreMain.java:43) ... 6 more Caused by: java.lang.NoSuchFieldException: $jacocoAccess at java.lang.Class.getField(Class.java:1695) at org.jacoco.agent.rt.internal_5d10cad.core.runtime.ModifiedSystemClassRuntime.createFor(ModifiedSystemClassRuntime.java:136) ... 9 more FATAL ERROR in native method: processing of -javaagent failed Aborted {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13903) Speedup IdLock
[ https://issues.apache.org/jira/browse/HBASE-13903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587266#comment-14587266 ] Duo Zhang commented on HBASE-13903: --- What about just using a StripedLock and giving each stripe an individual HashMap? ConcurrentHashMap is fast because it does not require locking when getting an entry from the map most of the time, but here we use putIfAbsent, which always locks a segment. And for the profiling result, is it lots of 'time' or lots of 'cpu' spent in IdLock? This is important. If we spend lots of time waiting on a lock, then the problem is contention, not the implementation of IdLock. Thanks. Speedup IdLock -- Key: HBASE-13903 URL: https://issues.apache.org/jira/browse/HBASE-13903 Project: HBase Issue Type: Improvement Components: regionserver Affects Versions: 2.0.0, 1.0.1, 1.1.0, 0.98.13 Reporter: Matteo Bertozzi Assignee: Matteo Bertozzi Priority: Minor Fix For: 2.0.0, 1.2.0 Attachments: HBASE-13903-v0.patch, IdLockPerf.java while testing the read path, I ended up with the profiler showing a lot of time spent in IdLock. The IdLock is used by the HFileReader and the BucketCache, so you'll see a lot of it when you have an hotspot on a hfile. we end up locked by that synchronized() and with too many calls to map.putIfAbsent() {code} public Entry getLockEntry(long id) throws IOException { while ((existing = map.putIfAbsent(entry.id, entry)) != null) { synchronized (existing) { ... } // If the entry is not locked, it might already be deleted from the // map, so we cannot return it. We need to get our entry into the map // or get someone else's locked entry. } } public void releaseLockEntry(Entry entry) { synchronized (entry) { ... } } {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
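A minimal sketch of the striped-lock idea, assuming a power-of-two stripe count; names are illustrative and this is not a drop-in IdLock replacement:
{code}
import java.util.concurrent.locks.ReentrantLock;

// Sketch: one plain lock per stripe. Ids hashing to different stripes never
// contend, and no ConcurrentHashMap.putIfAbsent is involved at all.
class StripedIdLock {
  private final ReentrantLock[] stripes;

  StripedIdLock(int stripeCount) { // must be a power of two for the mask below
    stripes = new ReentrantLock[stripeCount];
    for (int i = 0; i < stripeCount; i++) {
      stripes[i] = new ReentrantLock();
    }
  }

  private ReentrantLock stripeFor(long id) {
    int h = (int) (id ^ (id >>> 32)); // fold the long id into an int hash
    return stripes[h & (stripes.length - 1)];
  }

  void runLocked(long id, Runnable action) {
    ReentrantLock lock = stripeFor(id);
    lock.lock();
    try {
      action.run(); // only ids mapped to the same stripe wait here
    } finally {
      lock.unlock();
    }
  }
}
{code}
Each stripe could also own its own plain HashMap, as suggested, since access to that map would already be serialized by the stripe's lock.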
[jira] [Commented] (HBASE-13877) Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
[ https://issues.apache.org/jira/browse/HBASE-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582742#comment-14582742 ] Duo Zhang commented on HBASE-13877: --- +1 on patch v3. Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL Key: HBASE-13877 URL: https://issues.apache.org/jira/browse/HBASE-13877 Project: HBase Issue Type: Bug Reporter: Enis Soztutar Assignee: Enis Soztutar Priority: Blocker Fix For: 2.0.0, 1.2.0, 1.1.1 Attachments: hbase-13877_v1.patch, hbase-13877_v2-branch-1.1.patch, hbase-13877_v3-branch-1.1.patch ITBLL with 1.25B rows failed for me (and Stack as reported in https://issues.apache.org/jira/browse/HBASE-13811?focusedCommentId=14577834page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14577834) HBASE-13811 and HBASE-13853 fixed an issue with WAL edit filtering. The root cause this time seems to be different. It is due to procedure based flush interrupting the flush request in case the procedure is cancelled from an exception elsewhere. This leaves the memstore snapshot intact without aborting the server. The next flush, then flushes the previous memstore with the current seqId (as opposed to seqId from the memstore snapshot). This creates an hfile with larger seqId than what its contents are. Previous behavior in 0.98 and 1.0 (I believe) is that after flush prepare and interruption / exception will cause RS abort. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570198#comment-14570198 ] Duo Zhang edited comment on HBASE-13811 at 6/3/15 3:02 AM: --- Write a testcase. It fails for me with {noformat} java.lang.AssertionError: actual array was null at org.apache.hadoop.hbase.regionserver.TestSplitWalDataLoss.test(TestSplitWalDataLoss.java:153) {noformat} I think the fix should also be applied to branch-1.1 since it will cause data loss even if we turn off flush per CF(but no issues if you turn on DLR...). [~ndimiduk] was (Author: apache9): Write a testcase. It fails for me with {noformat} java.lang.AssertionError: actual array was null at org.apache.hadoop.hbase.regionserver.TestSplitWalDataLoss.test(TestSplitWalDataLoss.java:153) {noformat} I think the fix should also be applied to branch-1.1 since it will cause data loss even if we turn off flush per CF. [~ndimiduk] Splitting WALs, we are filtering out too many edits - DATALOSS --- Key: HBASE-13811 URL: https://issues.apache.org/jira/browse/HBASE-13811 Project: HBase Issue Type: Bug Components: wal Affects Versions: 2.0.0, 1.2.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 1.2.0 Attachments: 13811.branch-1.txt, 13811.txt, HBASE-13811.testcase.patch I've been running ITBLLs against branch-1 around HBASE-13616 (move of ServerShutdownHandler to pv2). I have come across an instance of dataloss. My patch for HBASE-13616 was in place so can only think it the cause (but cannot see how). When we split the logs, we are skipping legit edits. Digging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13811: -- Attachment: HBASE-13811.testcase.patch Write a testcase. It fails for me with {noformat} java.lang.AssertionError: actual array was null at org.apache.hadoop.hbase.regionserver.TestSplitWalDataLoss.test(TestSplitWalDataLoss.java:153) {noformat} I think the fix should also be applied to branch-1.1 since it will cause data loss even if we turn off flush per CF. [~ndimiduk] Splitting WALs, we are filtering out too many edits - DATALOSS --- Key: HBASE-13811 URL: https://issues.apache.org/jira/browse/HBASE-13811 Project: HBase Issue Type: Bug Components: wal Affects Versions: 2.0.0, 1.2.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 1.2.0 Attachments: 13811.branch-1.txt, 13811.txt, HBASE-13811.testcase.patch I've been running ITBLLs against branch-1 around HBASE-13616 (move of ServerShutdownHandler to pv2). I have come across an instance of dataloss. My patch for HBASE-13616 was in place so can only think it the cause (but cannot see how). When we split the logs, we are skipping legit edits. Digging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14570273#comment-14570273 ] Duo Zhang commented on HBASE-13811: --- Yeah, with the patch in place {code} for (;;) { RegionStoreSequenceIds ids = testUtil.getMiniHBaseCluster().getMaster().getServerManager() .getLastFlushedSequenceId(region.getRegionInfo().getEncodedNameAsBytes()); if (ids.getStoreSequenceId(0).getSequenceId() > oldestSeqIdOfStore) { break; } Thread.sleep(100); } {code} This will not happen so the testcase does not quit... Let me see... Splitting WALs, we are filtering out too many edits - DATALOSS --- Key: HBASE-13811 URL: https://issues.apache.org/jira/browse/HBASE-13811 Project: HBase Issue Type: Bug Components: wal Affects Versions: 2.0.0, 1.2.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 1.2.0 Attachments: 13811.branch-1.txt, 13811.txt, HBASE-13811.testcase.patch I've been running ITBLLs against branch-1 around HBASE-13616 (move of ServerShutdownHandler to pv2). I have come across an instance of dataloss. My patch for HBASE-13616 was in place so can only think it the cause (but cannot see how). When we split the logs, we are skipping legit edits. Digging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13811) Splitting WALs, we are filtering out too many edits - DATALOSS
[ https://issues.apache.org/jira/browse/HBASE-13811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13811: -- Attachment: HBASE-13811-v1.testcase.patch [~stack] Try this one? It just makes a RegionServerReport manually instead of waiting for the regionserver thread to do it for us. I modified getEarliestMemstoreSeqNum and it passed locally. Splitting WALs, we are filtering out too many edits - DATALOSS --- Key: HBASE-13811 URL: https://issues.apache.org/jira/browse/HBASE-13811 Project: HBase Issue Type: Bug Components: wal Affects Versions: 2.0.0, 1.2.0 Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 1.2.0 Attachments: 13811.branch-1.txt, 13811.txt, HBASE-13811-v1.testcase.patch, HBASE-13811.testcase.patch I've been running ITBLLs against branch-1 around HBASE-13616 (move of ServerShutdownHandler to pv2). I have come across an instance of dataloss. My patch for HBASE-13616 was in place so can only think it the cause (but cannot see how). When we split the logs, we are skipping legit edits. Digging. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609386#comment-14609386 ] Duo Zhang commented on HBASE-13970: --- [~ram_krish] Any progress? Thanks. NPE during compaction in trunk -- Key: HBASE-13970 URL: https://issues.apache.org/jira/browse/HBASE-13970 Project: HBase Issue Type: Bug Affects Versions: 2.0.0, 0.98.13, 1.2.0, 1.1.1 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 2.0.0, 0.98.14, 1.2.0, 1.1.2 Attachments: HBASE-13970-v1.patch, HBASE-13970.patch Updated the trunk.. Loaded the table with PE tool. Trigger a flush to ensure all data is flushed out to disk. When the first compaction is triggered we get an NPE and this is very easy to reproduce {code} 015-06-25 21:33:46,041 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,051 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Flushing 1/1 column families, memstore=76.91 MB 2015-06-25 21:33:46,159 ERROR [regionserver/stobdtserver3/10.224.54.70:16040-longCompactions-1435248183945] regionserver.CompactSplitThread: Compaction failed Request = regionName=TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4., storeName=info, fileCount=3, fileSize=343.4 M (114.5 M, 114.5 M, 114.5 M), priority=3, time=7536968291719985 java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController$ActiveCompaction.access$700(PressureAwareCompactionThroughputController.java:79) at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController.finish(PressureAwareCompactionThroughputController.java:238) at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:306) at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:106) at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:112) at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1202) at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1792) at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:524) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-25 21:33:46,745 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=1534, memsize=76.9 M, hasBloomFilter=true, into tmp file hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/.tmp/942ba0831a0047a08987439e34361a0c 2015-06-25 21:33:46,772 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HStore: Added hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/info/942ba0831a0047a08987439e34361a0c, entries=68116, sequenceid=1534, filesize=68.7 M 2015-06-25 21:33:46,773 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Finished memstore flush of ~76.91 MB/80649344, currentsize=0 B/0 for region TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4. 
in 723ms, sequenceid=1534, compaction requested=true 2015-06-25 21:33:46,780 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/reached/TestTable 2015-06-25 21:33:46,790 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/TestTable 2015-06-25 21:33:46,791 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort 2015-06-25 21:33:46,803 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,818 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort {code} Will check this on what is the reason behind it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-13970: -- Resolution: Fixed Assignee: Duo Zhang (was: ramkrishna.s.vasudevan) Hadoop Flags: Reviewed Fix Version/s: (was: 1.0.2) Status: Resolved (was: Patch Available) Pushed to all branches except branch-1.0(HBASE-8329 has not applied to branch-1.0). Thanks [~anoopsamjohn] and [~ram_krish]. NPE during compaction in trunk -- Key: HBASE-13970 URL: https://issues.apache.org/jira/browse/HBASE-13970 Project: HBase Issue Type: Bug Affects Versions: 2.0.0, 0.98.13, 1.2.0, 1.1.1 Reporter: ramkrishna.s.vasudevan Assignee: Duo Zhang Fix For: 2.0.0, 0.98.14, 1.1.2, 1.3.0, 1.2.1 Attachments: HBASE-13970-v1.patch, HBASE-13970.patch Updated the trunk.. Loaded the table with PE tool. Trigger a flush to ensure all data is flushed out to disk. When the first compaction is triggered we get an NPE and this is very easy to reproduce {code} 015-06-25 21:33:46,041 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,051 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Flushing 1/1 column families, memstore=76.91 MB 2015-06-25 21:33:46,159 ERROR [regionserver/stobdtserver3/10.224.54.70:16040-longCompactions-1435248183945] regionserver.CompactSplitThread: Compaction failed Request = regionName=TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4., storeName=info, fileCount=3, fileSize=343.4 M (114.5 M, 114.5 M, 114.5 M), priority=3, time=7536968291719985 java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController$ActiveCompaction.access$700(PressureAwareCompactionThroughputController.java:79) at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController.finish(PressureAwareCompactionThroughputController.java:238) at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:306) at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:106) at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:112) at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1202) at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1792) at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:524) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-25 21:33:46,745 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=1534, memsize=76.9 M, hasBloomFilter=true, into tmp file hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/.tmp/942ba0831a0047a08987439e34361a0c 2015-06-25 21:33:46,772 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HStore: Added hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/info/942ba0831a0047a08987439e34361a0c, entries=68116, sequenceid=1534, filesize=68.7 M 2015-06-25 21:33:46,773 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] 
regionserver.HRegion: Finished memstore flush of ~76.91 MB/80649344, currentsize=0 B/0 for region TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4. in 723ms, sequenceid=1534, compaction requested=true 2015-06-25 21:33:46,780 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/reached/TestTable 2015-06-25 21:33:46,790 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/TestTable 2015-06-25 21:33:46,791 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event: /hbase/flush-table-proc/abort 2015-06-25 21:33:46,803 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,818 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure abort children changed event:
[jira] [Commented] (HBASE-14262) Big Trunk unit tests failing with OutOfMemoryError: unable to create new native thread
[ https://issues.apache.org/jira/browse/HBASE-14262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704233#comment-14704233 ] Duo Zhang commented on HBASE-14262: --- {quote} If I revert HBASE-13065, TestAcidGuarantees passes on my local mac... {quote} This could happen, since reducing the heap size also increases the native memory available to us, so we can create more threads... And I dealt with TestAcidGuarantees when the new AsyncRpcClient was introduced; I remember that in the end I created a static EventLoopGroup and made all AsyncRpcClient instances share it to reduce the thread count... So I suggest we verify the jstack result first to see if we really create too many threads? Maybe we are right at the precipice, so sometimes it fails and sometimes it does not... Thanks. Big Trunk unit tests failing with OutOfMemoryError: unable to create new native thread Key: HBASE-14262 URL: https://issues.apache.org/jira/browse/HBASE-14262 Project: HBase Issue Type: Bug Components: test Reporter: stack Assignee: stack The bit unit tests are coming in with OOME, can't create native threads. I was also getting the OOME locally running on MBP. git bisect got me to HBASE-13065, where we upped the test heap for TestDistributedLogSplitting back in feb. Around the time that this went in, we had similar OOME issues but then it was because we were doing 32bit JVMs. It does not seem to be the case here. A recent run failed all the below and most are OOME: {code} {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplicationEndpoint org.apache.hadoop.hbase.replication.TestPerTableCFReplication org.apache.hadoop.hbase.wal.TestBoundedRegionGroupingProvider org.apache.hadoop.hbase.replication.TestReplicationKillMasterRSCompressed org.apache.hadoop.hbase.replication.regionserver.TestRegionReplicaReplicationEndpointNoMaster org.apache.hadoop.hbase.replication.TestReplicationKillSlaveRS org.apache.hadoop.hbase.replication.regionserver.TestRegionReplicaReplicationEndpoint org.apache.hadoop.hbase.replication.TestMasterReplication org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.regionserver.TestRegionMergeTransactionOnCluster org.apache.hadoop.hbase.regionserver.TestRegionFavoredNodes org.apache.hadoop.hbase.replication.TestMultiSlaveReplication org.apache.hadoop.hbase.zookeeper.TestZKLeaderManager org.apache.hadoop.hbase.master.TestMasterRestartAfterDisablingTable org.apache.hadoop.hbase.TestGlobalMemStoreSize org.apache.hadoop.hbase.wal.TestWALFiltering org.apache.hadoop.hbase.replication.TestReplicationSmallTests org.apache.hadoop.hbase.replication.TestReplicationSyncUpTool org.apache.hadoop.hbase.replication.TestReplicationWithTags org.apache.hadoop.hbase.master.procedure.TestTruncateTableProcedure org.apache.hadoop.hbase.replication.TestReplicationChangingPeerRegionservers org.apache.hadoop.hbase.wal.TestDefaultWALProviderWithHLogKey org.apache.hadoop.hbase.regionserver.TestPerColumnFamilyFlush org.apache.hadoop.hbase.snapshot.TestMobRestoreFlushSnapshotFromClient org.apache.hadoop.hbase.master.procedure.TestCreateTableProcedure org.apache.hadoop.hbase.wal.TestWALFactory org.apache.hadoop.hbase.master.procedure.TestServerCrashProcedure org.apache.hadoop.hbase.replication.TestReplicationDisableInactivePeer org.apache.hadoop.hbase.master.procedure.TestAddColumnFamilyProcedure org.apache.hadoop.hbase.mapred.TestMultiTableSnapshotInputFormat org.apache.hadoop.hbase.master.procedure.TestEnableTableProcedure
org.apache.hadoop.hbase.master.TestMasterFailoverBalancerPersistence org.apache.hadoop.hbase.TestStochasticBalancerJmxMetrics {color:red}-1 core zombie tests{color}. There are 16 zombie test(s): at org.apache.hadoop.hbase.security.visibility.TestVisibilityLabels.testVisibilityLabelsWithComplexLabels(TestVisibilityLabels.java:216) at org.apache.hadoop.hbase.mapred.TestTableInputFormat.testTableRecordReaderScannerFail(TestTableInputFormat.java:281) at
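For reference, the thread-sharing trick mentioned in the comment above looks roughly like this; class and field names are illustrative, not the actual AsyncRpcClient code:
{code}
import io.netty.channel.nio.NioEventLoopGroup;

// Sketch: one static, shared event loop group with a small fixed IO thread
// count, instead of a fresh group (and fresh native threads) per client.
class SharedEventLoops {
  static final NioEventLoopGroup GROUP = new NioEventLoopGroup(2);
}
{code}
Every client instance then reuses {{SharedEventLoops.GROUP}} instead of constructing its own group, which keeps the native thread count flat no matter how many clients a test creates.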
[jira] [Commented] (HBASE-13970) NPE during compaction in trunk
[ https://issues.apache.org/jira/browse/HBASE-13970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602704#comment-14602704 ] Duo Zhang commented on HBASE-13970: --- No, the counter value cannot be the same for 2 threads. For a given value of the counter, only one CAS operation can succeed; the failed thread will try another round, by which time the value of the counter will have changed. This is the getAndIncrement method in AtomicInteger. The only difference is how the next value is calculated. {code} /** * Atomically increments by one the current value. * * @return the previous value */ public final int getAndIncrement() { for (;;) { int current = get(); int next = current + 1; if (compareAndSet(current, next)) return current; } } {code} NPE during compaction in trunk -- Key: HBASE-13970 URL: https://issues.apache.org/jira/browse/HBASE-13970 Project: HBase Issue Type: Bug Affects Versions: 2.0.0, 0.98.13, 1.2.0, 1.1.1 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 2.0.0, 0.98.14, 1.2.0, 1.1.2 Attachments: HBASE-13970-v1.patch, HBASE-13970.patch Updated the trunk.. Loaded the table with PE tool. Trigger a flush to ensure all data is flushed out to disk. When the first compaction is triggered we get an NPE and this is very easy to reproduce {code} 015-06-25 21:33:46,041 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure start children changed event: /hbase/flush-table-proc/acquired 2015-06-25 21:33:46,051 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Flushing 1/1 column families, memstore=76.91 MB 2015-06-25 21:33:46,159 ERROR [regionserver/stobdtserver3/10.224.54.70:16040-longCompactions-1435248183945] regionserver.CompactSplitThread: Compaction failed Request = regionName=TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4., storeName=info, fileCount=3, fileSize=343.4 M (114.5 M, 114.5 M, 114.5 M), priority=3, time=7536968291719985 java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController$ActiveCompaction.access$700(PressureAwareCompactionThroughputController.java:79) at org.apache.hadoop.hbase.regionserver.compactions.PressureAwareCompactionThroughputController.finish(PressureAwareCompactionThroughputController.java:238) at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:306) at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:106) at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:112) at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1202) at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1792) at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:524) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-06-25 21:33:46,745 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.DefaultStoreFlusher: Flushed, sequenceid=1534, memsize=76.9 M, hasBloomFilter=true, into tmp file hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/.tmp/942ba0831a0047a08987439e34361a0c 2015-06-25 21:33:46,772 INFO
[rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HStore: Added hdfs://stobdtserver3:9010/hbase/data/default/TestTable/028fb0324cd6eb03d5022eb8c147b7c4/info/942ba0831a0047a08987439e34361a0c, entries=68116, sequenceid=1534, filesize=68.7 M 2015-06-25 21:33:46,773 INFO [rs(stobdtserver3,16040,1435248182301)-flush-proc-pool3-thread-1] regionserver.HRegion: Finished memstore flush of ~76.91 MB/80649344, currentsize=0 B/0 for region TestTable,283887,1435248198798.028fb0324cd6eb03d5022eb8c147b7c4. in 723ms, sequenceid=1534, compaction requested=true 2015-06-25 21:33:46,780 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/reached/TestTable 2015-06-25 21:33:46,790 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received created event:/hbase/flush-table-proc/abort/TestTable 2015-06-25 21:33:46,791 INFO [main-EventThread] procedure.ZKProcedureMemberRpcs: Received procedure
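The wrap-around variant from the v1 patch would then presumably differ only in that next-value computation, roughly like this (a sketch of the idea, not the committed code):
{code}
// Same CAS loop as getAndIncrement above, but the next value wraps to 0 at
// Integer.MAX_VALUE so the counter never goes negative.
public final int getAndIncrementWithWrap() {
  for (;;) {
    int current = get();
    int next = (current == Integer.MAX_VALUE) ? 0 : current + 1;
    if (compareAndSet(current, next))
      return current;
  }
}
{code}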
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14650729#comment-14650729 ] Duo Zhang commented on HBASE-14178: --- I think the problem here is that we only allow one thread to read an HFileBlock at a time, even if the BlockCache is disabled. If the BlockCache is enabled this is useful, since other threads can get the block directly from the BlockCache after one thread has successfully read it and put it there. But it is needless overhead when the BlockCache is disabled (although I think one should always turn on the BlockCache if there are many requests for the same HFileBlock); in that case we should just read the HFileBlock from HDFS. [~chenheng], could you please prepare a patch for master first? Thanks. regionserver blocks because of waiting for offsetLock - Key: HBASE-14178 URL: https://issues.apache.org/jira/browse/HBASE-14178 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.98.6 Reporter: Heng Chen Priority: Critical Fix For: 0.98.6 Attachments: HBASE-14178-0.98.patch, jstack My regionserver blocks, and all client rpc timeout. I print the regionserver's jstack, it seems a lot of thread was blocked for waiting offsetLock, detail infomation belows: PS: my table's block cache is off {code} B.DefaultRpcServer.handler=2,queue=2,port=60020 #82 daemon prio=5 os_prio=0 tid=0x01827000 nid=0x2cdc in Object.wait() [0x7f3831b72000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79) - locked 0x000773af7c18 (a org.apache.hadoop.hbase.util.IdLock$Entry) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:352) at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:524) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:572) at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257) at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173) at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:313) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269) at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695) at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683) at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3889) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3969) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3847) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3820) - locked 0x0005e5c55ad0 (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3807) at
org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4779) at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4753) at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2916) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29583) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - 0x0005e5c55c08 (a java.util.concurrent.locks.ReentrantLock$NonfairSync) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653201#comment-14653201 ] Duo Zhang commented on HBASE-14178: --- [~anoopsamjohn] {{CacheConfig}} is a bit confusing, I think. {{family.isBlockCacheEnabled}} only maps to {{cacheDataOnRead}}, and we still have a chance to put data into the {{BlockCache}} if we set {{cacheDataOnWrite}} or {{prefetchOnOpen}} to {{true}}, even if {{cacheDataOnRead}} is {{false}}. So I suggest we add a new method called {{shouldReadBlockFromCache}} that checks every way a block may end up in the {{BlockCache}}. Thanks. regionserver blocks because of waiting for offsetLock - Key: HBASE-14178 URL: https://issues.apache.org/jira/browse/HBASE-14178 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.98.6 Reporter: Heng Chen Priority: Critical Fix For: 0.98.6 Attachments: HBASE-14178-0.98.patch, HBASE-14178.patch, HBASE-14178_v1.patch, HBASE-14178_v2.patch, HBASE-14178_v3.patch, HBASE-14178_v4.patch, jstack My regionserver blocks, and all client rpc timeout. I print the regionserver's jstack, it seems a lot of threads were blocked for waiting offsetLock, detail infomation belows: PS: my table's block cache is off {code} B.DefaultRpcServer.handler=2,queue=2,port=60020 #82 daemon prio=5 os_prio=0 tid=0x01827000 nid=0x2cdc in Object.wait() [0x7f3831b72000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79) - locked 0x000773af7c18 (a org.apache.hadoop.hbase.util.IdLock$Entry) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:352) at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:524) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:572) at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257) at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173) at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:313) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269) at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695) at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683) at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3889) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3969) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3847) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3820) - locked 0x0005e5c55ad0 (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3807) at
org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4779) at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4753) at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2916) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29583) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) at java.lang.Thread.run(Thread.java:745) Locked ownable synchronizers: - 0x0005e5c55c08 (a java.util.concurrent.locks.ReentrantLock$NonfairSync) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
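A minimal sketch of what such a method might look like, using the existing {{CacheConfig}} flags; the exact set of conditions in the final patch may differ:
{code}
import org.apache.hadoop.hbase.io.hfile.CacheConfig;

// Sketch: only take the per-block IdLock / cache lookup path when some
// CacheConfig setting could actually have placed the block in the cache.
final class CacheReadCheck {
  static boolean shouldReadBlockFromCache(CacheConfig cacheConf) {
    return cacheConf.shouldCacheDataOnRead()
        || cacheConf.shouldCacheDataOnWrite()
        || cacheConf.shouldPrefetchOnOpen();
  }
}
{code}
When this returns false, the reader could skip the offsetLock entirely and stream the block straight from HDFS, which is the contended path in the jstack above.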
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14653415#comment-14653415 ] Duo Zhang commented on HBASE-14178: --- Yes, the problem here is the lock, not when to read from the cache... So if we can make sure the block will not be put into the cache after we fetch it from HDFS, then we can bypass the locking step.
regionserver blocks because of waiting for offsetLock
Key: HBASE-14178
URL: https://issues.apache.org/jira/browse/HBASE-14178
Project: HBase
Issue Type: Bug
Components: regionserver
Affects Versions: 0.98.6
Reporter: Heng Chen
Priority: Critical
Fix For: 0.98.6
Attachments: HBASE-14178-0.98.patch, HBASE-14178.patch, HBASE-14178_v1.patch, HBASE-14178_v2.patch, HBASE-14178_v3.patch, HBASE-14178_v4.patch, jstack
My regionserver blocks and all client RPCs time out. I printed the regionserver's jstack; it seems a lot of threads were blocked waiting for the offsetLock. Detailed information below (PS: my table's block cache is off):
{code}
B.DefaultRpcServer.handler=2,queue=2,port=60020 #82 daemon prio=5 os_prio=0 tid=0x01827000 nid=0x2cdc in Object.wait() [0x7f3831b72000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79)
        - locked 0x000773af7c18 (a org.apache.hadoop.hbase.util.IdLock$Entry)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:352)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:524)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:572)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
        at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:313)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3889)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3969)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3847)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3820)
        - locked 0x0005e5c55ad0 (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3807)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4779)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4753)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2916)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29583)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
        at java.lang.Thread.run(Thread.java:745)
   Locked ownable synchronizers:
        - 0x0005e5c55c08 (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
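To make the shortcut concrete, here is a minimal, self-contained model of the per-offset locking being discussed (hypothetical names, not the committed patch; the real HFileReaderV2.readBlock is more involved). One lock per block offset serializes readers so that only one of them goes to HDFS and the rest pick the block up from the cache; when the family can never populate the cache, the lock and the second lookup are skipped entirely:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: models the locking scheme, not HBase's actual implementation.
class BlockReaderModel {
  private final Map<Long, byte[]> blockCache = new ConcurrentHashMap<>();
  private final Map<Long, ReentrantLock> offsetLocks = new ConcurrentHashMap<>();
  private final boolean cacheOnRead; // stands in for the family-level setting

  BlockReaderModel(boolean cacheOnRead) {
    this.cacheOnRead = cacheOnRead;
  }

  byte[] readBlock(long offset) {
    if (!cacheOnRead) {
      // The block will never be put into the cache, so no other thread can
      // do the read for us: go straight to the filesystem, no lock needed.
      return readFromHdfs(offset);
    }
    byte[] cached = blockCache.get(offset);
    if (cached != null) {
      return cached;
    }
    ReentrantLock lock = offsetLocks.computeIfAbsent(offset, k -> new ReentrantLock());
    lock.lock();
    try {
      // Re-check under the lock: another thread may have cached it meanwhile.
      cached = blockCache.get(offset);
      if (cached != null) {
        return cached;
      }
      byte[] block = readFromHdfs(offset);
      blockCache.put(offset, block);
      return block;
    } finally {
      lock.unlock();
    }
  }

  private byte[] readFromHdfs(long offset) {
    return new byte[0]; // placeholder for the actual HDFS read
  }
}
{code}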
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Attachment: (was: HBASE-14178_v6.patch)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Attachment: HBASE-14178_v6.patch A flaky test (TestFastFail) failed in pre-commit. Retry.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Attachment: (was: HBASE-14178_v6.patch)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Attachment: HBASE-14178_v6.patch Retry.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654613#comment-14654613 ] Duo Zhang commented on HBASE-14178: --- Oh, I looked at the wrong line number... It is the same reason: if {{blockType}} is {{null}}, the safe way is to read it from the {{BlockCache}} first. Maybe it is an {{INDEX}} or {{BLOOM}} block. Thanks.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Attachment: HBASE-14178_v6.patch Retry.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Attachment: (was: HBASE-14178_v6.patch)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14654607#comment-14654607 ] Duo Zhang commented on HBASE-14178: --- [~tedyu] If {{blockType}} is {{null}} then we cannot determine whether we could cache the block until we actually read it from HDFS. So I think it is appropriate to return false here? We should always acquire the lock unless we can make sure the block will not be cached. Thanks.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
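A hedged, self-contained sketch of that rule (the helper name is taken from the discussion, but its exact shape and polarity in the patch under review may differ): treat an unknown block type as potentially cacheable, so the lock-and-check path stays in use unless the block surely cannot end up in the cache:
{code}
// Hypothetical sketch of the conservative rule above, not the actual patch:
// always take the lock unless we can make sure the block will not be cached.
enum BlockType { DATA, INDEX, BLOOM }

final class CacheMissPolicy {
  static boolean shouldLockOnCacheMiss(BlockType blockType, boolean cacheDataOnRead) {
    if (blockType == null) {
      // Unknown type: it may turn out to be an INDEX or BLOOM block that is
      // cached even when data-block caching is off, so stay on the safe path.
      return true;
    }
    if (blockType == BlockType.DATA) {
      return cacheDataOnRead; // data blocks reach the cache only when enabled
    }
    return true; // index/bloom blocks are generally cacheable
  }
}
{code}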
[jira] [Commented] (HBASE-13408) HBase In-Memory Memstore Compaction
[ https://issues.apache.org/jira/browse/HBASE-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14647348#comment-14647348 ] Duo Zhang commented on HBASE-13408: --- OK, I get your point. After a memstore compaction we may drop some old cells, so setting a new value for the {{oldestUnflushedSeqId}} in the WAL is reasonable. And yes, this can keep the WAL from triggering a flush for log truncation in your case. But I still think you can find a way to set it without changing the semantics of flush... Flush is a very critical operation in HBase, so you should keep away from it as much as possible unless you have to... Or, a more difficult way: remove the old flush operation and introduce some new operations such as 'reduce your memory usage' and 'persist old cells' and so on. You can put your compaction logic in the 'reduce your memory usage' operation. Thanks.
HBase In-Memory Memstore Compaction
Key: HBASE-13408
URL: https://issues.apache.org/jira/browse/HBASE-13408
Project: HBase
Issue Type: New Feature
Reporter: Eshcar Hillel
Attachments: HBaseIn-MemoryMemstoreCompactionDesignDocument-ver02.pdf, HBaseIn-MemoryMemstoreCompactionDesignDocument.pdf, InMemoryMemstoreCompactionEvaluationResults.pdf
A store unit holds a column family in a region, where the memstore is its in-memory component. The memstore absorbs all updates to the store; from time to time these updates are flushed to a file on disk, where they are compacted. Unlike disk components, the memstore is not compacted until it is written to the filesystem and optionally to block-cache. This may result in underutilization of the memory due to duplicate entries per row, for example, when hot data is continuously updated. Generally, the faster the data is accumulated in memory, the more flushes are triggered, and the data sinks to disk more frequently, slowing down retrieval of data, even if very recent. In high-churn workloads, compacting the memstore can help maintain the data in memory, and thereby speed up data retrieval. We suggest a new compacted memstore with the following principles:
1. The data is kept in memory for as long as possible
2. Memstore data is either compacted or in process of being compacted
3. Allow a panic mode, which may interrupt an in-progress compaction and force a flush of part of the memstore.
We suggest applying this optimization only to in-memory column families. A design document is attached. This feature was previously discussed in HBASE-5311.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-13408) HBase In-Memory Memstore Compaction
[ https://issues.apache.org/jira/browse/HBASE-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646101#comment-14646101 ] Duo Zhang commented on HBASE-13408: --- The things we are talking about here are all 'In Memory', so I do not think we need to modify the WAL... I think all the logic could be done in a special memstore implementation? For example, you can set the flush-size to 128M, and introduce a compact-size of 32M which only considers the active set size. When the active set reaches 32M, you put it into the pipeline and try to compact the segments in the pipeline to reduce memory usage. The upper layer does not care about how many segments you have; it only cares about the total memstore size. If that reaches 128M then a flush request comes, and you should flush all data to disk. If there are many redundant cells, the total memstore may never reach 128M; I think this is exactly what we want here? And this way you do not change the semantics of flush, so log truncation should work as well. And I think you could use some more compact data structures instead of a skip list, since the segments in the pipeline are read only? This may bring some benefits even if we do not have many redundant cells. What do you think? [~eshcar]. Sorry a bit late. Thanks.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
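An illustrative sketch of the thresholds described above (hypothetical names and numbers; the real design is in the attached documents): in-memory compaction is driven by the active set size alone, while the flush decision still looks only at the total memstore size, leaving flush semantics untouched:
{code}
// Illustrative model of the proposal above (hypothetical names and sizes).
class CompactingMemstoreModel {
  static final long FLUSH_SIZE = 128L * 1024 * 1024;  // assumed 128M flush-size
  static final long COMPACT_SIZE = 32L * 1024 * 1024; // assumed 32M compact-size

  private long activeSize;   // mutable active set, e.g. backed by a skip list
  private long pipelineSize; // read-only segments, candidates for compaction

  void onWrite(long cellSize) {
    activeSize += cellSize;
    if (activeSize >= COMPACT_SIZE) {
      // Move the active set into the read-only pipeline and compact there;
      // the upper layer never sees the individual segments.
      pipelineSize += compactInMemory(activeSize);
      activeSize = 0;
    }
    if (activeSize + pipelineSize >= FLUSH_SIZE) {
      // Unchanged flush semantics: when the total memstore size is reached,
      // everything goes to disk, so WAL truncation keeps working as before.
      flushAllToDisk();
    }
  }

  private long compactInMemory(long size) {
    return size / 2; // placeholder: redundant cells dropped by compaction
  }

  private void flushAllToDisk() {
    activeSize = 0;
    pipelineSize = 0;
  }
}
{code}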
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659524#comment-14659524 ] Duo Zhang commented on HBASE-14178: --- Yes, we do one more round of checking with the lock held because another thread may have already cached the block for us. What happens here is that we disable BC for the given family, so it is impossible for another thread to do the work for us; we can just read from HDFS and bypass the second round of checking the BC. And as I mentioned above, there are lots of configurations for BC, and {{family.isBlockCacheEnabled()}} is treated as {{cacheDataOnRead}} (you can see the code pasted by [~chenheng]; maybe it is a mistake, but it is not important for this issue I think, we could open another issue for it). So the safe way to determine whether we need the second 'read BC with lock' round is to check whether we will put the block back into the BC after we read it from HDFS. This is why we introduce a {{shouldLockOnCacheMiss}} method here. Maybe we could change the name to {{shouldReadAgainWithLockOnCacheMiss}}? Thanks.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659574#comment-14659574 ] Duo Zhang edited comment on HBASE-14178 at 8/6/15 6:46 AM: --- [~ram_krish] A bit strange, but now {{family.isBlockCacheEnabled}} only means {{cacheDataOnRead}}... {{cacheDataOnWrite}} and some other configurations such as {{prefetchOnOpen}} are separate, which means you can set them to {{true}} and they will take effect even if you explicitly call {{family.setBlockCacheEnabled(false)}}. So theoretically, even if you disable BC at the family level, it is still possible that we could find the block in the BC...
was (Author: apache9): [~ram_krish] A bit strange but now, {{family.isBlockCacheEnabled}} only means {{cacheDataOnRead }}... {{cacheDataOnWrite}} and some other configurations such as {{prefetchOnOpen}} are separated which means you can set them to {{true}} and will take effect even if you explicitly call {{family.setBlockCacheEnabled(false)}}. So theoretically even if you disable BC at family level, it is still possible that we could find the block in BC...
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659574#comment-14659574 ] Duo Zhang commented on HBASE-14178: --- [~ram_krish] A bit strange, but now {{family.isBlockCacheEnabled}} only means {{cacheDataOnRead}}... {{cacheDataOnWrite}} and some other configurations such as {{prefetchOnOpen}} are separate, which means you can set them to {{true}} and they will take effect even if you explicitly call {{family.setBlockCacheEnabled(false)}}. So theoretically, even if you disable BC at the family level, it is still possible that we could find the block in the BC...
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
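A simplified model of the cache-configuration behavior described above (not the real HBase CacheConfig; names are for illustration): the family-level switch maps only to cache-on-read, while cache-on-write and prefetch-on-open are independent settings, so a block can still land in the cache even with the family-level cache "disabled":
{code}
// Sketch only: models the configuration semantics discussed in the comment.
class CacheConfigModel {
  final boolean cacheDataOnRead;  // what family.isBlockCacheEnabled() maps to
  final boolean cacheDataOnWrite; // independent setting
  final boolean prefetchOnOpen;   // independent prefetch setting

  CacheConfigModel(boolean cacheDataOnRead, boolean cacheDataOnWrite, boolean prefetchOnOpen) {
    this.cacheDataOnRead = cacheDataOnRead;
    this.cacheDataOnWrite = cacheDataOnWrite;
    this.prefetchOnOpen = prefetchOnOpen;
  }

  // A cache hit is possible whenever any path can populate the cache, which
  // is why disabling BC at the family level alone is not enough to skip it.
  boolean blockMayBeInCache() {
    return cacheDataOnRead || cacheDataOnWrite || prefetchOnOpen;
  }
}
{code}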
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659614#comment-14659614 ] Duo Zhang commented on HBASE-14178: --- [~anoopsamjohn] [~ram_krish] Any concerns on patch v7 then? Thanks.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Assignee: Heng Chen Priority: Major (was: Critical) Fix Version/s: (was: 0.98.6) 1.1.2 1.2.0 1.0.2 0.98.14 2.0.0 regionserver blocks because of waiting for offsetLock - Key: HBASE-14178 URL: https://issues.apache.org/jira/browse/HBASE-14178 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.98.6 Reporter: Heng Chen Assignee: Heng Chen Fix For: 2.0.0, 0.98.14, 1.0.2, 1.2.0, 1.1.2 Attachments: HBASE-14178-0.98.patch, HBASE-14178.patch, HBASE-14178_v1.patch, HBASE-14178_v2.patch, HBASE-14178_v3.patch, HBASE-14178_v4.patch, HBASE-14178_v5.patch, HBASE-14178_v6.patch, HBASE-14178_v7.patch, jstack My regionserver blocks, and all client RPCs time out. I printed the regionserver's jstack; it seems a lot of threads were blocked waiting for the offsetLock. Detailed information below. PS: my table's block cache is off.
{code}
"B.DefaultRpcServer.handler=2,queue=2,port=60020" #82 daemon prio=5 os_prio=0 tid=0x01827000 nid=0x2cdc in Object.wait() [0x7f3831b72000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.hadoop.hbase.util.IdLock.getLockEntry(IdLock.java:79)
        - locked 0x000773af7c18 (a org.apache.hadoop.hbase.util.IdLock$Entry)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:352)
        at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:253)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:524)
        at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:572)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
        at org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:313)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:269)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:695)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683)
        at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:140)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:3889)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:3969)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:3847)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3820)
        - locked 0x0005e5c55ad0 (a org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:3807)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4779)
        at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:4753)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:2916)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29583)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2027)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - 0x0005e5c55c08 (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
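To make the blockage concrete: every handler above is parked in IdLock.getLockEntry, because HFileReaderV2.readBlock serializes all readers of the same block offset through a per-offset IdLock. With the block cache off there is never a cached copy to short-circuit the wait, so every Get of a hot block queues behind a single HDFS read. A minimal sketch of the pattern (getLockEntry/releaseLockEntry are the real IdLock methods from the trace; everything else is an illustrative stand-in, not the actual HFileReaderV2 code):
{code:title=OffsetLockSketch.java}
import java.io.IOException;
import org.apache.hadoop.hbase.util.IdLock;

public class OffsetLockSketch {
  private final IdLock offsetLock = new IdLock();

  byte[] readBlock(long offset) throws IOException {
    // One thread per offset proceeds; all others park here, which is the
    // IdLock.getLockEntry frame that dominates the jstack above.
    IdLock.Entry lockEntry = offsetLock.getLockEntry(offset);
    try {
      // With the block cache disabled there is no cached block to return,
      // so every caller of a hot offset serializes on this slow read.
      return readBlockFromHdfs(offset);
    } finally {
      offsetLock.releaseLockEntry(lockEntry);
    }
  }

  private byte[] readBlockFromHdfs(long offset) throws IOException {
    return new byte[0]; // stand-in for the real HDFS read
  }
}
{code}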
[jira] [Updated] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14178: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Thanks [~chenheng] for the patch (and for the quick turnaround at the end, even though I had to make some revert commits...). Thanks to everyone who helped review the patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659728#comment-14659728 ] Duo Zhang edited comment on HBASE-14178 at 8/6/15 9:23 AM: --- Pushed to all branches. Thanks [~chenheng] for the patch (and for the quick turnaround at the end, even though I had to make some revert commits...). Thanks to everyone who helped review the patch. was (Author: apache9): Thanks [~chenheng] for the patch (and for the quick turnaround at the end, even though I had to make some revert commits...). Thanks to everyone who helped review the patch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659681#comment-14659681 ] Duo Zhang commented on HBASE-14178: --- [~chenheng] Could you please also prepare patches for the branches other than master and 0.98? Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14178) regionserver blocks because of waiting for offsetLock
[ https://issues.apache.org/jira/browse/HBASE-14178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14659274#comment-14659274 ] Duo Zhang commented on HBASE-14178: --- [~anoopsamjohn] Any concerns about patch v6? Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629735#comment-14629735 ] Duo Zhang commented on HBASE-14100: --- [~ram_krish] Fine. Is HBASE-12213 almost done? If not, I think we could remove it first in this issue. Fix high priority findbugs warnings --- Key: HBASE-14100 URL: https://issues.apache.org/jira/browse/HBASE-14100 Project: HBase Issue Type: Bug Reporter: Duo Zhang Assignee: Duo Zhang See here: https://builds.apache.org/job/HBase-TRUNK/6654/findbugsResult/HIGH/ We have 6 high priority findbugs warnings. A high priority findbugs warning is usually a bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629695#comment-14629695 ] Duo Zhang commented on HBASE-14100: ---
{code:title=WALSplitter.java}
for (WriterThread t : writerThreads) {
  t.finish();
}
if (interrupt) {
  for (WriterThread t : writerThreads) {
    t.interrupt(); // interrupt the writer threads. We are stopping now.
  }
}

for (WriterThread t : writerThreads) {
  if (!progress_failed && reporter != null && !reporter.progress()) {
    progress_failed = true;
  }
  try {
    t.join();
  } catch (InterruptedException ie) {
    IOException iie = new InterruptedIOException();
    iie.initCause(ie);
    throw iie;
  }
}
controller.checkForErrors();
LOG.info((this.writerThreads == null ? 0 : this.writerThreads.size()) + // here we do a null check, but the code above uses writerThreads directly without a null check
    " split writers finished; closing...");
{code}
And here is the definition of writerThreads:
{code:title=WALSplitter.java}
protected final List<WriterThread> writerThreads = Lists.newArrayList();
{code}
It is initialized at construction and marked final, so I think we could just remove the null check? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629672#comment-14629672 ] Duo Zhang commented on HBASE-14100: --- {{DMI_INVOKING_TOSTRING_ON_ARRAY}}
{code:title=SplitLogManager.java}
FileStatus[] files = fs.listStatus(logDir);
if (files != null && files.length > 0) {
  LOG.warn("Returning success without actually splitting and "
      + "deleting all the log files in path " + logDir + ": " + files, ioe); // files is an array, should we use Arrays.toString here? Is it too noisy?
} else {
  LOG.warn("Unable to delete log src dir. Ignoring. " + logDir, ioe);
}
{code}
{code:title=SimpleRegionNormalizer.java}
LOG.debug("Table " + table + ", largest region "
    + largestRegion.getFirst().getRegionName() + " has size " // Should be getRegionNameAsString()?
    + largestRegion.getSecond() + ", more than 2 times than avg size, splitting");
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
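For reference, a self-contained illustration of what {{DMI_INVOKING_TOSTRING_ON_ARRAY}} flags and the {{Arrays.toString}} fix suggested above (plain Java, not the committed patch):
{code:title=ArrayLoggingSketch.java}
import java.util.Arrays;

public class ArrayLoggingSketch {
  public static void main(String[] args) {
    String[] files = { "wal.0001", "wal.0002" };
    // The default toString() of an array prints a type/hash marker such as
    // "[Ljava.lang.String;@6d06d69c", which is what findbugs complains about.
    System.out.println("files: " + files);
    // Arrays.toString renders the contents: "files: [wal.0001, wal.0002]".
    System.out.println("files: " + Arrays.toString(files));
  }
}
{code}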
[jira] [Updated] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14100: -- Status: Patch Available (was: Open) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14100: -- Attachment: HBASE-14100.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14100) Fix high priority findbugs warnings
Duo Zhang created HBASE-14100: - Summary: Fix high priority findbugs warnings Key: HBASE-14100 URL: https://issues.apache.org/jira/browse/HBASE-14100 Project: HBase Issue Type: Bug Reporter: Duo Zhang Assignee: Duo Zhang See here: https://builds.apache.org/job/HBase-TRUNK/6654/findbugsResult/HIGH/ We have 6 high priority findbugs warnings. A high priority findbugs warning is usually a bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629690#comment-14629690 ] Duo Zhang commented on HBASE-14100: --- {{IL_INFINITE_RECURSIVE_LOOP}}
{code:title=BloomFilterUtil.java}
/** Should only be used in tests */
public static boolean contains(byte[] buf, int offset, int length, ByteBuffer bloom) { // no reference to this method, so just remove it?
  return contains(buf, offset, length, bloom);
}
{code}
{code:title=RateLimiter.java}
// This method is for strictly testing purpose only
@VisibleForTesting
public void setNextRefillTime(long nextRefillTime) {
  this.setNextRefillTime(nextRefillTime);
}

public long getNextRefillTime() {
  return this.getNextRefillTime();
}
{code}
These two methods are both overridden by subclasses, so there is no actual problem right now. But I do not think it is a good idea to leave such confusing code here. Just mark the methods as abstract? Or leave the implementations empty? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
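A self-contained illustration of why {{IL_INFINITE_RECURSIVE_LOOP}} fires on such methods, with the abstract-method fix suggested above (the class names are simplified stand-ins, not the real RateLimiter):
{code:title=RecursionSketch.java}
public class RecursionSketch {
  abstract static class BrokenLimiter {
    public void setNextRefillTime(long nextRefillTime) {
      // Calls itself instead of doing any work: any direct invocation
      // recurses until StackOverflowError, which is exactly the warning.
      this.setNextRefillTime(nextRefillTime);
    }
  }

  abstract static class FixedLimiter {
    // The fix proposed in the comment above: no self-calling default body,
    // subclasses are forced to provide the real implementation.
    public abstract void setNextRefillTime(long nextRefillTime);
  }
}
{code}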
[jira] [Commented] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14630679#comment-14630679 ] Duo Zhang commented on HBASE-14100: --- {{BloomFilterUtil}} is master-only. {{RateLimiter}} was introduced after branch-1.1. And the logging changes were introduced after branch-1.2 or branch-1. So branch-1.0 and 0.98 are not patched. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HBASE-14113) Compile failed on master
[ https://issues.apache.org/jira/browse/HBASE-14113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang resolved HBASE-14113. --- Resolution: Cannot Reproduce Fixed by running {{mvn clean}} first. Compile failed on master Key: HBASE-14113 URL: https://issues.apache.org/jira/browse/HBASE-14113 Project: HBase Issue Type: Bug Components: build Reporter: Heng Chen {{mvn package -DskipTests}} Compile failed!
{code}
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.2:compile (default-compile) on project hbase-client: Compilation failure: Compilation failure:
[ERROR] /Users/chenheng/apache/hbase/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/BinaryComparator.java:[54,3] method does not override or implement a method from a supertype
[ERROR] /Users/chenheng/apache/hbase/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/NullComparator.java:[64,3] method does not override or implement a method from a supertype
[ERROR] /Users/chenheng/apache/hbase/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/LongComparator.java:[51,5] method does not override or implement a method from a supertype
[ERROR] /Users/chenheng/apache/hbase/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/BitComparator.java:[137,3] method does not override or implement a method from a supertype
[ERROR] /Users/chenheng/apache/hbase/hbase-client/src/main/java/org/apache/hadoop/hbase/filter/BinaryPrefixComparator.java:[56,3] method does not override or implement a method from a supertype
{code}
My java version:
{code}
java version "1.8.0_40"
Java(TM) SE Runtime Environment (build 1.8.0_40-b25)
Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14100) Fix high priority findbugs warnings
[ https://issues.apache.org/jira/browse/HBASE-14100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14100: -- Resolution: Fixed Fix Version/s: 1.3.0 1.1.2 1.2.0 2.0.0 Status: Resolved (was: Patch Available) Pushed to branch-1.1+. Thanks [~apurtell] and [~ram_krish]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14059) We should add a RS to the dead servers list if admin calls fail more than a threshold
[ https://issues.apache.org/jira/browse/HBASE-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14633048#comment-14633048 ] Duo Zhang commented on HBASE-14059: --- {quote} In the issue I ran into it was a bad region causing the RS to be blocked for a long time. {quote} More details? What does 'bad' mean? And you said the region came back to normal when you killed the RS, so I think there may be another bug? In general, I agree with you: we should offline a RS if admin calls always fail. But it should be used to fix a 'bad' RS, not a 'bad' region. If there is a 'bad' region that cannot be fixed by reassignment, then as [~chenheng] said, the 'bad' region will kill all the regionservers in your cluster... Thanks. We should add a RS to the dead servers list if admin calls fail more than a threshold - Key: HBASE-14059 URL: https://issues.apache.org/jira/browse/HBASE-14059 Project: HBase Issue Type: Bug Components: master, regionserver, rpc Affects Versions: 0.98.13 Reporter: Esteban Gutierrez Assignee: Esteban Gutierrez Priority: Critical I ran into this problem twice this week: calls from the HBase master to a RS can time out since the RS call queue size has been maxed out; however, since the RS is not dead (ephemeral znode still present), the master keeps attempting to perform admin tasks like trying to open or close a region, but those operations eventually fail after we run out of retries or the assignment manager attempts to re-assign to other RSs. As side effects of this, I've noticed master operations being fully blocked, or RITs, since we cannot close the region and open it in a new location while the RS is not dead. A potential solution for this is to add the RS to the list of dead RSs after a certain number of calls from the master to the RS fail. I've only noticed the problem in 0.98.x, but it should be present in all versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14126) I'm using play framework, creating a sample Hbase project. And I keep getting this error: tried to access method com.google.common.base.Stopwatch.<init>()V from class
[ https://issues.apache.org/jira/browse/HBASE-14126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14634301#comment-14634301 ] Duo Zhang commented on HBASE-14126: --- HBase uses an early version of guava. See this: https://github.com/google/guava/commit/fd0cbc2c5c90e85fb22c8e86ea19630032090943 The constructors were changed to package-private in version 17 or 18. The shaded hbase-client may be a solution: http://mvnrepository.com/artifact/org.apache.hbase/hbase-shaded-client But it seems it is 1.1-only. So if you still want to use hbase 1.0, you could try explicitly adding a guava dependency with a version lower than 17 in your pom.xml (see the sketch after this message). And finally, this is an hbase usage question. Next time please send email to the hbase-user or hbase-dev mailing list first. If we confirm something could be done in hbase then we could open an issue here. Thanks. I'm using play framework, creating a sample Hbase project. And I keep getting this error: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.hbase.zookeeper.MetaTableLocator - Key: HBASE-14126 URL: https://issues.apache.org/jira/browse/HBASE-14126 Project: HBase Issue Type: Task Environment: Ubuntu; Play Framework 2.4.2; Hbase 1.0 Reporter: Lesley Cheung Assignee: Lesley Cheung The simple Hbase project works in Maven. But when I use play framework, it keeps showing this error. I modified the lib dependency many times, but I just can't eliminate this error. Someone please help me!
java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.hbase.zookeeper.MetaTableLocator
        at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:434)
        at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:60)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1123)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1110)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1262)
        at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1126)
        at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:369)
        at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:320)
        at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
        at org.apache.hadoop.hbase.client.BufferedMutatorImpl.flush(BufferedMutatorImpl.java:183)
        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1496)
        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:1119)
        at utils.BigQueue.run(BigQueue.java:68)
        at akka.actor.LightArrayRevolverScheduler$$anon$2$$anon$1.run(Scheduler.scala:242)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
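For example, an explicit pin along these lines might work (the exact version shown is an assumption, not a tested recommendation; any guava release below 17 still has the public Stopwatch constructor that MetaTableLocator calls):
{code:title=pom.xml}
<!-- Illustrative only: pin guava below 17 so that
     com.google.common.base.Stopwatch.<init>() is still public. -->
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>16.0.1</version>
</dependency>
{code}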
[jira] [Commented] (HBASE-14004) [Replication] Inconsistency between Memstore and WAL may result in data in remote cluster that is not in the origin
[ https://issues.apache.org/jira/browse/HBASE-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977769#comment-14977769 ] Duo Zhang commented on HBASE-14004: --- The problem here does not only affect replication. {quote} As a result, the handler will rollback the Memstore and the later flushed HFile will also skip this record. {quote} What if the regionserver crashes before flushing the HFile? I think the record will come back, since it has already been persisted in the WAL. Adding a marker may be a solution, but you need to check the marker everywhere when replaying the WAL, and you still need to deal with failures when placing the marker... I do not think it is easy to do... The basic problem here is that we may have an inconsistency between the memstore and the WAL when we fail to sync the WAL. A simple solution is killing the regionserver when we fail to sync the WAL, which means we never roll back the memstore but instead reconstruct it from the WAL (a sketch follows this message). That way we can make sure there is no difference between the memstore and the WAL. If we want to keep the regionserver alive when syncing fails, then I think we need to find out the real result of the sync operation. Maybe we could close the WAL file and check its length? Of course, if we have lost the connection to the namenode, I think there is no simple solution other than killing the regionserver... Thanks.
> [Replication] Inconsistency between Memstore and WAL may result in data in remote cluster that is not in the origin
> ---
>
> Key: HBASE-14004
> URL: https://issues.apache.org/jira/browse/HBASE-14004
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Reporter: He Liangliang
> Priority: Critical
> Labels: replication, wal
>
> Looks like the current write path can cause inconsistency between memstore/hfile and WAL which causes the slave cluster to have more data than the master cluster.
> The simplified write path looks like:
> 1. insert record into Memstore
> 2. write record to WAL
> 3. sync WAL
> 4. rollback Memstore if 3 fails
> It's possible that the HDFS sync RPC call fails, but the data is already (maybe partially) transported to the DNs, which finally gets persisted. As a result, the handler will roll back the Memstore and the later flushed HFile will also skip this record.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
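A minimal sketch of the 'kill the regionserver' option described above (all names here are hypothetical stand-ins, not the actual HBase write path):
{code:title=SyncOrAbortSketch.java}
import java.io.IOException;

public class SyncOrAbortSketch {
  interface Wal {
    void sync(long txid) throws IOException;
  }

  private final Wal wal;

  SyncOrAbortSketch(Wal wal) {
    this.wal = wal;
  }

  void syncOrDie(long txid) {
    try {
      wal.sync(txid);
    } catch (IOException e) {
      // Do NOT roll back the memstore: the hflush may have succeeded on the
      // datanodes even though the RPC failed at the client, so the edit can
      // already be durable in the WAL. Abort and rebuild the memstore from
      // the WAL instead, so the two can never diverge.
      abort("WAL sync failed; memstore and WAL may diverge", e);
    }
  }

  private void abort(String why, Throwable cause) {
    // Stand-in for HRegionServer.abort(String, Throwable).
    throw new RuntimeException(why, cause);
  }
}
{code}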
[jira] [Commented] (HBASE-14004) [Replication] Inconsistency between Memstore and WAL may result in data in remote cluster that is not in the origin
[ https://issues.apache.org/jira/browse/HBASE-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978108#comment-14978108 ] Duo Zhang commented on HBASE-14004: --- [~carp84] This is an inherent problem of RPC-based systems due to temporary network failures. HBase uses {{hflush}} to sync the WAL. I do not know the details of whether hflush will ask the namenode to update the length, but in any case, the last RPC call could fail at the client side but succeed at the server side (a network failure while writing the return value back). And sure, this should be a bug in HBase. I checked the code: if an exception is thrown from hflush, {{FSHLog.SyncRunner}} simply passes it to the upper layer. So it could happen that the hflush succeeded in HDFS, but HBase thinks it failed, which causes the inconsistency. I think we need to find a way to make sure whether the WAL is actually persisted in HDFS. And if DFSClient already retries, then I think killing the regionserver is enough? Any suggestions [~carp84]? Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12959) Compact never end when table's dataBlockEncoding using PREFIX_TREE
[ https://issues.apache.org/jira/browse/HBASE-12959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14997903#comment-14997903 ] Duo Zhang commented on HBASE-12959: --- I sent an email to [~bdifn] a long time ago to confirm whether HBASE-12817 solves this problem, but I have not received a reply yet... And PREFIX_TREE encoding has been used widely in our project for a long time; I haven't found any other problems since HBASE-12817. Hope this could help. [~wang_yongqiang]
> Compact never end when table's dataBlockEncoding using PREFIX_TREE
> ---
>
> Key: HBASE-12959
> URL: https://issues.apache.org/jira/browse/HBASE-12959
> Project: HBase
> Issue Type: Bug
> Components: hbase
> Affects Versions: 0.98.7
> Environment: hbase 0.98.7
> hadoop 2.5.1
> Reporter: wuchengzhi
> Priority: Critical
> Attachments: PrefixTreeCompact.java, txtfile-part1.txt.gz, txtfile-part2.txt.gz, txtfile-part4.txt.gz, txtfile-part5.txt.gz, txtfile-part6.txt.gz, txtfile-part7.txt.gz
>
> I upgraded the hbase from 0.96.1.1 to 0.98.7 and hadoop from 2.2.0 to 2.5.1, some table encoding using prefix-tree was abnormal for compacting, the gui shows the table's Compaction status is MAJOR_AND_MINOR(MAJOR) all the time.
> in the regionserver dump, there are some logs as below:
> Tasks:
> ===
> Task: Compacting info in PREFIX_NOT_COMPACT,,1421954285670.41ef60e2c221772626e141d5080296c5.
> Status: RUNNING:Compacting store info
> Running for 1097s (on the site running more than 3 days)
>
> Thread 197 (regionserver60020-smallCompactions-1421954341530):
> State: RUNNABLE
> Blocked count: 7
> Waited count: 3
> Stack:
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.followFan(PrefixTreeArrayScanner.java:329)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArraySearcher.positionAtOrAfter(PrefixTreeArraySearcher.java:149)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArraySearcher.seekForwardToOrAfter(PrefixTreeArraySearcher.java:183)
> org.apache.hadoop.hbase.codec.prefixtree.PrefixTreeSeeker.seekToOrBeforeUsingPositionAtOrAfter(PrefixTreeSeeker.java:199)
> org.apache.hadoop.hbase.codec.prefixtree.PrefixTreeSeeker.seekToKeyInBlock(PrefixTreeSeeker.java:162)
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.loadBlockAndSeekToKey(HFileReaderV2.java:1172)
> org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:573)
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:257)
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:173)
> org.apache.hadoop.hbase.regionserver.NonLazyKeyValueScanner.doRealSeek(NonLazyKeyValueScanner.java:55)
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:313)
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.reseek(KeyValueHeap.java:257)
> org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:697)
> org.apache.hadoop.hbase.regionserver.StoreScanner.seekAsDirection(StoreScanner.java:683)
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:533)
> org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:222)
> org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:77)
> org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:110)
> org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1099)
> org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1482)
> Thread 177 (regionserver60020-smallCompactions-1421954314809):
> State: RUNNABLE
> Blocked count: 40
> Waited count: 60
> Stack:
> org.apache.hadoop.hbase.codec.prefixtree.decode.column.ColumnReader.populateBuffer(ColumnReader.java:81)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.populateQualifier(PrefixTreeArrayScanner.java:471)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.populateNonRowFields(PrefixTreeArrayScanner.java:452)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.nextRow(PrefixTreeArrayScanner.java:226)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.advance(PrefixTreeArrayScanner.java:208)
> org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArraySearcher.positionAtQualifierTimestamp(PrefixTreeArraySearcher.java:244)
[jira] [Commented] (HBASE-12959) Compact never end when table's dataBlockEncoding using PREFIX_TREE
[ https://issues.apache.org/jira/browse/HBASE-12959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998232#comment-14998232 ] Duo Zhang commented on HBASE-12959: --- No, I haven't met this problem yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only
[ https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003595#comment-15003595 ] Duo Zhang commented on HBASE-14790: --- HTTP/2 has its own problems: we haven't finished the read path yet, and the write protocol is much more complex than the read protocol... And also, if we plan to do it in HDFS, I think we should make it more general, and it is better to design a good event-driven FileSystem interface at the beginning. I do not think either of them is easy... So my plan is to implement a simple version in HBase first, compatible only with hadoop 2.x, make sure it has some benefits, and actually ship it with HBase. Then we could start implementing a more general and more powerful event-driven FileSystem in HDFS. When the new FileSystem is out, we could move HBase to it and drop the old simple version. What do you think? [~wheat9] Thanks.
> Implement a new DFSOutputStream for logging WAL only
> ---
>
> Key: HBASE-14790
> URL: https://issues.apache.org/jira/browse/HBASE-14790
> Project: HBase
> Issue Type: Improvement
> Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all purposes. But in fact, we do not need most of the features if we only want to log WAL. For example, we do not need pipeline recovery since we could just close the old logger and open a new one. And also, we do not need to write multiple blocks since we could also open a new logger if the old file is too large.
> And the most important thing is that it is hard to handle all the corner cases to avoid data loss or data inconsistency (such as HBASE-14004) when using the original DFSOutputStream, due to its complicated logic. And the complicated logic also forces us to use some magical tricks to increase performance. For example, we need to use multiple threads to call {{hflush}} when logging, and now we use 5 threads. But why 5, not 10 or 100?
> So here, I propose we should implement our own {{DFSOutputStream}} when logging WAL. For correctness, and also for performance.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14806) Missing sources.jar for several modules when building HBase
[ https://issues.apache.org/jira/browse/HBASE-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14806: -- Attachment: HBASE-14806.patch It seems only hbase-common and hbase-external-blockcache are missing sources.jar. I modified the pom.xml, but I'm not sure whether the output sources.jar contains the correct LICENSE files. [~busbey] Could you please help verify the sources.jar generated with this patch? Thanks.
> Missing sources.jar for several modules when building HBase
> ---
>
> Key: HBASE-14806
> URL: https://issues.apache.org/jira/browse/HBASE-14806
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.0.0
> Reporter: Duo Zhang
> Attachments: HBASE-14806.patch
>
> Introduced by HBASE-14085. The problem is, for example, in hbase-common/pom.xml, we have
> {code:title=pom.xml}
> <plugin>
>   <groupId>org.apache.maven.plugins</groupId>
>   <artifactId>maven-source-plugin</artifactId>
>   <configuration>
>     <excludeResources>true</excludeResources>
>     <includes>
>       <include>src/main/java</include>
>       <include>${project.build.outputDirectory}/META-INF</include>
>     </includes>
>   </configuration>
> </plugin>
> {code}
> But in fact, the path inside the {{include}} tag is relative to the source directories, not the project directory. So the maven-source-plugin always ends with
> {noformat}
> No sources in project. Archive not created.
> {noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
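To make the failure mode concrete: maven-source-plugin evaluates {{include}} patterns relative to each source root, not the project directory, so a configuration whose pattern can actually match something looks roughly like this (an illustration of the semantics, not the committed HBASE-14806 patch):
{code:title=pom.xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-source-plugin</artifactId>
  <configuration>
    <includes>
      <!-- Matched against paths under each source directory (e.g.
           src/main/java), so the literal pattern "src/main/java" never
           matches anything, while a source-relative glob like this does. -->
      <include>**/*.java</include>
    </includes>
  </configuration>
</plugin>
{code}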
[jira] [Updated] (HBASE-14806) Missing sources.jar for several modules when building HBase
[ https://issues.apache.org/jira/browse/HBASE-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14806: -- Assignee: Duo Zhang Affects Version/s: 2.0.0 Status: Patch Available (was: Open) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14806) Missing sources.jar for several modules when building HBase
Duo Zhang created HBASE-14806: - Summary: Missing sources.jar for several modules when building HBase Key: HBASE-14806 URL: https://issues.apache.org/jira/browse/HBASE-14806 Project: HBase Issue Type: Bug Reporter: Duo Zhang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only
[ https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003469#comment-15003469 ] Duo Zhang commented on HBASE-14790: --- The new implementation will be event-driven, which means we can use a callback on sync, so we do not need multiple sync threads any more (a sketch follows this message). So I think it is a big project if we want to implement this in HDFS, since it introduces a new style of API. Even if we only implement a new writer first, I think it is still very important to design a good general interface up front. So I think we could implement this in HBase first and collect some perf results. Then we could start talking about moving it into HDFS. Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
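A sketch of what the event-driven, callback-based sync could look like from the caller's side (pure illustration; none of these interfaces exist in HBase or HDFS):
{code:title=AsyncWalSketch.java}
import java.util.concurrent.CompletableFuture;

public class AsyncWalSketch {
  interface AsyncWalWriter {
    CompletableFuture<Long> append(byte[] entry); // completes with a sequence id
    CompletableFuture<Void> sync();               // completed by the I/O event loop
  }

  void logAndAck(AsyncWalWriter wal, byte[] entry, Runnable ackClient) {
    wal.append(entry)
        .thenCompose(seqId -> wal.sync())
        // Runs when the event loop reports the sync complete; no pool of
        // sync threads is parked inside hflush in the meantime.
        .thenRun(ackClient);
  }
}
{code}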
[jira] [Commented] (HBASE-14806) Missing sources.jar for several modules when building HBase
[ https://issues.apache.org/jira/browse/HBASE-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005116#comment-15005116 ] Duo Zhang commented on HBASE-14806: --- The META-INF directory in the generated sources.jar contains the following files. Enough? [~busbey]
{noformat}
" zip.vim version v27
" Browsing zipfile /home/zhangduo/hbase/code/hbase-common/target/hbase-common-2.0.0-SNAPSHOT-sources.jar
" Select a file with cursor and press ENTER

META-INF/
META-INF/MANIFEST.MF
META-INF/LICENSE
META-INF/NOTICE
META-INF/DEPENDENCIES

" zip.vim version v27
" Browsing zipfile /home/zhangduo/hbase/code/hbase-external-blockcache/target/hbase-external-blockcache-2.0.0-SNAPSHOT-sources.jar
" Select a file with cursor and press ENTER

META-INF/
META-INF/MANIFEST.MF
META-INF/LICENSE
META-INF/NOTICE
META-INF/DEPENDENCIES
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14806) Missing sources.jar for several modules when building HBase
[ https://issues.apache.org/jira/browse/HBASE-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14806: -- Affects Version/s: 1.1.3 1.3.0 1.2.0 1.0.3 0.98.16 Fix Version/s: 1.0.4 0.98.17 1.1.4 1.3.0 1.2.0 2.0.0 > Missing sources.jar for several modules when building HBase > --- > > Key: HBASE-14806 > URL: https://issues.apache.org/jira/browse/HBASE-14806 > Project: HBase > Issue Type: Bug >Affects Versions: 2.0.0, 1.2.0, 1.3.0, 1.0.3, 1.1.3, 0.98.16 >Reporter: Duo Zhang >Assignee: Duo Zhang > Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.4, 0.98.17, 1.0.4 > > Attachments: HBASE-14806.patch > > > Introduced by HBASE-14085. The problem is, for example, in > hbase-common/pom.xml, we have > {code:title=pom.xml} > <plugin> > <groupId>org.apache.maven.plugins</groupId> > <artifactId>maven-source-plugin</artifactId> > <configuration> > <includePom>true</includePom> > <includes> > <include>src/main/java</include> > <include>${project.build.outputDirectory}/META-INF</include> > </includes> > </configuration> > </plugin> > {code} > But in fact, the path inside the {{<include>}} tag is relative to the source > directories, not the project directory. So the maven-source-plugin always ends > with > {noformat} > No sources in project. Archive not created. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14806) Missing sources.jar for several modules when building HBase
[ https://issues.apache.org/jira/browse/HBASE-14806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14806: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Pushed to all branches. Thanks [~busbey] for reviewing. > Missing sources.jar for several modules when building HBase > --- > > Key: HBASE-14806 > URL: https://issues.apache.org/jira/browse/HBASE-14806 > Project: HBase > Issue Type: Bug >Affects Versions: 2.0.0, 1.2.0, 1.3.0, 1.0.3, 1.1.3, 0.98.16 >Reporter: Duo Zhang >Assignee: Duo Zhang > Fix For: 2.0.0, 1.2.0, 1.3.0, 1.1.4, 0.98.17, 1.0.4 > > Attachments: HBASE-14806.patch > > > Introduced by HBASE-14085. The problem is, for example, in > hbase-common/pom.xml, we have > {code:title=pom.xml} > <plugin> > <groupId>org.apache.maven.plugins</groupId> > <artifactId>maven-source-plugin</artifactId> > <configuration> > <includePom>true</includePom> > <includes> > <include>src/main/java</include> > <include>${project.build.outputDirectory}/META-INF</include> > </includes> > </configuration> > </plugin> > {code} > But in fact, the path inside the {{<include>}} tag is relative to the source > directories, not the project directory. So the maven-source-plugin always ends > with > {noformat} > No sources in project. Archive not created. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only
[ https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998397#comment-14998397 ] Duo Zhang commented on HBASE-14790: --- The root cause of HBASE-14004 is that HBase and HDFS may not reach an agreement on the length of the WAL file. {{DFSOutputStream}} itself does a lot of work to repair the inconsistency when an error occurs, but if all of those approaches fail, we have no idea what to do next... In fact, the simplest way to reach an agreement on file length is to close the file, and I think that is enough for logging the WAL. So a simple solution is: when logging fails, try closing the file (just make a call to the namenode with the confirmed length); if that still fails, retry until it succeeds, and set a flag to tell the upper layer the WAL is broken so we can reject write requests immediately to prevent OOM. Thanks. > Implement a new DFSOutputStream for logging WAL only > > > Key: HBASE-14790 > URL: https://issues.apache.org/jira/browse/HBASE-14790 > Project: HBase > Issue Type: Improvement >Reporter: Duo Zhang > > The original {{DFSOutputStream}} is very powerful and aims to serve all > purposes. But in fact, we do not need most of the features if we only want to > log the WAL. For example, we do not need pipeline recovery since we could just > close the old logger and open a new one. And also, we do not need to write > multiple blocks since we could also open a new logger if the old file is too > large. > And the most important thing is that it is hard to handle all the corner > cases to avoid data loss or data inconsistency (such as HBASE-14004) when > using the original DFSOutputStream due to its complicated logic. And the > complicated logic also forces us to use some magical tricks to increase > performance. For example, we need to use multiple threads to call {{hflush}} > when logging, and now we use 5 threads. But why 5, not 10 or 100? > So here, I propose we should implement our own {{DFSOutputStream}} for > logging the WAL. For correctness, and also for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
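A minimal sketch of that close-on-error rule (an editor's illustration with hypothetical names, not the eventual implementation): mark the WAL broken so the upper layer rejects writes immediately, then keep retrying the close call until HBase and HDFS agree on the file length:
{code:title=BrokenWALRecovery.java (illustration only)}
import java.io.IOException;

// All names here are hypothetical stand-ins, not HBase or HDFS APIs.
class BrokenWALRecovery {

  /** Stand-in for completing the file at the namenode with a known length. */
  interface FileCompleter {
    void completeFile(long confirmedLength) throws IOException;
  }

  private volatile boolean walBroken; // upper layer fails writes fast on this
  private final long retryIntervalMs = 1000;

  boolean isWalBroken() {
    return walBroken;
  }

  /** On a logging error: flag the WAL broken, then retry close until it succeeds. */
  void recoverOnError(FileCompleter completer, long confirmedLength)
      throws InterruptedException {
    walBroken = true; // reject incoming writes immediately to prevent OOM
    while (true) {
      try {
        completer.completeFile(confirmedLength); // agree on length by closing
        return;
      } catch (IOException e) {
        Thread.sleep(retryIntervalMs); // retry forever, as described above
      }
    }
  }
}
{code}
Whether to retry forever here or give up after a deadline and abort the regionserver is exactly the trade-off discussed in the next comment.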
[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only
[ https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998407#comment-14998407 ] Duo Zhang commented on HBASE-14790: --- {quote} As for this, IMO we can set a limit, if exceed the limits, we can think HDFS is broken and close the RS directly. wdyt? {quote} Agree. But the timeout value should be carefully chosen. > Implement a new DFSOutputStream for logging WAL only > > > Key: HBASE-14790 > URL: https://issues.apache.org/jira/browse/HBASE-14790 > Project: HBase > Issue Type: Improvement >Reporter: Duo Zhang > > The original {{DFSOutputStream}} is very powerful and aims to serve all > purposes. But in fact, we do not need most of the features if we only want to > log WAL. For example, we do not need pipeline recovery since we could just > close the old logger and open a new one. And also, we do not need to write > multiple blocks since we could also open a new logger if the old file is too > large. > And the most important thing is that, it is hard to handle all the corner > cases to avoid data loss or data inconsistency(such as HBASE-14004) when > using original DFSOutputStream due to its complicated logic. And the > complicated logic also force us to use some magical tricks to increase > performance. For example, we need to use multiple threads to call {{hflush}} > when logging, and now we use 5 threads. But why 5 not 10 or 100? > So here, I propose we should implement our own {{DFSOutputStream}} when > logging WAL. For correctness, and also for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only
[ https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1417#comment-1417 ] Duo Zhang commented on HBASE-14790: --- {{DFSOutputStream}} has pipeline recovery, so it is hard to say what the actual state is if the recovery fails, and its {{close}} method also does a lot more than just call {{completeFile}}. So it is hard to control the behavior exactly. Thanks. > Implement a new DFSOutputStream for logging WAL only > > > Key: HBASE-14790 > URL: https://issues.apache.org/jira/browse/HBASE-14790 > Project: HBase > Issue Type: Improvement >Reporter: Duo Zhang > > The original {{DFSOutputStream}} is very powerful and aims to serve all > purposes. But in fact, we do not need most of the features if we only want to > log the WAL. For example, we do not need pipeline recovery since we could just > close the old logger and open a new one. And also, we do not need to write > multiple blocks since we could also open a new logger if the old file is too > large. > And the most important thing is that it is hard to handle all the corner > cases to avoid data loss or data inconsistency (such as HBASE-14004) when > using the original DFSOutputStream due to its complicated logic. And the > complicated logic also forces us to use some magical tricks to increase > performance. For example, we need to use multiple threads to call {{hflush}} > when logging, and now we use 5 threads. But why 5, not 10 or 100? > So here, I propose we should implement our own {{DFSOutputStream}} for > logging the WAL. For correctness, and also for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14790) Implement a new DFSOutputStream for logging WAL only
Duo Zhang created HBASE-14790: - Summary: Implement a new DFSOutputStream for logging WAL only Key: HBASE-14790 URL: https://issues.apache.org/jira/browse/HBASE-14790 Project: HBase Issue Type: Improvement Reporter: Duo Zhang The original {{DFSOutputStream}} is very powerful and aims to serve all purposes. But in fact, we do not need most of the features if we only want to log the WAL. For example, we do not need pipeline recovery since we could just close the old logger and open a new one. And also, we do not need to write multiple blocks since we could also open a new logger if the old file is too large. And the most important thing is that it is hard to handle all the corner cases to avoid data loss or data inconsistency (such as HBASE-14004) when using the original DFSOutputStream due to its complicated logic. And the complicated logic also forces us to use some magical tricks to increase performance. For example, we need to use multiple threads to call {{hflush}} when logging, and now we use 5 threads. But why 5, not 10 or 100? So here, I propose we should implement our own {{DFSOutputStream}} for logging the WAL. For correctness, and also for performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990744#comment-14990744 ] Duo Zhang commented on HBASE-14759: --- {code} this.syncRunnerIndex = (this.syncRunnerIndex + 1) % this.syncRunners.length; {code} I think this will update syncRunnerIndex? [~enis] > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Reporter: Duo Zhang >Assignee: Duo Zhang > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass Integer.MIN_VALUE in since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running for enough time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
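To make the failure mode concrete, a small self-contained demo (an editor's illustration, not HBase code) of why the {{Math.abs}}-based selection can produce a negative index, and how the bounded increment quoted above avoids overflow entirely:
{code:title=SyncRunnerIndexDemo.java (illustration only)}
public class SyncRunnerIndexDemo {
  public static void main(String[] args) {
    int numRunners = 5;

    // Buggy pattern: the counter eventually wraps past Integer.MAX_VALUE,
    // and Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE, so the
    // modulo can go negative -> ArrayIndexOutOfBoundsException.
    int counter = Integer.MAX_VALUE;
    counter++; // overflows to Integer.MIN_VALUE
    System.out.println(Math.abs(counter) % numRunners); // prints -3

    // Fixed pattern, matching the line quoted above: the stored index stays
    // inside [0, numRunners), so it can never overflow or go negative.
    int syncRunnerIndex = 0;
    syncRunnerIndex = (syncRunnerIndex + 1) % numRunners;
    System.out.println(syncRunnerIndex); // prints 1
    // Math.floorMod(counter, numRunners) would also stay non-negative.
  }
}
{code}
The -3 printed here matches the ArrayIndexOutOfBoundsException index shown in the testcase output further down.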
[jira] [Updated] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14759: -- Affects Version/s: 1.2.0 2.0.0 1.0.2 1.1.2 Fix Version/s: 1.1.3 1.0.3 1.2.0 2.0.0 Set fix versions; 0.98 does not have this problem. > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Affects Versions: 2.0.0, 1.0.2, 1.2.0, 1.1.2 >Reporter: Duo Zhang >Assignee: Duo Zhang > Fix For: 2.0.0, 1.2.0, 1.0.3, 1.1.3 > > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass in Integer.MIN_VALUE, since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running long enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994772#comment-14994772 ] Duo Zhang commented on HBASE-14759: --- OK, Thanks. Let me commit this. > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Affects Versions: 2.0.0, 1.0.2, 1.2.0, 1.1.2, 1.3.0 >Reporter: Duo Zhang >Assignee: Duo Zhang > Fix For: 2.0.0, 1.2.0, 1.3.0, 1.0.3, 1.1.3 > > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass Integer.MIN_VALUE in since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running for enough time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14759: -- Resolution: Fixed Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Pushed to 1.0+. Thanks [~yuzhih...@gmail.com] and [~enis] for reviewing. > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Affects Versions: 2.0.0, 1.0.2, 1.2.0, 1.1.2, 1.3.0 >Reporter: Duo Zhang >Assignee: Duo Zhang > Fix For: 2.0.0, 1.2.0, 1.3.0, 1.0.3, 1.1.3 > > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass Integer.MIN_VALUE in since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running for enough time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14759: -- Affects Version/s: 1.3.0 Fix Version/s: 1.3.0 > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Affects Versions: 2.0.0, 1.0.2, 1.2.0, 1.1.2, 1.3.0 >Reporter: Duo Zhang >Assignee: Duo Zhang > Fix For: 2.0.0, 1.2.0, 1.3.0, 1.0.3, 1.1.3 > > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass Integer.MIN_VALUE in since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running for enough time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
Duo Zhang created HBASE-14759: - Summary: Avoid using Math.abs when selecting SyncRunner in FSHLog Key: HBASE-14759 URL: https://issues.apache.org/jira/browse/HBASE-14759 Project: HBase Issue Type: Bug Components: wal Reporter: Duo Zhang {code:title=FSHLog.java} int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; try { this.syncRunners[index].offer(sequence, this.syncFutures, this.syncFuturesCount); } catch (Exception e) { // Should NEVER get here. requestLogRoll(); this.exception = new DamagedWALException("Failed offering sync", e); } {code} Math.abs will return Integer.MIN_VALUE if you pass in Integer.MIN_VALUE, since the actual absolute value of Integer.MIN_VALUE is out of range. I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the regionserver running long enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989229#comment-14989229 ] Duo Zhang commented on HBASE-14759: --- Added a simple testcase in TestFSHLog: {code} @Test public void testSyncRunnerIndexOverflow() throws IOException, NoSuchFieldException, SecurityException, IllegalArgumentException, IllegalAccessException { final String name = "testSyncRunnerIndexOverflow"; FSHLog log = new FSHLog(fs, FSUtils.getRootDir(conf), name, HConstants.HREGION_OLDLOGDIR_NAME, conf, null, true, null, null); Field ringBufferEventHandlerField = FSHLog.class.getDeclaredField("ringBufferEventHandler"); ringBufferEventHandlerField.setAccessible(true); FSHLog.RingBufferEventHandler ringBufferEventHandler = (FSHLog.RingBufferEventHandler) ringBufferEventHandlerField.get(log); Field syncRunnerIndexField = FSHLog.RingBufferEventHandler.class.getDeclaredField("syncRunnerIndex"); syncRunnerIndexField.setAccessible(true); syncRunnerIndexField.set(ringBufferEventHandler, Integer.MAX_VALUE); HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("t1")).addFamily(new HColumnDescriptor("row")); HRegionInfo hri = new HRegionInfo(htd.getTableName(), HConstants.EMPTY_START_ROW, HConstants.EMPTY_END_ROW); MultiVersionConcurrencyControl mvcc = new MultiVersionConcurrencyControl(); addEdits(log, hri, htd, 1, mvcc); addEdits(log, hri, htd, 1, mvcc); } {code} It fails with: {noformat} org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: On sync at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1782) at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1) at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:128) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hbase.regionserver.wal.DamagedWALException: Failed offering sync at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1776) ... 5 more Caused by: java.lang.ArrayIndexOutOfBoundsException: -3 at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1772) ... 5 more {noformat} > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Reporter: Duo Zhang > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass in Integer.MIN_VALUE, since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running long enough. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14759: -- Attachment: HBASE-14759.patch > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Reporter: Duo Zhang > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass Integer.MIN_VALUE in since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running for enough time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-14759) Avoid using Math.abs when selecting SyncRunner in FSHLog
[ https://issues.apache.org/jira/browse/HBASE-14759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duo Zhang updated HBASE-14759: -- Assignee: Duo Zhang Status: Patch Available (was: Open) > Avoid using Math.abs when selecting SyncRunner in FSHLog > > > Key: HBASE-14759 > URL: https://issues.apache.org/jira/browse/HBASE-14759 > Project: HBase > Issue Type: Bug > Components: wal >Reporter: Duo Zhang >Assignee: Duo Zhang > Attachments: HBASE-14759.patch > > > {code:title=FSHLog.java} > int index = Math.abs(this.syncRunnerIndex++) % this.syncRunners.length; > try { > this.syncRunners[index].offer(sequence, this.syncFutures, > this.syncFuturesCount); > } catch (Exception e) { > // Should NEVER get here. > requestLogRoll(); > this.exception = new DamagedWALException("Failed offering sync", > e); > } > {code} > Math.abs will return Integer.MIN_VALUE if you pass Integer.MIN_VALUE in since > the actual absolute value of Integer.MIN_VALUE is out of range. > I think {{this.syncRunnerIndex++}} will overflow eventually if we keep the > regionserver running for enough time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-14586) Use a maven profile to run Jacoco analysis
[ https://issues.apache.org/jira/browse/HBASE-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956135#comment-14956135 ] Duo Zhang commented on HBASE-14586: --- +1 > Use a maven profile to run Jacoco analysis > -- > > Key: HBASE-14586 > URL: https://issues.apache.org/jira/browse/HBASE-14586 > Project: HBase > Issue Type: Improvement >Affects Versions: 1.1.0 >Reporter: Andrew Wang >Assignee: Andrew Wang >Priority: Minor > Attachments: hbase-14586.001.patch, hbase-14586.001.patch, > hbase-14586.002.patch > > > The pom.xml has a line like this for the Surefire argLine, which has an extra > ${argLine} reference. Recommend changes like this: > {noformat} > -${hbase-surefire.argLine} ${argLine} > +${hbase-surefire.argLine} > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
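For context, a hedged sketch of the profile approach named in the title (an editor's illustration, not the attached patch): Jacoco is typically gated behind a Maven profile whose {{prepare-agent}} goal contributes the agent through the argLine property, so the main Surefire argLine no longer needs a hard-coded ${argLine} reference:
{code:title=pom.xml (illustration)}
<profile>
  <id>runJacoco</id> <!-- hypothetical profile id -->
  <build>
    <plugins>
      <plugin>
        <groupId>org.jacoco</groupId>
        <artifactId>jacoco-maven-plugin</artifactId>
        <executions>
          <!-- prepare-agent sets the argLine property so the coverage
               agent is injected into the forked test JVMs -->
          <execution>
            <id>prepare-agent</id>
            <goals><goal>prepare-agent</goal></goals>
          </execution>
          <!-- write the coverage report after the tests have run -->
          <execution>
            <id>report</id>
            <phase>verify</phase>
            <goals><goal>report</goal></goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>
{code}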
[jira] [Commented] (HBASE-13082) Coarsen StoreScanner locks to RegionScanner
[ https://issues.apache.org/jira/browse/HBASE-13082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964555#comment-14964555 ] Duo Zhang commented on HBASE-13082: --- I mean maybe we could have a fast path without locking for normal reads, since for most scan operations there is no flush involved. And if a flush happens, we change back to using a lock. I think this is something like a biased lock? > Coarsen StoreScanner locks to RegionScanner > --- > > Key: HBASE-13082 > URL: https://issues.apache.org/jira/browse/HBASE-13082 > Project: HBase > Issue Type: Bug >Reporter: Lars Hofhansl >Assignee: ramkrishna.s.vasudevan > Attachments: 13082-test.txt, 13082-v2.txt, 13082-v3.txt, > 13082-v4.txt, 13082.txt, 13082.txt, HBASE-13082.pdf, HBASE-13082_1_WIP.patch, > HBASE-13082_2_WIP.patch, HBASE-13082_3.patch, HBASE-13082_4.patch, gc.png, > gc.png, gc.png, hits.png, next.png, next.png > > > Continuing where HBASE-10015 left off. > We can avoid locking (and memory fencing) inside StoreScanner by deferring to > the lock already held by the RegionScanner. > In tests this shows quite a scan improvement and reduced CPU (the fences make > the cores wait for memory fetches). > There are some drawbacks too: > * All calls to RegionScanner need to remain synchronized > * Implementors of coprocessors need to be diligent in following the locking > contract. For example Phoenix does not lock RegionScanner.nextRaw(), as > required in the documentation (not picking on Phoenix, this one is my fault > as I told them it's OK) > * possible starving of flushes and compactions with heavy read load. > RegionScanner operations would keep getting the locks and the > flushes/compactions would not be able to finalize the set of files. > I'll have a patch soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
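A rough sketch of that fast-path idea using the JDK's {{StampedLock}} (an editor's illustration of the general technique, not HBase code or the eventual patch): readers try an optimistic, lock-free read first and only fall back to a real read lock when a concurrent flush invalidates their stamp:
{code:title=FastPathScan.java (illustration only)}
import java.util.concurrent.locks.StampedLock;

// Illustration of "fast path unless a flush intervenes"; the field below is
// a hypothetical stand-in for whatever state a flush swaps out.
class FastPathScan {
  private final StampedLock lock = new StampedLock();
  private long memstoreSnapshot; // stand-in for scanner-visible state

  long read() {
    long stamp = lock.tryOptimisticRead(); // lock-free fast path
    long value = memstoreSnapshot;
    if (!lock.validate(stamp)) { // a flush raced with this read
      stamp = lock.readLock();   // fall back to real locking
      try {
        value = memstoreSnapshot;
      } finally {
        lock.unlockRead(stamp);
      }
    }
    return value;
  }

  void flush(long newSnapshot) {
    long stamp = lock.writeLock(); // flush takes the exclusive slow path
    try {
      memstoreSnapshot = newSnapshot;
    } finally {
      lock.unlockWrite(stamp);
    }
  }
}
{code}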
[jira] [Commented] (HBASE-13082) Coarsen StoreScanner locks to RegionScanner
[ https://issues.apache.org/jira/browse/HBASE-13082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964481#comment-14964481 ] Duo Zhang commented on HBASE-13082: --- {quote} On flush, what if we let all Scans complete before letting go the snapshot? More memory pressure. Simpler implementation. {quote} How long a {{RegionScanner}} keeps running sometimes depends on the user of HBase. Think of a {{Filter}} that matches nothing: a {{RegionScanner}} could scan the whole region if that {{Filter}} is applied. So I do not think holding a ref count on the snapshot is a good idea. So I think resetting the scanner heap is necessary when there is a flush, and the problem is how to do it without expensive overhead. Maybe something like biased locking? Thanks. > Coarsen StoreScanner locks to RegionScanner > --- > > Key: HBASE-13082 > URL: https://issues.apache.org/jira/browse/HBASE-13082 > Project: HBase > Issue Type: Bug >Reporter: Lars Hofhansl >Assignee: ramkrishna.s.vasudevan > Attachments: 13082-test.txt, 13082-v2.txt, 13082-v3.txt, > 13082-v4.txt, 13082.txt, 13082.txt, HBASE-13082.pdf, HBASE-13082_1_WIP.patch, > HBASE-13082_2_WIP.patch, HBASE-13082_3.patch, HBASE-13082_4.patch, gc.png, > gc.png, gc.png, hits.png, next.png, next.png > > > Continuing where HBASE-10015 left off. > We can avoid locking (and memory fencing) inside StoreScanner by deferring to > the lock already held by the RegionScanner. > In tests this shows quite a scan improvement and reduced CPU (the fences make > the cores wait for memory fetches). > There are some drawbacks too: > * All calls to RegionScanner need to remain synchronized > * Implementors of coprocessors need to be diligent in following the locking > contract. For example Phoenix does not lock RegionScanner.nextRaw(), as > required in the documentation (not picking on Phoenix, this one is my fault > as I told them it's OK) > * possible starving of flushes and compactions with heavy read load. > RegionScanner operations would keep getting the locks and the > flushes/compactions would not be able to finalize the set of files. > I'll have a patch soon. -- This message was sent by Atlassian JIRA (v6.3.4#6332)