[jira] [Updated] (HBASE-5689) Skipping RecoveredEdits may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5689:

    Attachment: HBASE-5689v2.patch

Sorry for the late response. In patch v2 I added some documentation for getCompletedRecoveredEditsFilePath(), as requested in Stack's comment. Everything else is the same as patch v1. Please review again. Thanks!

Skipping RecoveredEdits may cause data loss
-------------------------------------------

                 Key: HBASE-5689
                 URL: https://issues.apache.org/jira/browse/HBASE-5689
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.94.0
            Reporter: chunhui shen
            Assignee: chunhui shen
            Priority: Critical
             Fix For: 0.94.0
         Attachments: 5689-testcase.patch, HBASE-5689.patch, HBASE-5689.patch, HBASE-5689v2.patch

Consider the following scenario:
1. The region is on server A.
2. Put KV (r1-v1) to the region.
3. Move the region from server A to server B.
4. Put KV (r2-v2) to the region.
5. Move the region from server B to server A.
6. Put KV (r3-v3) to the region.
7. kill -9 server B and restart it.
8. kill -9 server A and restart it.
9. Scan the region: we get only two KVs (r1-v1, r2-v2); the third KV (r3-v3) is lost.

Let's analyze the scenario in the code:
1. The edit logs of KV (r1-v1) and KV (r3-v3) are both recorded in the same hlog file on server A.
2. When we split server B's hlog file in the process of ServerShutdownHandler, we create one RecoveredEdits file f1 for the region.
3. When we split server A's hlog file in the process of ServerShutdownHandler, we create another RecoveredEdits file f2 for the region.
4. However, RecoveredEdits file f2 will be skipped when the region is initialized, in HRegion#replayRecoveredEditsIfAny:
{code}
for (Path edits: files) {
  if (edits == null || !this.fs.exists(edits)) {
    LOG.warn("Null or non-existent edits file: " + edits);
    continue;
  }
  if (isZeroLengthThenDelete(this.fs, edits)) continue;

  if (checkSafeToSkip) {
    Path higher = files.higher(edits);
    long maxSeqId = Long.MAX_VALUE;
    if (higher != null) {
      // Edit file name pattern, HLog.EDITFILES_NAME_PATTERN: "-?[0-9]+"
      String fileName = higher.getName();
      maxSeqId = Math.abs(Long.parseLong(fileName));
    }
    if (maxSeqId <= minSeqId) {
      String msg = "Maximum possible sequenceid for this log is " + maxSeqId
          + ", skipped the whole file, path=" + edits;
      LOG.debug(msg);
      continue;
    } else {
      checkSafeToSkip = false;
    }
  }
}
{code}
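For context, a hedged sketch of what the documented split-side helper could look like; the javadoc wording and exact signature are assumptions based on this issue, not the literal contents of HBASE-5689v2.patch:
{code}
/**
 * Returns the final path of a recovered.edits file. The final file name is
 * the maximum edit sequence id written into the file, so that a region
 * opening later can compare the name against its own current sequence id
 * and safely skip files that contain nothing newer.
 */
Path getCompletedRecoveredEditsFilePath(Path srcPath, Long maximumEditLogSeqNum) {
  String fileName = formatRecoveredEditsFileName(maximumEditLogSeqNum);
  return new Path(srcPath.getParent(), fileName);
}
{code}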
[jira] [Updated] (HBASE-5672) TestLruBlockCache#testBackgroundEvictionThread fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5672:

    Attachment: HBASE-5672v2.patch

In patch v2 I added a flag in the eviction thread. Please review again. Thanks.

TestLruBlockCache#testBackgroundEvictionThread fails occasionally
-----------------------------------------------------------------

                 Key: HBASE-5672
                 URL: https://issues.apache.org/jira/browse/HBASE-5672
             Project: HBase
          Issue Type: Bug
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: HBASE-5672.patch, HBASE-5672v2.patch

We find that TestLruBlockCache#testBackgroundEvictionThread fails occasionally. I think it is a problem in the test case itself, because runEviction() only calls evictionThread.evict():
{code}
public void evict() {
  synchronized (this) {
    this.notify(); // FindBugs NN_NAKED_NOTIFY
  }
}
{code}
However, when we call evictionThread.evict(), the eviction thread may not yet have entered run() in TestLruBlockCache#testBackgroundEvictionThread, so the notify() is lost. If we run the test many times, the failure is easy to reproduce.
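The flag lets the test wait until the eviction thread has actually reached run() before poking it. A minimal sketch of that idea, with field and method names that are assumptions rather than necessarily those in the patch:
{code}
static class EvictionThread extends Thread {
  private volatile boolean enteringRun = false;

  @Override
  public void run() {
    enteringRun = true; // signal that the thread has reached run()
    while (true) {
      synchronized (this) {
        try {
          this.wait(); // park until evict() notifies us
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
      // ... perform the actual eviction pass ...
    }
  }

  public void evict() {
    synchronized (this) {
      this.notify(); // FindBugs NN_NAKED_NOTIFY
    }
  }

  /** The test spins on this before triggering an eviction. */
  public boolean isEnteringRun() {
    return this.enteringRun;
  }
}
{code}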
[jira] [Updated] (HBASE-5689) Skipping RecoveredEdits may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5689:

    Attachment: HBASE-5689v3.patch

In patch v3 I append the optimization suggested by Ming Ma: replayRecoveredEditsIfAny simply walks through all the recovered-edits files and skips those whose maximum sequence id is <= the current sequence id of the region.

Skipping RecoveredEdits may cause data loss
-------------------------------------------

                 Key: HBASE-5689
                 URL: https://issues.apache.org/jira/browse/HBASE-5689
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.94.0
            Reporter: chunhui shen
            Assignee: chunhui shen
            Priority: Critical
             Fix For: 0.94.0
         Attachments: 5689-testcase.patch, HBASE-5689.patch, HBASE-5689.patch, HBASE-5689v2.patch, HBASE-5689v3.patch

Consider the following scenario:
1. The region is on server A.
2. Put KV (r1-v1) to the region.
3. Move the region from server A to server B.
4. Put KV (r2-v2) to the region.
5. Move the region from server B to server A.
6. Put KV (r3-v3) to the region.
7. kill -9 server B and restart it.
8. kill -9 server A and restart it.
9. Scan the region: we get only two KVs (r1-v1, r2-v2); the third KV (r3-v3) is lost.

Let's analyze the scenario in the code:
1. The edit logs of KV (r1-v1) and KV (r3-v3) are both recorded in the same hlog file on server A.
2. When we split server B's hlog file in the process of ServerShutdownHandler, we create one RecoveredEdits file f1 for the region.
3. When we split server A's hlog file in the process of ServerShutdownHandler, we create another RecoveredEdits file f2 for the region.
4. However, RecoveredEdits file f2 will be skipped when the region is initialized, in HRegion#replayRecoveredEditsIfAny:
{code}
for (Path edits: files) {
  if (edits == null || !this.fs.exists(edits)) {
    LOG.warn("Null or non-existent edits file: " + edits);
    continue;
  }
  if (isZeroLengthThenDelete(this.fs, edits)) continue;

  if (checkSafeToSkip) {
    Path higher = files.higher(edits);
    long maxSeqId = Long.MAX_VALUE;
    if (higher != null) {
      // Edit file name pattern, HLog.EDITFILES_NAME_PATTERN: "-?[0-9]+"
      String fileName = higher.getName();
      maxSeqId = Math.abs(Long.parseLong(fileName));
    }
    if (maxSeqId <= minSeqId) {
      String msg = "Maximum possible sequenceid for this log is " + maxSeqId
          + ", skipped the whole file, path=" + edits;
      LOG.debug(msg);
      continue;
    } else {
      checkSafeToSkip = false;
    }
  }
}
{code}
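Because each file name now encodes that file's own maximum sequence id, every file can be judged independently, with no "safe prefix" bookkeeping. A minimal sketch of such a walk (not the literal patch):
{code}
for (Path edits : files) {
  // The file name is the maximum edit sequence id contained in the file.
  long maxSeqId = Math.abs(Long.parseLong(edits.getName()));
  if (maxSeqId <= minSeqId) {
    // Nothing in this file is newer than what the region already has.
    LOG.debug("Maximum sequenceid for this log is " + maxSeqId
        + ", skipped the whole file, path=" + edits);
    continue;
  }
  seqid = replayRecoveredEdits(edits, seqid, reporter);
}
{code}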
[jira] [Updated] (HBASE-5689) Skip RecoveredEdits may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5689:

    Attachment: HBASE-5689.patch

In the patch, I make the region's maximum edit-log sequence number (MaximumEditLogSeqNum) contained in the RecoveredEdits file the file name.

Skip RecoveredEdits may cause data loss
---------------------------------------

                 Key: HBASE-5689
                 URL: https://issues.apache.org/jira/browse/HBASE-5689
             Project: HBase
          Issue Type: Bug
          Components: regionserver
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: 5689-testcase.patch, HBASE-5689.patch

Consider the following scenario:
1. The region is on server A.
2. Put KV (r1-v1) to the region.
3. Move the region from server A to server B.
4. Put KV (r2-v2) to the region.
5. Move the region from server B to server A.
6. Put KV (r3-v3) to the region.
7. kill -9 server B and restart it.
8. kill -9 server A and restart it.
9. Scan the region: we get only two KVs (r1-v1, r2-v2); the third KV (r3-v3) is lost.

Let's analyze the scenario in the code:
1. The edit logs of KV (r1-v1) and KV (r3-v3) are both recorded in the same hlog file on server A.
2. When we split server B's hlog file in the process of ServerShutdownHandler, we create one RecoveredEdits file f1 for the region.
3. When we split server A's hlog file in the process of ServerShutdownHandler, we create another RecoveredEdits file f2 for the region.
4. However, RecoveredEdits file f2 will be skipped when the region is initialized, in HRegion#replayRecoveredEditsIfAny:
{code}
for (Path edits: files) {
  if (edits == null || !this.fs.exists(edits)) {
    LOG.warn("Null or non-existent edits file: " + edits);
    continue;
  }
  if (isZeroLengthThenDelete(this.fs, edits)) continue;

  if (checkSafeToSkip) {
    Path higher = files.higher(edits);
    long maxSeqId = Long.MAX_VALUE;
    if (higher != null) {
      // Edit file name pattern, HLog.EDITFILES_NAME_PATTERN: "-?[0-9]+"
      String fileName = higher.getName();
      maxSeqId = Math.abs(Long.parseLong(fileName));
    }
    if (maxSeqId <= minSeqId) {
      String msg = "Maximum possible sequenceid for this log is " + maxSeqId
          + ", skipped the whole file, path=" + edits;
      LOG.debug(msg);
      continue;
    } else {
      checkSafeToSkip = false;
    }
  }
}
{code}
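A minimal sketch of the renaming step that change implies; the zero-padded format and the surrounding variable names are assumptions, not necessarily what the patch does:
{code}
// Once the writer for this recovered.edits file is closed, rename the
// temporary file so that its final name is the maximum sequence id of
// the edits it contains.
String fileName = String.format("%019d", maximumEditLogSeqNum);
Path dst = new Path(srcPath.getParent(), fileName);
if (!fs.rename(srcPath, dst)) {
  throw new IOException("Failed renaming " + srcPath + " to " + dst);
}
{code}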
[jira] [Updated] (HBASE-5689) Skip RecoveredEdits may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5689:

    Attachment: 5689-testcase.patch

I have written a test case for this issue that produces a data-loss scenario when RecoveredEdits are skipped.

Skip RecoveredEdits may cause data loss
---------------------------------------

                 Key: HBASE-5689
                 URL: https://issues.apache.org/jira/browse/HBASE-5689
             Project: HBase
          Issue Type: Bug
          Components: regionserver
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: 5689-testcase.patch

Consider the following scenario:
1. The region is on server A.
2. Put KV (r1-v1) to the region.
3. Move the region from server A to server B.
4. Put KV (r2-v2) to the region.
5. Move the region from server B to server A.
6. Put KV (r3-v3) to the region.
7. kill -9 server B and restart it.
8. kill -9 server A and restart it.
9. Scan the region: we get only two KVs (r1-v1, r2-v2); the third KV (r3-v3) is lost.

Let's analyze the scenario in the code:
1. The edit logs of KV (r1-v1) and KV (r3-v3) are both recorded in the same hlog file on server A.
2. When we split server B's hlog file in the process of ServerShutdownHandler, we create one RecoveredEdits file f1 for the region.
3. When we split server A's hlog file in the process of ServerShutdownHandler, we create another RecoveredEdits file f2 for the region.
4. However, RecoveredEdits file f2 will be skipped when the region is initialized, in HRegion#replayRecoveredEditsIfAny:
{code}
for (Path edits: files) {
  if (edits == null || !this.fs.exists(edits)) {
    LOG.warn("Null or non-existent edits file: " + edits);
    continue;
  }
  if (isZeroLengthThenDelete(this.fs, edits)) continue;

  if (checkSafeToSkip) {
    Path higher = files.higher(edits);
    long maxSeqId = Long.MAX_VALUE;
    if (higher != null) {
      // Edit file name pattern, HLog.EDITFILES_NAME_PATTERN: "-?[0-9]+"
      String fileName = higher.getName();
      maxSeqId = Math.abs(Long.parseLong(fileName));
    }
    if (maxSeqId <= minSeqId) {
      String msg = "Maximum possible sequenceid for this log is " + maxSeqId
          + ", skipped the whole file, path=" + edits;
      LOG.debug(msg);
      continue;
    } else {
      checkSafeToSkip = false;
    }
  }
}
{code}
[jira] [Updated] (HBASE-5672) TestLruBlockCache#testBackgroundEvictionThread fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5672:

    Description:
We find that TestLruBlockCache#testBackgroundEvictionThread fails occasionally. I think it is a problem in the test case itself, because runEviction() only calls evictionThread.evict():
{code}
public void evict() {
  synchronized (this) {
    this.notify(); // FindBugs NN_NAKED_NOTIFY
  }
}
{code}
However, when we call evictionThread.evict(), the eviction thread may not yet have entered run() in TestLruBlockCache#testBackgroundEvictionThread. If we run the test many times, the failure is easy to reproduce.

  was:
We find that TestLruBlockCache#testBackgroundEvictionThread fails occasionally. I think it is a problem in the test case itself, because runEviction() only calls evictionThread.evict():
{code}
public void evict() {
  synchronized (this) {
    this.notify(); // FindBugs NN_NAKED_NOTIFY
  }
}
{code}
However, when we call evictionThread.evict(), the eviction thread may not yet have entered run() in TestLruBlockCache#testBackgroundEvictionThread.

TestLruBlockCache#testBackgroundEvictionThread fails occasionally
-----------------------------------------------------------------

                 Key: HBASE-5672
                 URL: https://issues.apache.org/jira/browse/HBASE-5672
             Project: HBase
          Issue Type: Bug
            Reporter: chunhui shen
            Assignee: chunhui shen

We find that TestLruBlockCache#testBackgroundEvictionThread fails occasionally. I think it is a problem in the test case itself, because runEviction() only calls evictionThread.evict():
{code}
public void evict() {
  synchronized (this) {
    this.notify(); // FindBugs NN_NAKED_NOTIFY
  }
}
{code}
However, when we call evictionThread.evict(), the eviction thread may not yet have entered run() in TestLruBlockCache#testBackgroundEvictionThread. If we run the test many times, the failure is easy to reproduce.
[jira] [Updated] (HBASE-5672) TestLruBlockCache#testBackgroundEvictionThread fails occasionally
[ https://issues.apache.org/jira/browse/HBASE-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5672:

    Attachment: HBASE-5672.patch

No failures in 1000 runs with the patch.

TestLruBlockCache#testBackgroundEvictionThread fails occasionally
-----------------------------------------------------------------

                 Key: HBASE-5672
                 URL: https://issues.apache.org/jira/browse/HBASE-5672
             Project: HBase
          Issue Type: Bug
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: HBASE-5672.patch

We find that TestLruBlockCache#testBackgroundEvictionThread fails occasionally. I think it is a problem in the test case itself, because runEviction() only calls evictionThread.evict():
{code}
public void evict() {
  synchronized (this) {
    this.notify(); // FindBugs NN_NAKED_NOTIFY
  }
}
{code}
However, when we call evictionThread.evict(), the eviction thread may not yet have entered run() in TestLruBlockCache#testBackgroundEvictionThread. If we run the test many times, the failure is easy to reproduce.
[jira] [Updated] (HBASE-5624) Aborting regionserver when splitting region, may cause daughter region not assigned by ServerShutdownHandler.
[ https://issues.apache.org/jira/browse/HBASE-5624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5624:

    Attachment: HBASE-5624.patch

Aborting regionserver when splitting region, may cause daughter region not assigned by ServerShutdownHandler.
--------------------------------------------------------------------------------------------------------------

                 Key: HBASE-5624
                 URL: https://issues.apache.org/jira/browse/HBASE-5624
             Project: HBase
          Issue Type: Bug
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: HBASE-5624.patch

If a region is being split while the regionserver is stopping, the following code may be executed in SplitTransaction#openDaughters:
{code}
// TODO: Is this check needed here?
if (stopped || stopping) {
  // add 2nd daughter first (see HBASE-4335)
  MetaEditor.addDaughter(server.getCatalogTracker(), b.getRegionInfo(), null);
  MetaEditor.addDaughter(server.getCatalogTracker(), a.getRegionInfo(), null);
  LOG.info("Not opening daughters " +
      b.getRegionInfo().getRegionNameAsString() +
      " and " +
      a.getRegionInfo().getRegionNameAsString() +
      " because stopping=" + stopping + ", stopped=" + stopped);
}
{code}
So the locations of the two daughter regions are both null in .META. When ServerShutdownHandler processes the dead server, it will not assign these two daughter regions, since their location (info:server) is null in .META. as read by MetaReader.getServerUserRegions().
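To see why a null location hides the daughters, here is a hedged illustration of the kind of filtering done when collecting a dead server's regions from .META.; this is a sketch of the idea, not the actual MetaReader source, and regionInfoOf is a hypothetical helper:
{code}
// Rows whose info:server column is empty can never match the dead
// server's address, so daughter regions written with a null location
// are silently left out of the reassignment set.
byte[] value = result.getValue(HConstants.CATALOG_FAMILY,
    HConstants.SERVER_QUALIFIER);
if (value == null || value.length == 0) {
  continue; // no location recorded for this region: it is skipped
}
if (deadServerAddress.equals(Bytes.toString(value))) {
  regions.add(regionInfoOf(result));
}
{code}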
[jira] [Updated] (HBASE-5568) Multi concurrent flushcache() for one region could cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5568:

    Attachment: HBASE-5568v2.patch

Multi concurrent flushcache() for one region could cause data loss
-------------------------------------------------------------------

                 Key: HBASE-5568
                 URL: https://issues.apache.org/jira/browse/HBASE-5568
             Project: HBase
          Issue Type: Bug
          Components: regionserver
            Reporter: chunhui shen
            Assignee: chunhui shen
             Fix For: 0.90.7, 0.92.2, 0.94.0, 0.96.0
         Attachments: HBASE-5568-90.patch, HBASE-5568.patch, HBASE-5568.patch, HBASE-5568v2.patch, HBASE-5568v2.patch

We can now call HRegion#flushcache() concurrently, through HRegionServer#splitRegion or HRegionServer#flushRegion via HBaseAdmin. However, we find that if HRegion#internalFlushcache() is called concurrently by multiple threads, HRegion.memstoreSize is calculated incorrectly. At the end of HRegion#internalFlushcache() we do this.addAndGetGlobalMemstoreSize(-flushsize), but flushsize may not be the actual memstore size that was flushed to HDFS. This can make HRegion.memstoreSize negative and prevent the next flush when we close the region.

Logs in the RS for region e9d827913a056e696c39bc569ea3f99f:

2012-03-11 16:31:36,690 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 128.0m
2012-03-11 16:31:37,999 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/8162481165586107427, entries=153106, sequenceid=619316544, memsize=59.6m, filesize=31.2m
2012-03-11 16:31:38,830 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 134.8m
2012-03-11 16:31:39,458 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/3425971951499794221, entries=230183, sequenceid=619316544, memsize=68.5m, filesize=26.6m
2012-03-11 16:31:39,459 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~128.1m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 2769ms, sequenceid=619316544, compaction requested=false
2012-03-11 16:31:39,459 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 6.8m
2012-03-11 16:31:39,529 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/1811012969998104626, entries=8002, sequenceid=619332759, memsize=3.1m, filesize=1.6m
2012-03-11 16:31:39,640 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/770333473623552048, entries=12231, sequenceid=619332759, memsize=3.6m, filesize=1.4m
2012-03-11 16:31:39,641 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~134.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 811ms, sequenceid=619332759, compaction requested=true
2012-03-11 16:31:39,707 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/5656568849587368557, entries=119, sequenceid=619332979, memsize=47.4k, filesize=25.6k
2012-03-11 16:31:39,775 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/794343845650987521, entries=157, sequenceid=619332979, memsize=47.8k, filesize=19.3k
2012-03-11 16:31:39,777 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~6.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 318ms, sequenceid=619332979, compaction requested=true
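One way to rule this race out is to ensure only one internalFlushcache() runs at a time per region. A minimal sketch of such a guard, assuming a writestate monitor object with a flushing flag; HBase's actual synchronization may differ in detail:
{code}
public boolean flushcache() throws IOException {
  synchronized (writestate) {
    if (writestate.flushing || !writestate.writesEnabled) {
      // A flush is already running (or the region is closing), so do
      // not let a second internalFlushcache() run concurrently.
      return false;
    }
    writestate.flushing = true;
  }
  try {
    return internalFlushcache();
  } finally {
    synchronized (writestate) {
      writestate.flushing = false;
      writestate.notifyAll();
    }
  }
}
{code}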
[jira] [Updated] (HBASE-5571) Table will be disabling forever
[ https://issues.apache.org/jira/browse/HBASE-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5571:

    Attachment: HBASE-5571v3.patch

Patch v3 changes the unassign() call in nodeDeleted() to invokeUnassign().

Table will be disabling forever
-------------------------------

                 Key: HBASE-5571
                 URL: https://issues.apache.org/jira/browse/HBASE-5571
             Project: HBase
          Issue Type: Bug
          Components: master, regionserver
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: BASE-5571v2.patch, HBASE-5571.patch, HBASE-5571v3.patch

If we restart the master while it is disabling a table, the table will stay in the disabling state forever. In the current logic, the region CLOSE RPC always returns NotServingRegionException, because the RS had already closed the region before we restarted the master; the table therefore stays disabling because the region remains in RIT all along. In another case, AssignmentManager#rebuildUserRegions() puts parent regions into AssignmentManager.regions, so if we disable the table we cannot close these parent regions until they are purged by the CatalogJanitor.
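For reference, a hedged sketch of what that change amounts to: instead of unassigning inline on the ZooKeeper event thread, nodeDeleted() hands the work to the master's executor. The names follow the AssignmentManager of that era, but treat this as a sketch rather than the patch itself:
{code}
// Submitting keeps the ZK callback fast and avoids blocking the event
// thread on a potentially slow unassign.
private void invokeUnassign(final HRegionInfo regionInfo) {
  threadPoolExecutorService.submit(new UnAssignCallable(this, regionInfo));
}
{code}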
[jira] [Updated] (HBASE-5568) Multi concurrent flushcache() for one region could cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5568:

    Attachment: HBASE-5568v2.patch

Patch v2 for the test case: change the ttl in TestStore#testDeleteExpiredStoreFiles from 1 to 4.

Multi concurrent flushcache() for one region could cause data loss
-------------------------------------------------------------------

                 Key: HBASE-5568
                 URL: https://issues.apache.org/jira/browse/HBASE-5568
             Project: HBase
          Issue Type: Bug
          Components: regionserver
            Reporter: chunhui shen
            Assignee: chunhui shen
             Fix For: 0.90.7, 0.92.2, 0.94.0, 0.96.0
         Attachments: HBASE-5568-90.patch, HBASE-5568.patch, HBASE-5568.patch, HBASE-5568v2.patch

We can now call HRegion#flushcache() concurrently, through HRegionServer#splitRegion or HRegionServer#flushRegion via HBaseAdmin. However, we find that if HRegion#internalFlushcache() is called concurrently by multiple threads, HRegion.memstoreSize is calculated incorrectly. At the end of HRegion#internalFlushcache() we do this.addAndGetGlobalMemstoreSize(-flushsize), but flushsize may not be the actual memstore size that was flushed to HDFS. This can make HRegion.memstoreSize negative and prevent the next flush when we close the region.

Logs in the RS for region e9d827913a056e696c39bc569ea3f99f:

2012-03-11 16:31:36,690 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 128.0m
2012-03-11 16:31:37,999 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/8162481165586107427, entries=153106, sequenceid=619316544, memsize=59.6m, filesize=31.2m
2012-03-11 16:31:38,830 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 134.8m
2012-03-11 16:31:39,458 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/3425971951499794221, entries=230183, sequenceid=619316544, memsize=68.5m, filesize=26.6m
2012-03-11 16:31:39,459 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~128.1m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 2769ms, sequenceid=619316544, compaction requested=false
2012-03-11 16:31:39,459 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 6.8m
2012-03-11 16:31:39,529 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/1811012969998104626, entries=8002, sequenceid=619332759, memsize=3.1m, filesize=1.6m
2012-03-11 16:31:39,640 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/770333473623552048, entries=12231, sequenceid=619332759, memsize=3.6m, filesize=1.4m
2012-03-11 16:31:39,641 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~134.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 811ms, sequenceid=619332759, compaction requested=true
2012-03-11 16:31:39,707 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/5656568849587368557, entries=119, sequenceid=619332979, memsize=47.4k, filesize=25.6k
2012-03-11 16:31:39,775 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/794343845650987521, entries=157, sequenceid=619332979, memsize=47.8k, filesize=19.3k
2012-03-11 16:31:39,777 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~6.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 318ms, sequenceid=619332979, compaction requested=true
[jira] [Updated] (HBASE-5568) Multi concurrent flushcache() for one region could cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5568:

    Attachment: HBASE-5568-90.patch

Submitted a patch for the 0.90 branch.

Multi concurrent flushcache() for one region could cause data loss
-------------------------------------------------------------------

                 Key: HBASE-5568
                 URL: https://issues.apache.org/jira/browse/HBASE-5568
             Project: HBase
          Issue Type: Bug
          Components: regionserver
            Reporter: chunhui shen
            Assignee: chunhui shen
             Fix For: 0.92.2, 0.94.0, 0.96.0
         Attachments: HBASE-5568-90.patch, HBASE-5568.patch

We can now call HRegion#flushcache() concurrently, through HRegionServer#splitRegion or HRegionServer#flushRegion via HBaseAdmin. However, we find that if HRegion#internalFlushcache() is called concurrently by multiple threads, HRegion.memstoreSize is calculated incorrectly. At the end of HRegion#internalFlushcache() we do this.addAndGetGlobalMemstoreSize(-flushsize), but flushsize may not be the actual memstore size that was flushed to HDFS. This can make HRegion.memstoreSize negative and prevent the next flush when we close the region.

Logs in the RS for region e9d827913a056e696c39bc569ea3f99f:

2012-03-11 16:31:36,690 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 128.0m
2012-03-11 16:31:37,999 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/8162481165586107427, entries=153106, sequenceid=619316544, memsize=59.6m, filesize=31.2m
2012-03-11 16:31:38,830 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 134.8m
2012-03-11 16:31:39,458 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/3425971951499794221, entries=230183, sequenceid=619316544, memsize=68.5m, filesize=26.6m
2012-03-11 16:31:39,459 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~128.1m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 2769ms, sequenceid=619316544, compaction requested=false
2012-03-11 16:31:39,459 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 6.8m
2012-03-11 16:31:39,529 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/1811012969998104626, entries=8002, sequenceid=619332759, memsize=3.1m, filesize=1.6m
2012-03-11 16:31:39,640 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/770333473623552048, entries=12231, sequenceid=619332759, memsize=3.6m, filesize=1.4m
2012-03-11 16:31:39,641 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~134.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 811ms, sequenceid=619332759, compaction requested=true
2012-03-11 16:31:39,707 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/5656568849587368557, entries=119, sequenceid=619332979, memsize=47.4k, filesize=25.6k
2012-03-11 16:31:39,775 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/794343845650987521, entries=157, sequenceid=619332979, memsize=47.8k, filesize=19.3k
2012-03-11 16:31:39,777 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~6.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 318ms, sequenceid=619332979, compaction requested=true
[jira] [Updated] (HBASE-5571) Table will be disabling forever
[ https://issues.apache.org/jira/browse/HBASE-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5571:

    Attachment: HBASE-5571.patch

Table will be disabling forever
-------------------------------

                 Key: HBASE-5571
                 URL: https://issues.apache.org/jira/browse/HBASE-5571
             Project: HBase
          Issue Type: Bug
          Components: master, regionserver
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: HBASE-5571.patch

If we restart the master while it is disabling a table, the table will stay in the disabling state forever. In the current logic, the region CLOSE RPC always returns NotServingRegionException, because the RS had already closed the region before we restarted the master; the table therefore stays disabling because the region remains in RIT all along. In another case, AssignmentManager#rebuildUserRegions() puts parent regions into AssignmentManager.regions, so if we disable the table we cannot close these parent regions until they are purged by the CatalogJanitor.
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5270:

    Attachment: HBASE-5270-92v11.patch

Patch v11 for 0.92.

Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
------------------------------------------------------------------------------------------------------

                 Key: HBASE-5270
                 URL: https://issues.apache.org/jira/browse/HBASE-5270
             Project: HBase
          Issue Type: Sub-task
          Components: master
            Reporter: Zhihong Yu
            Assignee: chunhui shen
             Fix For: 0.92.2
         Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 5270-testcasev2.patch, HBASE-5270-92v11.patch, HBASE-5270v11.patch, hbase-5270.patch, hbase-5270v10.patch, hbase-5270v2.patch, hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, hbase-5270v7.patch, hbase-5270v8.patch, hbase-5270v9.patch, sampletest.txt

This JIRA continues the effort from HBASE-5179. Starting with Stack's comments about patches for 0.92 and TRUNK:

Reviewing 0.92v17: isDeadServerInProgress is a new public method in ServerManager, but it does not seem to be used anywhere. Does isDeadRootServerInProgress need to be public? Ditto for the meta version. The method param names are not right ('definitiveRootServer'): what is meant by definitive? Do they need this qualifier? Is there anything in place to stop us expiring a server twice if it is carrying root and meta? What is the difference between asking the assignment manager isCarryingRoot and this variable that is passed in? Should be doc'd at least. Ditto for meta.

I think I've asked for this a few times: onlineServers needs to be explained, either in javadoc or in a comment. This is the param passed into joinCluster. How does it arise? I think I know, but am unsure. God love the poor noob that comes a-wandering this code trying to make sense of it all. It looks like we get the list by trawling zk for regionserver znodes that have not checked in. Don't we do this operation earlier in master setup? Are we doing it again here?

Though distributed log splitting is configured, with this patch we will do single-process splitting in the master under some conditions. It is not explained in the code why we would do this. Why do we think master log splitting 'high priority' when it could very well be slower? Should we only go this route if distributed splitting is not going on? Do we know whether concurrent distributed log splitting and master splitting works?

Why would we have dead servers in progress here in master startup? Because a servershutdownhandler fired?

This patch is different from the patch for 0.90. It should go into trunk first with tests, then 0.92. Should it be in this issue? This issue is really hard to follow now. Maybe this issue is for 0.90.x, with a new issue for more work on this trunk patch? This patch needs to have the v18 differences applied.
[jira] [Updated] (HBASE-5571) Table will be disabling forever
[ https://issues.apache.org/jira/browse/HBASE-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5571:

    Attachment: BASE-5571v2.patch

Table will be disabling forever
-------------------------------

                 Key: HBASE-5571
                 URL: https://issues.apache.org/jira/browse/HBASE-5571
             Project: HBase
          Issue Type: Bug
          Components: master, regionserver
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: BASE-5571v2.patch, HBASE-5571.patch

If we restart the master while it is disabling a table, the table will stay in the disabling state forever. In the current logic, the region CLOSE RPC always returns NotServingRegionException, because the RS had already closed the region before we restarted the master; the table therefore stays disabling because the region remains in RIT all along. In another case, AssignmentManager#rebuildUserRegions() puts parent regions into AssignmentManager.regions, so if we disable the table we cannot close these parent regions until they are purged by the CatalogJanitor.
[jira] [Updated] (HBASE-5563) HRegionInfo#compareTo add the comparison of regionId
[ https://issues.apache.org/jira/browse/HBASE-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5563:

    Attachment: HBASE-5563v2.patch

HRegionInfo#compareTo should add the comparison of regionId
-----------------------------------------------------------

                 Key: HBASE-5563
                 URL: https://issues.apache.org/jira/browse/HBASE-5563
             Project: HBase
          Issue Type: Bug
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: HBASE-5563.patch, HBASE-5563v2.patch

In the one-region-multi-assigned case, we could find two regions with the same table name, the same startKey, and the same endKey, but different regionIds. These two regions compare as the same in a TreeMap but as different in a HashMap.
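A minimal sketch of the comparison with a regionId tiebreaker added; the field layout matches HRegionInfo of that era, but treat this as a sketch rather than the patch itself:
{code}
public int compareTo(HRegionInfo o) {
  if (o == null) return 1;
  int result = Bytes.compareTo(this.tableName, o.tableName);
  if (result != 0) return result;
  result = Bytes.compareTo(this.startKey, o.startKey);
  if (result != 0) return result;
  result = Bytes.compareTo(this.endKey, o.endKey);
  if (result != 0) return result;
  // New: fall back to the regionId, so that two regions differing only
  // in regionId no longer compare as equal in a TreeMap.
  if (this.regionId == o.regionId) return 0;
  return this.regionId < o.regionId ? -1 : 1;
}
{code}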
[jira] [Updated] (HBASE-5568) Multi concurrent flushcache() for one region could cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5568:

    Attachment: HBASE-5568.patch

Multi concurrent flushcache() for one region could cause data loss
-------------------------------------------------------------------

                 Key: HBASE-5568
                 URL: https://issues.apache.org/jira/browse/HBASE-5568
             Project: HBase
          Issue Type: Bug
          Components: regionserver
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: HBASE-5568.patch

We can now call HRegion#flushcache() concurrently, through HRegionServer#splitRegion or HRegionServer#flushRegion via HBaseAdmin. However, we find that if HRegion#internalFlushcache() is called concurrently by multiple threads, HRegion.memstoreSize is calculated incorrectly. At the end of HRegion#internalFlushcache() we do this.addAndGetGlobalMemstoreSize(-flushsize), but flushsize may not be the actual memstore size that was flushed to HDFS. This can make HRegion.memstoreSize negative and prevent the next flush when we close the region.

Logs in the RS for region e9d827913a056e696c39bc569ea3f99f:

2012-03-11 16:31:36,690 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 128.0m
2012-03-11 16:31:37,999 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/8162481165586107427, entries=153106, sequenceid=619316544, memsize=59.6m, filesize=31.2m
2012-03-11 16:31:38,830 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 134.8m
2012-03-11 16:31:39,458 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/3425971951499794221, entries=230183, sequenceid=619316544, memsize=68.5m, filesize=26.6m
2012-03-11 16:31:39,459 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~128.1m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 2769ms, sequenceid=619316544, compaction requested=false
2012-03-11 16:31:39,459 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 6.8m
2012-03-11 16:31:39,529 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/1811012969998104626, entries=8002, sequenceid=619332759, memsize=3.1m, filesize=1.6m
2012-03-11 16:31:39,640 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/770333473623552048, entries=12231, sequenceid=619332759, memsize=3.6m, filesize=1.4m
2012-03-11 16:31:39,641 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~134.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 811ms, sequenceid=619332759, compaction requested=true
2012-03-11 16:31:39,707 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/5656568849587368557, entries=119, sequenceid=619332979, memsize=47.4k, filesize=25.6k
2012-03-11 16:31:39,775 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/794343845650987521, entries=157, sequenceid=619332979, memsize=47.8k, filesize=19.3k
2012-03-11 16:31:39,777 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~6.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 318ms, sequenceid=619332979, compaction requested=true
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5270:

    Attachment: HBASE-5270v11.patch

Minor items addressed in patch v11: https://reviews.apache.org/r/4021/

Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
------------------------------------------------------------------------------------------------------

                 Key: HBASE-5270
                 URL: https://issues.apache.org/jira/browse/HBASE-5270
             Project: HBase
          Issue Type: Sub-task
          Components: master
            Reporter: Zhihong Yu
            Assignee: chunhui shen
             Fix For: 0.92.2
         Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 5270-testcasev2.patch, HBASE-5270v11.patch, hbase-5270.patch, hbase-5270v10.patch, hbase-5270v2.patch, hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, hbase-5270v7.patch, hbase-5270v8.patch, hbase-5270v9.patch, sampletest.txt

This JIRA continues the effort from HBASE-5179. Starting with Stack's comments about patches for 0.92 and TRUNK:

Reviewing 0.92v17: isDeadServerInProgress is a new public method in ServerManager, but it does not seem to be used anywhere. Does isDeadRootServerInProgress need to be public? Ditto for the meta version. The method param names are not right ('definitiveRootServer'): what is meant by definitive? Do they need this qualifier? Is there anything in place to stop us expiring a server twice if it is carrying root and meta? What is the difference between asking the assignment manager isCarryingRoot and this variable that is passed in? Should be doc'd at least. Ditto for meta.

I think I've asked for this a few times: onlineServers needs to be explained, either in javadoc or in a comment. This is the param passed into joinCluster. How does it arise? I think I know, but am unsure. God love the poor noob that comes a-wandering this code trying to make sense of it all. It looks like we get the list by trawling zk for regionserver znodes that have not checked in. Don't we do this operation earlier in master setup? Are we doing it again here?

Though distributed log splitting is configured, with this patch we will do single-process splitting in the master under some conditions. It is not explained in the code why we would do this. Why do we think master log splitting 'high priority' when it could very well be slower? Should we only go this route if distributed splitting is not going on? Do we know whether concurrent distributed log splitting and master splitting works?

Why would we have dead servers in progress here in master startup? Because a servershutdownhandler fired?

This patch is different from the patch for 0.90. It should go into trunk first with tests, then 0.92. Should it be in this issue? This issue is really hard to follow now. Maybe this issue is for 0.90.x, with a new issue for more work on this trunk patch? This patch needs to have the v18 differences applied.
[jira] [Updated] (HBASE-5563) HRegionInfo#compareTo add the comparison of regionId
[ https://issues.apache.org/jira/browse/HBASE-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5563:

    Attachment: HBASE-5563.patch

HRegionInfo#compareTo should add the comparison of regionId
-----------------------------------------------------------

                 Key: HBASE-5563
                 URL: https://issues.apache.org/jira/browse/HBASE-5563
             Project: HBase
          Issue Type: Bug
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: HBASE-5563.patch

In the one-region-multi-assigned case, we could find two regions with the same table name, the same startKey, and the same endKey, but different regionIds. These two regions compare as the same in a TreeMap but as different in a HashMap.
[jira] [Updated] (HBASE-5528) Retry splitting log if failed in the process of ServerShutdownHandler, and abort master when retries exhausted
[ https://issues.apache.org/jira/browse/HBASE-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5528:

    Attachment: hbase-5528v3.patch

Retry splitting log if failed in the process of ServerShutdownHandler, and abort master when retries exhausted
---------------------------------------------------------------------------------------------------------------

                 Key: HBASE-5528
                 URL: https://issues.apache.org/jira/browse/HBASE-5528
             Project: HBase
          Issue Type: Bug
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: hbase-5528.patch, hbase-5528v2.patch, hbase-5528v3.patch

We will retry splitting the log if it fails in splitLogAfterStartup when the master starts. However, there is no retry for failed log splitting in the process of ServerShutdownHandler. Also, if we finally fail to split the log, we should abort the master even if the filesystem is OK, to prevent data loss.
[jira] [Updated] (HBASE-5528) Change retrying splitting log forever if throws IOException to numbered times, and abort master when retries exhausted
[ https://issues.apache.org/jira/browse/HBASE-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5528:

    Description: In the current log-splitting retry logic, we retry forever if an IOException is thrown. I think we had better change this to a bounded number of retries, and abort the master when the retries are exhausted.

  was: We will retry splitting the log if it fails in splitLogAfterStartup when the master starts. However, there is no retry for failed log splitting in the process of ServerShutdownHandler. Also, if we finally fail to split the log, we should abort the master even if the filesystem is OK, to prevent data loss.

    Summary: Change retrying splitting log forever if throws IOException to numbered times, and abort master when retries exhausted (was: Retry splitting log if failed in the process of ServerShutdownHandler, and abort master when retries exhausted)

Change retrying splitting log forever if throws IOException to numbered times, and abort master when retries exhausted
------------------------------------------------------------------------------------------------------------------------

                 Key: HBASE-5528
                 URL: https://issues.apache.org/jira/browse/HBASE-5528
             Project: HBase
          Issue Type: Bug
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: hbase-5528.patch, hbase-5528v2.patch, hbase-5528v3.patch

In the current log-splitting retry logic, we retry forever if an IOException is thrown. I think we had better change this to a bounded number of retries, and abort the master when the retries are exhausted.
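A minimal sketch of the bounded-retry shape described above; the configuration key and surrounding method names are assumptions, not the actual patch:
{code}
int maxRetries = conf.getInt("hbase.splitlog.max.retries", 3); // assumed key
IOException lastFailure = null;
for (int attempt = 1; attempt <= maxRetries; attempt++) {
  try {
    splitLog(serverName);
    return;
  } catch (IOException e) {
    lastFailure = e;
    LOG.warn("Failed splitting logs of " + serverName + ", attempt="
        + attempt + "/" + maxRetries, e);
  }
}
// Retries exhausted: abort even if the filesystem still checks out,
// otherwise regions could open without their recovered edits.
master.abort("Log splitting of " + serverName + " failed", lastFailure);
{code}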
[jira] [Updated] (HBASE-5528) Retry splitting log if failed in the process of ServerShutdownHandler, and abort master when retries exhausted
[ https://issues.apache.org/jira/browse/HBASE-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5528:

    Attachment: hbase-5528.patch

Retry splitting log if failed in the process of ServerShutdownHandler, and abort master when retries exhausted
---------------------------------------------------------------------------------------------------------------

                 Key: HBASE-5528
                 URL: https://issues.apache.org/jira/browse/HBASE-5528
             Project: HBase
          Issue Type: Bug
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: hbase-5528.patch

We will retry splitting the log if it fails in splitLogAfterStartup when the master starts. However, there is no retry for failed log splitting in the process of ServerShutdownHandler. Also, if we finally fail to split the log, we should abort the master even if the filesystem is OK, to prevent data loss.
[jira] [Updated] (HBASE-5528) Retry splitting log if failed in the process of ServerShutdownHandler, and abort master when retries exhausted
[ https://issues.apache.org/jira/browse/HBASE-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5528:

    Attachment: hbase-5528v2.patch

Patch v2: add Runtime.getRuntime().halt(1) after the filesystem check fails.

Retry splitting log if failed in the process of ServerShutdownHandler, and abort master when retries exhausted
---------------------------------------------------------------------------------------------------------------

                 Key: HBASE-5528
                 URL: https://issues.apache.org/jira/browse/HBASE-5528
             Project: HBase
          Issue Type: Bug
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: hbase-5528.patch, hbase-5528v2.patch

We will retry splitting the log if it fails in splitLogAfterStartup when the master starts. However, there is no retry for failed log splitting in the process of ServerShutdownHandler. Also, if we finally fail to split the log, we should abort the master even if the filesystem is OK, to prevent data loss.
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5270:

    Attachment: hbase-5270v10.patch

Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
------------------------------------------------------------------------------------------------------

                 Key: HBASE-5270
                 URL: https://issues.apache.org/jira/browse/HBASE-5270
             Project: HBase
          Issue Type: Sub-task
          Components: master
            Reporter: Zhihong Yu
            Assignee: chunhui shen
             Fix For: 0.92.2
         Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 5270-testcasev2.patch, hbase-5270.patch, hbase-5270v10.patch, hbase-5270v2.patch, hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, hbase-5270v7.patch, hbase-5270v8.patch, hbase-5270v9.patch, sampletest.txt

This JIRA continues the effort from HBASE-5179. Starting with Stack's comments about patches for 0.92 and TRUNK:

Reviewing 0.92v17: isDeadServerInProgress is a new public method in ServerManager, but it does not seem to be used anywhere. Does isDeadRootServerInProgress need to be public? Ditto for the meta version. The method param names are not right ('definitiveRootServer'): what is meant by definitive? Do they need this qualifier? Is there anything in place to stop us expiring a server twice if it is carrying root and meta? What is the difference between asking the assignment manager isCarryingRoot and this variable that is passed in? Should be doc'd at least. Ditto for meta.

I think I've asked for this a few times: onlineServers needs to be explained, either in javadoc or in a comment. This is the param passed into joinCluster. How does it arise? I think I know, but am unsure. God love the poor noob that comes a-wandering this code trying to make sense of it all. It looks like we get the list by trawling zk for regionserver znodes that have not checked in. Don't we do this operation earlier in master setup? Are we doing it again here?

Though distributed log splitting is configured, with this patch we will do single-process splitting in the master under some conditions. It is not explained in the code why we would do this. Why do we think master log splitting 'high priority' when it could very well be slower? Should we only go this route if distributed splitting is not going on? Do we know whether concurrent distributed log splitting and master splitting works?

Why would we have dead servers in progress here in master startup? Because a servershutdownhandler fired?

This patch is different from the patch for 0.90. It should go into trunk first with tests, then 0.92. Should it be in this issue? This issue is really hard to follow now. Maybe this issue is for 0.90.x, with a new issue for more work on this trunk patch? This patch needs to have the v18 differences applied.
[jira] [Updated] (HBASE-5501) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chunhui shen updated HBASE-5501:

    Attachment: hbase-5501.patch

In hbase-5501.patch, we prevent processing of SSH before ROOT and META have finished being assigned, and we use the same onlineServers list for splitLog and joinCluster during initialization.

Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
------------------------------------------------------------------------------------------------------

                 Key: HBASE-5501
                 URL: https://issues.apache.org/jira/browse/HBASE-5501
             Project: HBase
          Issue Type: Bug
            Reporter: chunhui shen
            Assignee: chunhui shen
         Attachments: hbase-5501.patch

In a live cluster, we do the following steps:
1. Kill the master.
2. Start the master; the master begins initializing.
3. The master completes splitLog.
4. Kill the META server.
5. The master starts assigning ROOT and META.
6. META region data can now be lost, since we may assign the META region before SSH finishes splitting the logs of the dead META server.
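A hedged sketch of the "prevent SSH before ROOT and META are assigned" part; the flag name mirrors later HBase code, but treat it as an assumption here:
{code}
@Override
public void process() throws IOException {
  if (!this.services.isServerShutdownHandlerEnabled()) {
    // The master is still initializing (ROOT/META not yet assigned):
    // requeue ourselves instead of splitting logs and assigning now.
    LOG.info("Master is initializing; delaying shutdown processing of "
        + this.serverName);
    this.services.getExecutorService().submit(this);
    return;
  }
  // ... split the dead server's logs, then reassign its regions ...
}
{code}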
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5270: Attachment: hbase-5270v9.patch
[jira] [Updated] (HBASE-5454) Refuse operations from Admin before master is initialized
[ https://issues.apache.org/jira/browse/HBASE-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5454: Summary: Refuse operations from Admin before master is initialized (was: Refuse operations from Admin befor master is initialized) Refuse operations from Admin before master is initialized - Key: HBASE-5454 URL: https://issues.apache.org/jira/browse/HBASE-5454 Project: HBase Issue Type: Improvement Reporter: chunhui shen Attachments: hbase-5454.patch, hbase-5454v2.patch In our testing environment, while the master was initializing we found conflicts between master#assignAllUserRegions and the EnableTable event, which caused region assignment to throw an exception and the master to abort itself. We think we'd better refuse operations from Admin, such as CreateTable, EnableTable, etc., while the master is initializing; it would reduce such errors. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
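A minimal sketch of the idea in hbase-5454.patch, under the assumption that the master keeps an initialized flag and every admin entry point checks it first; MasterNotReadyException and the method names here are made up for illustration.
{code}
// Hypothetical sketch (not the patch): admin-facing operations first check an
// "initialized" flag and bail out instead of racing with assignAllUserRegions.
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class AdminGuardSketch {
  static class MasterNotReadyException extends IOException {
    MasterNotReadyException(String msg) { super(msg); }
  }

  private final AtomicBoolean initialized = new AtomicBoolean(false);

  void markInitialized() { initialized.set(true); }   // called once startup assignment is done

  private void checkInitialized() throws MasterNotReadyException {
    if (!initialized.get()) {
      throw new MasterNotReadyException("Master is initializing; retry the admin operation later");
    }
  }

  void createTable(String tableName) throws IOException {
    checkInitialized();                                // refuse CreateTable until ready
    // ... actual table creation would go here ...
  }

  void enableTable(String tableName) throws IOException {
    checkInitialized();                                // refuse EnableTable until ready
    // ... actual enable logic would go here ...
  }
}
{code}
A client that hits this exception can simply retry once the master finishes its startup assignment.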
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5270: Attachment: hbase-5270v7.patch
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5270: Attachment: hbase-5270v8.patch
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5270: Attachment: hbase-5270v5.patch
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5270: Attachment: hbase-5270v6.patch I'm sorry for the mistake with ConcurrentHashSet. Thanks, Ted.
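For reference, the JDK has no ConcurrentHashSet class, which is presumably why this was easy to get wrong; the standard idiom builds a concurrent Set view over a ConcurrentHashMap:
{code}
// The JDK offers no ConcurrentHashSet; the usual idiom is a Set view backed
// by a ConcurrentHashMap via Collections.newSetFromMap (available since Java 6).
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentSetExample {
  public static void main(String[] args) {
    Set<String> deadServers =
        Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
    deadServers.add("r03f11025.yh.aliyun.com,60020,1329127549907");
    System.out.println(deadServers.contains("r03f11025.yh.aliyun.com,60020,1329127549907"));
  }
}
{code}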
[jira] [Updated] (HBASE-5454) Refuse operations from Admin befor master is initialized
[ https://issues.apache.org/jira/browse/HBASE-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5454: Attachment: hbase-5454.patch
[jira] [Updated] (HBASE-5422) StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins)
[ https://issues.apache.org/jira/browse/HBASE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5422: Attachment: hbase-5422v3.patch Thanks for Ted's review. StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins) -- Key: HBASE-5422 URL: https://issues.apache.org/jira/browse/HBASE-5422 Project: HBase Issue Type: Bug Components: master Reporter: chunhui shen Attachments: 5422-90.patch, 5422-90v2.patch, hbase-5422.patch, hbase-5422v2.patch, hbase-5422v3.patch In our production environment we saw a lot of RIT timeouts when the cluster started up; there are about 70,000 regions in the cluster (25 regionservers). First, consider the following log for the region 33cf229845b1009aa8a3f7b0f85c9bd0. Master's log:
2012-02-13 18:07:41,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x348f4a94723da5 Async create of unassigned node for 33cf229845b1009aa8a3f7b0f85c9bd0 with OFFLINE state
2012-02-13 18:07:42,560 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback: rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=OFFLINE, ts=1329127661409, server=r03f11025.yh.aliyun.com,60020,1329127549907
2012-02-13 18:07:42,996 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback: rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=OFFLINE, ts=1329127661409
2012-02-13 18:10:48,072 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=PENDING_OPEN, ts=1329127662996
2012-02-13 18:10:48,072 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
2012-02-13 18:11:16,744 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=r03f11025.yh.aliyun.com,60020,1329127549907, region=33cf229845b1009aa8a3f7b0f85c9bd0
2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 33cf229845b1009aa8a3f7b0f85c9bd0; deleting unassigned node
2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x348f4a94723da5 Deleting existing unassigned node for 33cf229845b1009aa8a3f7b0f85c9bd0 that is in expected state RS_ZK_REGION_OPENED
2012-02-13 18:38:07,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x348f4a94723da5 Successfully deleted unassigned node for region 33cf229845b1009aa8a3f7b0f85c9bd0 in expected state RS_ZK_REGION_OPENED
2012-02-13 18:38:07,573 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. on r03f11025.yh.aliyun.com,60020,1329127549907
2012-02-13 18:50:54,428 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. so generated a random one; hri=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0., src=, dest=r01b05043.yh.aliyun.com,60020,1329127549041; 29 (online=29, exclude=null) available servers
2012-02-13 18:50:54,428 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. to r01b05043.yh.aliyun.com,60020,1329127549041
2012-02-13 19:31:50,514 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=PENDING_OPEN, ts=1329132528086
2012-02-13 19:31:50,514 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
Regionserver's log:
2012-02-13 18:07:43,537 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
2012-02-13 18:11:16,560 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
From the RS's log, we can see that more than 3 minutes elapsed between receiving the openRegion request and starting to process it, causing the RIT timeout in the master for this region.
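For orientation, the "Regions in transition timed out ... PENDING_OPEN for too long" lines above come from the master's periodic timeout monitor. A schematic of that check, with hypothetical field and method names rather than HBase source:
{code}
// Schematic of the RIT timeout check implied by the log above (not HBase code):
// any region stuck in PENDING_OPEN longer than the timeout gets reassigned.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RitTimeoutSketch {
  enum State { OFFLINE, PENDING_OPEN, OPEN }

  static class RegionStateNode {
    final String regionName;
    volatile State state;
    volatile long stamp;                  // last transition time, ms
    RegionStateNode(String name, State s, long ts) { regionName = name; state = s; stamp = ts; }
  }

  private final Map<String, RegionStateNode> regionsInTransition =
      new ConcurrentHashMap<String, RegionStateNode>();
  private final long timeoutMs = 3 * 60 * 1000L;   // the 3-minute RIT timeout in the report

  void chore(long now) {
    for (RegionStateNode rs : regionsInTransition.values()) {
      if (rs.state == State.PENDING_OPEN && now - rs.stamp > timeoutMs) {
        System.out.println("Region has been PENDING_OPEN for too long, reassigning region=" + rs.regionName);
        // reassignment would run here; without a RegionPlan a random server is chosen,
        // which is exactly the "generated a random one" line in the master log above.
      }
    }
  }
}
{code}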
[jira] [Updated] (HBASE-5422) StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins)
[ https://issues.apache.org/jira/browse/HBASE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5422: Attachment: 5422-90v3.patch
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5270: Attachment: hbase-5270v4.patch
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5270: Attachment: 5270-90v3.patch Addresses Stack's comments in 5270-90v3.patch.
[jira] [Updated] (HBASE-5422) StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins)
[ https://issues.apache.org/jira/browse/HBASE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5422: Attachment: 5422-90v2.patch hbase-5422v2.patch Made an addPlan method that takes a Map of plans in patch v2.
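A minimal sketch of the "addPlan method that takes a Map of plans" mentioned above (hypothetical signatures, not the committed patch): registering the whole startup batch of plans in one step means the timeout monitor can see where each region was sent, instead of generating a random plan and reassigning a region that is still being opened.
{code}
// Sketch of a bulk addPlans (hypothetical, mirroring the comment above):
// register every startup RegionPlan in one synchronized step.
import java.util.HashMap;
import java.util.Map;

public class BulkPlanSketch {
  static class RegionPlan {
    final String encodedRegionName;
    final String destinationServer;
    RegionPlan(String region, String dest) { encodedRegionName = region; destinationServer = dest; }
  }

  private final Map<String, RegionPlan> regionPlans = new HashMap<String, RegionPlan>();

  /** Bulk variant: one lock acquisition for the whole startup batch. */
  synchronized void addPlans(Map<String, RegionPlan> plans) {
    regionPlans.putAll(plans);
  }

  synchronized RegionPlan getPlan(String encodedRegionName) {
    return regionPlans.get(encodedRegionName);
  }
}
{code}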
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5270: Attachment: 5270-testcasev2.patch hbase-5270v2.patch Optimized the testcase following Stack's sample; hbase-5270v2 is a patch that fixes the issue for trunk, including the testcase.
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5270: Attachment: 5270-90v2.patch 5270-90-testcasev2.patch Testcase and patch for the 0.90 branch.
[jira] [Updated] (HBASE-5422) StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins)
[ https://issues.apache.org/jira/browse/HBASE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5422: Attachment: hbase-5422.patch
[jira] [Updated] (HBASE-5422) StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins)
[ https://issues.apache.org/jira/browse/HBASE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5422: Attachment: 5422-90.patch
[jira] [Updated] (HBASE-5423) Regionserver may block on waitOnAllRegionsToClose when aborting
[ https://issues.apache.org/jira/browse/HBASE-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5423: Attachment: hbase-5423.patch We needn't wait for the failedClose regions to close. Regionserver may block on waitOnAllRegionsToClose when aborting --- Key: HBASE-5423 URL: https://issues.apache.org/jira/browse/HBASE-5423 Project: HBase Issue Type: Bug Components: regionserver Reporter: chunhui shen Assignee: chunhui shen Attachments: hbase-5423.patch If closeRegion throws any exception (for example, one caused by the FS) while the RS is aborting, the RS will block forever in waitOnAllRegionsToClose(). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
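A rough sketch of the fix described here (names are illustrative, not the patch itself): when aborting, stop waiting once the only regions still listed as online are the ones whose close already failed.
{code}
// Sketch (not the patch): when aborting, give up waiting once every remaining
// online region is one whose close already failed, instead of spinning forever.
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class WaitOnCloseSketch {
  private final Map<String, Object> onlineRegions = new ConcurrentHashMap<String, Object>();
  private final Set<String> failedClose = Collections.synchronizedSet(new HashSet<String>());

  void waitOnAllRegionsToClose(boolean aborting) throws InterruptedException {
    while (!onlineRegions.isEmpty()) {
      // Regions whose close failed never leave onlineRegions; once only those
      // remain, waiting longer cannot help an aborting regionserver.
      if (aborting && failedClose.containsAll(onlineRegions.keySet())) {
        return;
      }
      Thread.sleep(200);   // poll, as a simple stand-in for the real wait loop
    }
  }
}
{code}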
[jira] [Updated] (HBASE-5423) Regionserver may block forever on waitOnAllRegionsToClose when aborting
[ https://issues.apache.org/jira/browse/HBASE-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5423: Summary: Regionserver may block forever on waitOnAllRegionsToClose when aborting (was: Regionserver may block on waitOnAllRegionsToClose when aborting)
[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5270: Attachment: 5270-90-testcase.patch 5270-90.patch 5270-testcase.patch hbase-5270.patch I have written a testcase that shows the existing problem, in 5270-testcase.patch; hbase-5270.patch is the combination of the testcase and HBASE-5179's latest patch. 5270-90.patch and 5270-90-testcase.patch are for the 0.90 branch. For now the testcase may not cover all the situations that cause problems, and it runs slowly; I will optimize it later.
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: 5179-90v16.patch
@Zhihong 90-v16 adds some logic about when to waitUntilNoLogDir and passes Jinchao's test. Please take a review. Thanks.
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 5179-90v16.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: hbase-5179v17.patch
hbase-5179v17.patch is for trunk, changing the DeadServer method names to match 90v17.txt.
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 5179-90v16.patch, 5179-90v17.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v17.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: 5179-90v12.patch
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: hbase-5179v12.patch
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: 5179-90v14.patch
90v13 has a mistake in waitUntilNoLogDir:
{code}
if (this.fileSystemManager.logDirExists(serverName) != null) {
  break;
}
{code}
should be changed to
{code}
if (this.fileSystemManager.logDirExists(serverName) == null) {
  break;
}
{code}
The patch also prevents an NPE around it.
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
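Putting the corrected condition in context, here is a self-contained sketch of a wait loop of this shape. The method and field names mirror the quoted snippet, but the timeout handling and the null guard are assumptions for illustration, not the exact 90v14 patch.
{code}
class LogDirWaitSketch {
  interface FsManager {
    Object logDirExists(String serverName); // a Path in the real code, null if gone
  }

  static boolean waitUntilNoLogDir(FsManager fileSystemManager,
                                   String serverName, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      // Exit when the log dir is GONE (== null), i.e. splitting finished;
      // the 90v13 bug exited while the dir still existed (!= null).
      if (fileSystemManager == null
          || fileSystemManager.logDirExists(serverName) == null) {
        return true; // safe to assign the dead server's regions
      }
      Thread.sleep(100);
    }
    return false; // timed out; the caller decides what to do
  }
}
{code}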
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: 5179-90v15.patch
@Jinchao I find 90v14 also has a problem. I'm sorry. Could you use v15? Thanks.
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: hbase-5179v10.patch
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: 5179-90v8.patch
@Zhihong @Jinchao Based on 90v7.patch, 90v8 changes waiting forever into waiting with a timeout, and adds a guarantee for the case where an RS is dead but its zk node still exists when the master starts, so the RS is not considered a known server and its regions would otherwise be assigned without log splitting. Thanks for the comments.
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
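A hypothetical sketch of the 90v8 guarantee described above: at master startup, any regionserver znode with no live server behind it is expired through the normal dead-server path, so its logs get split before its regions are assigned. All names here are illustrative, not the actual patch.
{code}
import java.util.List;
import java.util.Set;

class StartupDeadServerSketch {
  interface Master {
    List<String> getRegionServerZNodes();   // znodes under /hbase/rs
    Set<String> getOnlineServers();         // servers that checked in
    void expireServer(String serverName);   // queues a ServerShutdownHandler
  }

  static void expireZombieServers(Master master) {
    for (String znode : master.getRegionServerZNodes()) {
      if (!master.getOnlineServers().contains(znode)) {
        // A dead RS whose znode survived: without this, it would not be a
        // "known server" and its regions could be assigned without log
        // splitting.
        master.expireServer(znode);
      }
    }
  }
}
{code}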
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: hbase-5179v8.patch
hbase-5179v8.patch is for trunk; it is used to prevent meta data loss.
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: 5179-90v9.patch
Addressed Zhihong's and Jinchao's comments.
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: hbase-5179v9.patch
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: hbase-5179v6.patch
In the v6 patch, I add the isDeadServerBeingProcessed logic for trunk and place it in ServerManager.
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: hbase-5179v7.patch
v6 has a conflict with trunk.
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: 5179-90v5.patch
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: 5179-90v6.patch
@Zhihong @Jinchao In the 90v6 patch, I add the logic of expiring servers while the master is initializing, which is already done in trunk. I also add a guarantee that log splitting is completed before assigning META, for the case mentioned by Jinchao. What do you think? Please do check.
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: 5179-90v7.patch
Changed 90v6 a little; please see 90v7.
Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: hbase-5179v5.patch
In patch v5, I add javadoc to explain getDeadServersBeingProcessed() and getDeadServers(), and also add more in DeadServer about deadServersBeingProcessed. About Stack's comment that a server is either in inProgress or in the deadServers list: I think a server could be in both the processingDeadServers list and the deadServers list. The deadServers list only stores one instance per regionserver, but the processingDeadServers list may store multiple instances for one regionserver, each with a different startcode.
Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: 5179-90.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
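An illustrative sketch of the two collections described above (the field names are assumptions, not the actual DeadServer class): the dead set keeps one entry per hostname, while the in-progress list may hold several entries for the same host, one per startcode, because a server can die and restart repeatedly.
{code}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DeadServerSketch {
  // e.g. "host1.example.com" -> one entry, however many times it died
  private final Set<String> deadServers = new HashSet<>();
  // e.g. "host1.example.com,60020,1324827866085" -> one entry per startcode
  private final List<String> processingDeadServers = new ArrayList<>();

  synchronized void add(String serverNameWithStartcode) {
    deadServers.add(hostOf(serverNameWithStartcode));
    processingDeadServers.add(serverNameWithStartcode);
  }

  synchronized void finish(String serverNameWithStartcode) {
    processingDeadServers.remove(serverNameWithStartcode);
  }

  synchronized boolean isDeadServerInProgress(String host) {
    for (String s : processingDeadServers) {
      if (hostOf(s).equals(host)) return true;
    }
    return false;
  }

  private static String hostOf(String serverNameWithStartcode) {
    int i = serverNameWithStartcode.indexOf(',');
    return i < 0 ? serverNameWithStartcode : serverNameWithStartcode.substring(0, i);
  }
}
{code}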
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: 5179-90v2.patch
Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: 5179-90.txt, 5179-90v2.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5165) Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause a deleted region to be assigned again
[ https://issues.apache.org/jira/browse/HBASE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5165: Attachment: hbase-5165v2.patch
Modified it per Ted's comment in 5165v2.patch (DeleteTableHandler doesn't set the table enabled until no ServerShutdownHandler is running). Since fixupDaughters() will not be executed if the table is disabled, I think patch v2 fixes this issue, including HBASE-5155. Waiting for a better approach.
Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause a deleted region to be assigned again -- Key: HBASE-5165 URL: https://issues.apache.org/jira/browse/HBASE-5165 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: chunhui shen Attachments: hbase-5165.patch, hbase-5165v2.patch
Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause the following situation:
1. The table has already been disabled.
2. ServerShutdownHandler is doing MetaReader.getServerUserRegions.
3. While step 2 is processing, or has just completed, DeleteTableHandler starts to delete the region (removing the region from META and deleting the region from the FS).
4. DeleteTableHandler sets the table enabled.
5. ServerShutdownHandler starts to assign the region, which has already been deleted by DeleteTableHandler.
The result of the above operations is an invalid record in .META. that can't be fixed by hbck.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
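A minimal sketch of the ordering guard described in the comment above (the names are hypothetical): the delete handler only flips the table back to enabled once no ServerShutdownHandler is in flight, so a concurrent handler cannot re-assign a region that was just removed from .META. and the FS.
{code}
class DeleteTableOrderingSketch {
  interface ServerShutdownTracker { boolean anyShutdownHandlerRunning(); }

  static void finishDelete(ServerShutdownTracker tracker,
                           Runnable setTableEnabled) throws InterruptedException {
    while (tracker.anyShutdownHandlerRunning()) {
      Thread.sleep(100); // wait out in-flight shutdown handlers
    }
    // Only now is it safe to mark the (deleted) table enabled again;
    // while it stays disabled, fixupDaughters() will not resurrect its regions.
    setTableEnabled.run();
  }
}
{code}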
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Affects Version/s: 0.90.2
Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master is completing splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Attachment: hbase-5179.patch
In the patch, we ensure that servers in ProcessingDeadServers are not processed in the master's failover processing.
Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: hbase-5179.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master is completing splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
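A sketch of the core idea of that patch, with illustrative names only: when the failing-over master rebuilds user regions, it skips regions whose last host is still being processed by a ServerShutdownHandler, leaving the assignment to that handler after log splitting finishes.
{code}
import java.util.Map;

class FailoverSkipSketch {
  interface DeadServers { boolean isDeadServerInProgress(String host); }
  interface Assigner { void assign(String regionName); }

  static void rebuildUserRegions(Map<String, String> regionToLastHost,
                                 DeadServers dead, Assigner assigner) {
    for (Map.Entry<String, String> e : regionToLastHost.entrySet()) {
      if (dead.isDeadServerInProgress(e.getValue())) {
        continue; // the ServerShutdownHandler will assign it after log splitting
      }
      assigner.assign(e.getKey());
    }
  }
}
{code}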
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5179: Description:
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
was:
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master is completing splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: hbase-5179.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because step 3 marked it as a dead server.
However, while step 4 (region assignment) runs, ServerShutdownHandler may still be splitting the log; therefore, data may be lost.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5152) Region is on service before initialization completes when doing rollback of split, which will affect read correctness
[ https://issues.apache.org/jira/browse/HBASE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5152: Summary: Region is on service before initialization completes when doing rollback of split, which will affect read correctness (was: Region is on service before completing initialized when doing rollback of split, it will affect readcorrectness)
Region is on service before initialization completes when doing rollback of split, which will affect read correctness - Key: HBASE-5152 URL: https://issues.apache.org/jira/browse/HBASE-5152 Project: HBase Issue Type: Bug Reporter: chunhui shen Attachments: hbase-5152.patch
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5165) Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause a deleted region to be assigned again
[ https://issues.apache.org/jira/browse/HBASE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5165: Attachment: hbase-5165.patch
Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause a deleted region to be assigned again -- Key: HBASE-5165 URL: https://issues.apache.org/jira/browse/HBASE-5165 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: chunhui shen Attachments: hbase-5165.patch
Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause the following situation:
1. The table has already been disabled.
2. ServerShutdownHandler is doing MetaReader.getServerUserRegions.
3. While step 2 is processing, or has just completed, DeleteTableHandler starts to delete the region (removing the region from META and deleting the region from the FS).
4. DeleteTableHandler sets the table enabled.
5. ServerShutdownHandler starts to assign the region, which has already been deleted by DeleteTableHandler.
The result of the above operations is an invalid record in .META. that can't be fixed by hbck.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5121) MajorCompaction may affect scan's correctness
[ https://issues.apache.org/jira/browse/HBASE-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5121: Attachment: hbase-5121-testcase.patch hbase-5121v2.patch
MajorCompaction may affect scan's correctness - Key: HBASE-5121 URL: https://issues.apache.org/jira/browse/HBASE-5121 Project: HBase Issue Type: Bug Components: regionserver Reporter: chunhui shen Assignee: chunhui shen Attachments: hbase-5121-testcase.patch, hbase-5121.patch, hbase-5121v2.patch
In our test there are KeyValues for two families in one row, and we found an infrequent problem in scan's next() when a majorCompaction happens concurrently. In the client's two consecutive calls to scan.next():
1. The first time, next() returns a result where family A is null.
2. The second time, next() returns a result where family B is null.
The two next() results have the same row. If there are more families, I think the scenario will be even stranger... We found the reason: storescanner.peek() is changed after a majorCompaction if there are delete-type KeyValues. This change means the PriorityQueue<KeyValueScanner> of the RegionScanner's heap is no longer guaranteed to be sorted.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
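An illustrative sketch of why the heap breaks and one way to restore the invariant (types are simplified; this is not the actual HBase fix): a PriorityQueue only re-orders an element when it is removed and re-added, so if a scanner's peek() silently changes after a major compaction, the heap must be rebuilt before the next poll.
{code}
import java.util.ArrayList;
import java.util.PriorityQueue;

class HeapResortSketch<S extends Comparable<S>> {
  private PriorityQueue<S> heap = new PriorityQueue<>();

  void add(S scanner) { heap.add(scanner); }

  // Call after any event (e.g. a major compaction swapping readers) that
  // may have changed a scanner's current key out from under the heap.
  // Copying through a plain list forces a full re-heapify.
  void resort() {
    heap = new PriorityQueue<>(new ArrayList<>(heap));
  }

  S pollLowest() { return heap.poll(); }
}
{code}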
[jira] [Updated] (HBASE-5121) MajorCompaction may affect scan's correctness
[ https://issues.apache.org/jira/browse/HBASE-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5121: Attachment: hbase-5121.patch
MajorCompaction may affect scan's correctness - Key: HBASE-5121 URL: https://issues.apache.org/jira/browse/HBASE-5121 Project: HBase Issue Type: Bug Components: regionserver Reporter: chunhui shen Attachments: hbase-5121.patch
In our test there are KeyValues for two families in one row, and we found an infrequent problem in scan's next() when a majorCompaction happens concurrently. In the client's two consecutive calls to scan.next():
1. The first time, next() returns a result where family A is null.
2. The second time, next() returns a result where family B is null.
The two next() results have the same row. If there are more families, I think the scenario will be even stranger... We found the reason: storescanner.peek() is changed after a majorCompaction if there are delete-type KeyValues. This change means the PriorityQueue<KeyValueScanner> of the RegionScanner's heap is no longer guaranteed to be sorted.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-5100) Rollback of split would cause a closed region to be opened
[ https://issues.apache.org/jira/browse/HBASE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5100: Attachment: hbase-5100.patch
Rollback of split would cause a closed region to be opened -- Key: HBASE-5100 URL: https://issues.apache.org/jira/browse/HBASE-5100 Project: HBase Issue Type: Bug Reporter: chunhui shen Assignee: chunhui shen Attachments: hbase-5100.patch
If the master sending CLOSE to an RS and the region's split transaction happen concurrently, a closed region may be opened again. See the detailed code in SplitTransaction#createDaughters:
{code}
List<StoreFile> hstoreFilesToSplit = null;
try {
  hstoreFilesToSplit = this.parent.close(false);
  if (hstoreFilesToSplit == null) {
    // The region was closed by a concurrent thread. We can't continue
    // with the split, instead we must just abandon the split. If we
    // reopen or split this could cause problems because the region has
    // probably already been moved to a different server, or is in the
    // process of moving to a different server.
    throw new IOException("Failed to close region: already closed by " +
      "another thread");
  }
} finally {
  this.journal.add(JournalEntry.CLOSED_PARENT_REGION);
}
{code}
When rolling back, the JournalEntry.CLOSED_PARENT_REGION entry causes this.parent.initialize() to be called. Although this region is not onlined in the regionserver, it may bring some potential problems. For example, in our environment the closed parent region was rolled back successfully and then started compaction and split again. The parent region is f892dd6107b6b4130199582abc78e9c1.
master log:
{code}
2011-12-26 00:24:42,693 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1., src=dw87.kgb.sqa.cm4,60020,1324827866085, dest=dw80.kgb.sqa.cm4,60020,1324827865780
2011-12-26 00:24:42,693 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. (offlining)
2011-12-26 00:24:42,694 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=dw87.kgb.sqa.cm4,60020,1324827866085, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1.
2011-12-26 00:24:42,699 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase-tbfs/unassigned/f892dd6107b6b4130199582abc78e9c1 (region=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1., server=dw87.kgb.sqa.cm4,60020,1324827866085, state=RS_ZK_REGION_CLOSING)
2011-12-26 00:24:42,699 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSING, server=dw87.kgb.sqa.cm4,60020,1324827866085, region=f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,348 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=dw87.kgb.sqa.cm4,60020,1324827866085, region=f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. state=CLOSED, ts=1324830285347
2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x13447f283f40e73 Creating (or updating) unassigned node for f892dd6107b6b4130199582abc78e9c1 with OFFLINE state
2011-12-26 00:24:45,354 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=dw75.kgb.sqa.cm4:6, region=f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,354 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. destination server is + serverName=dw80.kgb.sqa.cm4,60020,1324827865780, load=(requests=0, regions=1,
{code}
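A hedged sketch of the fix direction (the types are reduced to a toy model; this is not the actual patch): journal CLOSED_PARENT_REGION only when this split really closed the parent, so a rollback can never re-initialize, and effectively re-open, a region that was closed by a concurrent thread.
{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SplitJournalSketch {
  enum JournalEntry { CLOSED_PARENT_REGION }
  interface Parent { List<Object> close(boolean abort) throws IOException; }

  private final List<JournalEntry> journal = new ArrayList<>();

  void closeParent(Parent parent) throws IOException {
    List<Object> storeFiles = parent.close(false);
    if (storeFiles == null) {
      // Closed by a concurrent thread (e.g. the master's CLOSE): abandon
      // the split WITHOUT journaling, unlike the quoted finally block, so
      // rollback will not call parent.initialize() on a closed region.
      throw new IOException("Failed to close region: already closed by another thread");
    }
    journal.add(JournalEntry.CLOSED_PARENT_REGION);
  }
}
{code}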
[jira] [Updated] (HBASE-5020) MetaReader#fullScan doesn't stop scanning when visitor returns false in the 0.90 version
[ https://issues.apache.org/jira/browse/HBASE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-5020: Attachment: hbase-5020.patch
MetaReader#fullScan doesn't stop scanning when visitor returns false in the 0.90 version Key: HBASE-5020 URL: https://issues.apache.org/jira/browse/HBASE-5020 Project: HBase Issue Type: Bug Reporter: chunhui shen Attachments: hbase-5020.patch
In the current 0.90 code:
{code}
public static void fullScan(CatalogTracker catalogTracker,
    final Visitor visitor, final byte [] startrow)
throws IOException {
  HRegionInterface metaServer =
    catalogTracker.waitForMetaServerConnectionDefault();
  Scan scan = new Scan();
  if (startrow != null) scan.setStartRow(startrow);
  scan.addFamily(HConstants.CATALOG_FAMILY);
  long scannerid = metaServer.openScanner(
      HRegionInfo.FIRST_META_REGIONINFO.getRegionName(), scan);
  try {
    Result data;
    while((data = metaServer.next(scannerid)) != null) {
      if (!data.isEmpty()) visitor.visit(data);
    }
  } finally {
    metaServer.close(scannerid);
  }
  return;
}
{code}
If visitor.visit(data) returns false, the scan does not stop. However, this does not match the description of Visitor:
{code}
public interface Visitor {
  /**
   * Visit the catalog table row.
   * @param r A row from catalog table
   * @return True if we are to proceed scanning the table, else false if
   * we are to stop now.
   */
  public boolean visit(final Result r) throws IOException;
}
{code}
I think it is an omission; trunk does not have this hole.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
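A minimal sketch of the fix direction, with toy interfaces standing in for the 0.90 API (the actual patch may differ, and the empty-Result check from the quoted loop is elided here): honor the Visitor contract by stopping the scan as soon as visit() returns false.
{code}
import java.io.IOException;

class FullScanSketch {
  interface Visitor { boolean visit(Object r) throws IOException; }
  interface MetaServer { Object next(long scannerId) throws IOException; }

  static void fullScan(MetaServer metaServer, long scannerId, Visitor visitor)
      throws IOException {
    Object data;
    while ((data = metaServer.next(scannerId)) != null) {
      if (!visitor.visit(data)) {
        break; // false means "stop scanning", per the Visitor javadoc
      }
    }
  }
}
{code}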
[jira] [Updated] (HBASE-4880) Region is on service before openRegionHandler completes, which may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4880: Attachment: hbase-4880v4.patch
Region is on service before openRegionHandler completes, which may cause data loss Key: HBASE-4880 URL: https://issues.apache.org/jira/browse/HBASE-4880 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: chunhui shen Assignee: chunhui shen Attachments: 4880.txt, hbase-4880.patch, hbase-4880v2.patch, hbase-4880v3.patch, hbase-4880v4.patch
OpenRegionHandler in the regionserver proceeds in the following steps:
{code}
1. openRegion() (through it, closed = false, closing = false)
2. addToOnlineRegions(region)
3. update the .META. table
4. update the ZK node state to RS_ZK_REGION_OPENED
{code}
We can see that the region is on service before step 4, which means a client could put data to this region after step 3. What will happen if step 4 fails? OpenRegionHandler#cleanupFailedOpen will be executed, which closes the region, and the master assigns this region to another regionserver. If closing the region fails, the data put between step 3 and step 4 may be lost, because the region has been opened on another regionserver and new data has been put there. Therefore, the data may not be recovered through replayRecoveredEdits(), because those edits' LogSeqIds are smaller than the current region SeqId.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
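One way to picture the window is to reorder the steps so the region only becomes servable after the ZK transition succeeds. This sketch is speculative (invented names, not the attached patch): it simply shows that if the old step 2 runs last, a failed step 4 can no longer strand client edits.
{code}
import java.io.IOException;

class OpenRegionOrderSketch {
  interface RegionOpenSteps {
    void openRegion() throws IOException;       // step 1
    void updateMetaTable() throws IOException;  // step 3
    boolean transitionZkToOpened();             // step 4
    void addToOnlineRegions();                  // step 2, moved last
    void cleanupFailedOpen() throws IOException;
  }

  static void openSafely(RegionOpenSteps r) throws IOException {
    r.openRegion();
    r.updateMetaTable();
    if (!r.transitionZkToOpened()) {
      r.cleanupFailedOpen(); // nothing was served yet, so no edits to lose
      return;
    }
    r.addToOnlineRegions();  // only now can clients put data to the region
  }
}
{code}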
[jira] [Updated] (HBASE-4988) MetaServer crash causes all splitting regionservers to abort
[ https://issues.apache.org/jira/browse/HBASE-4988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4988: Attachment: hbase-4988v1.patch

MetaServer crash causes all splitting regionservers to abort
Key: HBASE-4988 URL: https://issues.apache.org/jira/browse/HBASE-4988 Project: HBase Issue Type: Bug Reporter: chunhui shen Attachments: hbase-4988v1.patch

If the meta server crashes, every regionserver that is in the middle of a split will abort itself, because of this code:
{code}
this.journal.add(JournalEntry.PONR);
MetaEditor.offlineParentInMeta(server.getCatalogTracker(),
    this.parent.getRegionInfo(), a.getRegionInfo(), b.getRegionInfo());
{code}
Once the journal contains the PONR entry, rolling the split back aborts the regionserver. Under a heavy write load this is terrible when the meta server crashes.
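One possible mitigation, sketched below (not necessarily what the attached patch does), is to confirm that .META. is reachable before the point of no return, so that a meta-server crash triggers an ordinary rollback instead of an abort:
{code}
// Hedged sketch: check .META. availability *before* recording PONR. The
// timeout value is hypothetical; CatalogTracker#waitForMeta is assumed to
// throw if .META. does not come online in time, while rollback is still safe.
server.getCatalogTracker().waitForMeta(timeout);
this.journal.add(JournalEntry.PONR);           // from here on, rollback aborts the RS
MetaEditor.offlineParentInMeta(server.getCatalogTracker(),
    this.parent.getRegionInfo(), a.getRegionInfo(), b.getRegionInfo());
{code}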
[jira] [Updated] (HBASE-4880) Region is in service before completing openRegionHandler, may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4880: Attachment: hbase-4880v2.patch

Region is in service before completing openRegionHandler, may cause data loss
Key: HBASE-4880 URL: https://issues.apache.org/jira/browse/HBASE-4880 Project: HBase Issue Type: Bug Reporter: chunhui shen Assignee: chunhui shen Attachments: hbase-4880.patch, hbase-4880v2.patch

OpenRegionHandler in the regionserver proceeds through the following steps:
{code}
1. openRegion()  (after this, closed = false, closing = false)
2. addToOnlineRegions(region)
3. update the .META. table
4. update the ZK node state to RS_ZK_REGION_OPENED
{code}
The region is in service before step 4, which means a client can put data to the region after step 3. What happens if step 4 fails? OpenRegionHandler#cleanupFailedOpen will close the region, and the master will assign it to another regionserver. If closing the region fails, the data written between step 3 and step 4 may be lost, because the region will already have been opened on another regionserver and taken new writes; the lost edits cannot be recovered through replayRecoveredEdits(), since their LogSeqIds are smaller than the current region SeqId.
[jira] [Updated] (HBASE-4880) Region is in service before completing openRegionHandler, may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4880: Attachment: hbase-4880v3.patch

Region is in service before completing openRegionHandler, may cause data loss
Key: HBASE-4880 URL: https://issues.apache.org/jira/browse/HBASE-4880 Project: HBase Issue Type: Bug Reporter: chunhui shen Assignee: chunhui shen Attachments: hbase-4880.patch, hbase-4880v2.patch, hbase-4880v3.patch

OpenRegionHandler in the regionserver proceeds through the following steps:
{code}
1. openRegion()  (after this, closed = false, closing = false)
2. addToOnlineRegions(region)
3. update the .META. table
4. update the ZK node state to RS_ZK_REGION_OPENED
{code}
The region is in service before step 4, which means a client can put data to the region after step 3. What happens if step 4 fails? OpenRegionHandler#cleanupFailedOpen will close the region, and the master will assign it to another regionserver. If closing the region fails, the data written between step 3 and step 4 may be lost, because the region will already have been opened on another regionserver and taken new writes; the lost edits cannot be recovered through replayRecoveredEdits(), since their LogSeqIds are smaller than the current region SeqId.
[jira] [Updated] (HBASE-4899) Region can easily be assigned twice when servers are continually killed and regions moved in a testing environment
[ https://issues.apache.org/jira/browse/HBASE-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4899: Attachment: hbase-4899v2.patch

Region can easily be assigned twice when servers are continually killed and regions moved in a testing environment
Key: HBASE-4899 URL: https://issues.apache.org/jira/browse/HBASE-4899 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: chunhui shen Assignee: chunhui shen Attachments: hbase-4899.patch, hbase-4899v2.patch

Before assigning a region in ServerShutdownHandler#process, the handler checks whether the region is in RIT; however, this check does not work as expected in the following case:
1. move region A from server B to server C
2. kill server B
3. start server B immediately

Let's see what happens in the code for the above case:
{code}
for step 1:
  1.1 server B closes region A
  1.2 the master sets region A offline
      (AssignmentManager#setOffline: this.regions.remove(regionInfo))
  1.3 server C starts to open region A (not yet completed)

for step 3:
  master ServerShutdownHandler#process() for server B {
    ...
    splitlog();
    ...
    List<RegionState> regionsInTransition =
        this.services.getAssignmentManager().processServerShutdown(this.serverName);
    ...
    // Skip regions that were in transition unless CLOSING or PENDING_CLOSE
    ...
    // assign region
  }
{code}
In fact, when ServerShutdownHandler#process() runs this.services.getAssignmentManager().processServerShutdown(this.serverName), region A is in RIT (step 1.3 has not completed), but the returned List<RegionState> regionsInTransition does not contain it, because region A was already removed from AssignmentManager.regions by AssignmentManager#setOffline in step 1.2. Therefore region A is assigned twice. Killing and restarting a single server twice can just as easily cause a double assignment.

Besides the above cause, there is another possibility: a region that is in RIT is included when ServerShutdownHandler#process() executes MetaReader.getServerUserRegions, but by the time getServerUserRegions completes, the region has been opened on another server and is no longer in RIT.

In our testing environment, where balancing, moving and killing are executed periodically, double assignment happens often, and it is a nuisance because it affects other test cases.
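The kind of guard the report argues for could look like the sketch below; readRegionLocation is a hypothetical stand-in for a .META. lookup, and the attached patches may solve this differently:
{code}
// Hedged sketch: before assigning a dead server's region, re-check both the
// in-memory RIT map and where .META. currently places the region.
for (HRegionInfo hri : regionsOnDeadServer) {
  if (assignmentManager.isRegionInTransition(hri) != null) {
    continue; // an in-flight move/open already owns this region
  }
  ServerName location = readRegionLocation(catalogTracker, hri); // hypothetical helper
  if (location != null && !location.equals(deadServerName)) {
    continue; // already reopened elsewhere; assigning again would double-assign
  }
  assignmentManager.assign(hri, true);
}
{code}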
[jira] [Updated] (HBASE-4899) Region can easily be assigned twice when servers are continually killed and regions moved in a testing environment
[ https://issues.apache.org/jira/browse/HBASE-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4899: Attachment: hbase-4899v3.patch

Region can easily be assigned twice when servers are continually killed and regions moved in a testing environment
Key: HBASE-4899 URL: https://issues.apache.org/jira/browse/HBASE-4899 Project: HBase Issue Type: Bug Affects Versions: 0.92.1 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Attachments: hbase-4899.patch, hbase-4899v2.patch, hbase-4899v3.patch

Before assigning a region in ServerShutdownHandler#process, the handler checks whether the region is in RIT; however, this check does not work as expected in the following case:
1. move region A from server B to server C
2. kill server B
3. start server B immediately

Let's see what happens in the code for the above case:
{code}
for step 1:
  1.1 server B closes region A
  1.2 the master sets region A offline
      (AssignmentManager#setOffline: this.regions.remove(regionInfo))
  1.3 server C starts to open region A (not yet completed)

for step 3:
  master ServerShutdownHandler#process() for server B {
    ...
    splitlog();
    ...
    List<RegionState> regionsInTransition =
        this.services.getAssignmentManager().processServerShutdown(this.serverName);
    ...
    // Skip regions that were in transition unless CLOSING or PENDING_CLOSE
    ...
    // assign region
  }
{code}
In fact, when ServerShutdownHandler#process() runs this.services.getAssignmentManager().processServerShutdown(this.serverName), region A is in RIT (step 1.3 has not completed), but the returned List<RegionState> regionsInTransition does not contain it, because region A was already removed from AssignmentManager.regions by AssignmentManager#setOffline in step 1.2. Therefore region A is assigned twice. Killing and restarting a single server twice can just as easily cause a double assignment.

Besides the above cause, there is another possibility: a region that is in RIT is included when ServerShutdownHandler#process() executes MetaReader.getServerUserRegions, but by the time getServerUserRegions completes, the region has been opened on another server and is no longer in RIT.

In our testing environment, where balancing, moving and killing are executed periodically, double assignment happens often, and it is a nuisance because it affects other test cases.
[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4862: Attachment: hbase-4862v7fortrunk.patch hbase-4862v7for0.90.patch

Based on patch v6; updates the javadoc of HLog#getSplitEditFilesSorted.

Splitting hlog and opening region concurrently may cause data loss
Key: HBASE-4862 URL: https://issues.apache.org/jira/browse/HBASE-4862 Project: HBase Issue Type: Bug Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4862-v6-90.txt, 4862-v6-trunk.patch, 4862.patch, 4862.txt, hbase-4862v1 for 0.90.diff, hbase-4862v1 for 0.90.diff, hbase-4862v1 for trunk.diff, hbase-4862v1 for trunk.diff, hbase-4862v2for0.90.diff, hbase-4862v2fortrunk.diff, hbase-4862v3for0.90.diff, hbase-4862v3fortrunk.diff, hbase-4862v5for0.90.diff, hbase-4862v5fortrunk.diff, hbase-4862v7for0.90.patch, hbase-4862v7fortrunk.patch

Case description:
1. The split-hlog thread creates a writer for the file region A/recovered.edits/123456 and is appending log entries.
2. A regionserver is opening region A at the same time; in replayRecoveredEditsIfAny() it deletes the file region A/recovered.edits/123456.
3. The split-hlog thread catches the IO exception and stops parsing this log file. If skipErrors = true, it adds the file to the corrupt logs; however, data for other regions in this log file will be lost.
4. If skipErrors = false, it checks the filesystem. The filesystem is of course fine, so it only prints an error log and continues assigning regions; therefore data in other log files will be lost as well!

The case may happen as follows:
1. Move a region from server A to server B.
2. Kill server A and server B.
3. Restart server A and server B.

We could prevent this by forbidding deletion of a recovered.edits file that the split-hlog thread is still appending to.
[jira] [Updated] (HBASE-4880) Region is in service before completing openRegionHandler, may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4880: Attachment: hbase-4880.patch

The region is not in service until openRegionHandler completes successfully.

Region is in service before completing openRegionHandler, may cause data loss
Key: HBASE-4880 URL: https://issues.apache.org/jira/browse/HBASE-4880 Project: HBase Issue Type: Bug Reporter: chunhui shen Attachments: hbase-4880.patch

OpenRegionHandler in the regionserver proceeds through the following steps:
{code}
1. openRegion()  (after this, closed = false, closing = false)
2. addToOnlineRegions(region)
3. update the .META. table
4. update the ZK node state to RS_ZK_REGION_OPENED
{code}
The region is in service before step 4, which means a client can put data to the region after step 3. What happens if step 4 fails? OpenRegionHandler#cleanupFailedOpen will close the region, and the master will assign it to another regionserver. If closing the region fails, the data written between step 3 and step 4 may be lost, because the region will already have been opened on another regionserver and taken new writes; the lost edits cannot be recovered through replayRecoveredEdits(), since their LogSeqIds are smaller than the current region SeqId.
[jira] [Updated] (HBASE-4878) Master crash when splitting hlog may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4878: Attachment: hbase-4878v2.patch

@ramkrishna Done in patch v2. Thanks.

Master crash when splitting hlog may cause data loss
Key: HBASE-4878 URL: https://issues.apache.org/jira/browse/HBASE-4878 Project: HBase Issue Type: Bug Reporter: chunhui shen Attachments: hbase-4878.diff, hbase-4878v2.patch

Let's look at the code of HLogSplitter#splitLog(final FileStatus[] logfiles):
{code}
private List<Path> splitLog(final FileStatus[] logfiles) throws IOException {
  try {
    for (FileStatus log : logfiles) {
      parseHLog(in, logPath, entryBuffers, fs, conf, skipErrors);
    }
    archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf);
  } finally {
    status.setStatus("Finishing writing output logs and closing down.");
    splits = outputSink.finishWritingAndClose();
  }
}
{code}
If the master is killed after archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf) has finished but before splits = outputSink.finishWritingAndClose() completes, log data would be lost!
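The reordering the report implies could look like the sketch below (the attached patch may differ): archive the source hlogs only after the output sink has been flushed and closed, so that a master crash before that point leaves the logs in place to be re-split:
{code}
// Hedged sketch of the reordered method. If the master dies before
// archiveLogs() runs, the source hlogs survive and can be split again.
private List<Path> splitLog(final FileStatus[] logfiles) throws IOException {
  List<Path> splits;
  try {
    for (FileStatus log : logfiles) {
      parseHLog(in, logPath, entryBuffers, fs, conf, skipErrors);
    }
  } finally {
    status.setStatus("Finishing writing output logs and closing down.");
    splits = outputSink.finishWritingAndClose();
  }
  // Only after the recovered edits are safely on disk do we move the sources.
  archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf);
  return splits;
}
{code}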
[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4862: Attachment: hbase-4862v5fortrunk.diff hbase-4862v5for0.90.diff

Adds a test case in patch v5.

Splitting hlog and opening region concurrently may cause data loss
Key: HBASE-4862 URL: https://issues.apache.org/jira/browse/HBASE-4862 Project: HBase Issue Type: Bug Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4862.patch, 4862.txt, hbase-4862v1 for 0.90.diff, hbase-4862v1 for 0.90.diff, hbase-4862v1 for trunk.diff, hbase-4862v1 for trunk.diff, hbase-4862v2for0.90.diff, hbase-4862v2fortrunk.diff, hbase-4862v3for0.90.diff, hbase-4862v3fortrunk.diff, hbase-4862v5for0.90.diff, hbase-4862v5fortrunk.diff

Case description:
1. The split-hlog thread creates a writer for the file region A/recovered.edits/123456 and is appending log entries.
2. A regionserver is opening region A at the same time; in replayRecoveredEditsIfAny() it deletes the file region A/recovered.edits/123456.
3. The split-hlog thread catches the IO exception and stops parsing this log file. If skipErrors = true, it adds the file to the corrupt logs; however, data for other regions in this log file will be lost.
4. If skipErrors = false, it checks the filesystem. The filesystem is of course fine, so it only prints an error log and continues assigning regions; therefore data in other log files will be lost as well!

The case may happen as follows:
1. Move a region from server A to server B.
2. Kill server A and server B.
3. Restart server A and server B.

We could prevent this by forbidding deletion of a recovered.edits file that the split-hlog thread is still appending to.
[jira] [Updated] (HBASE-4878) Master crash when splitting hlog may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4878: Attachment: hbase-4878.diff

Master crash when splitting hlog may cause data loss
Key: HBASE-4878 URL: https://issues.apache.org/jira/browse/HBASE-4878 Project: HBase Issue Type: Bug Reporter: chunhui shen Attachments: hbase-4878.diff

Let's look at the code of HLogSplitter#splitLog(final FileStatus[] logfiles):
{code}
private List<Path> splitLog(final FileStatus[] logfiles) throws IOException {
  try {
    for (FileStatus log : logfiles) {
      parseHLog(in, logPath, entryBuffers, fs, conf, skipErrors);
    }
    archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf);
  } finally {
    status.setStatus("Finishing writing output logs and closing down.");
    splits = outputSink.finishWritingAndClose();
  }
}
{code}
If the master is killed after archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf) has finished but before splits = outputSink.finishWritingAndClose() completes, log data would be lost!
[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4862: Attachment: hbase-4862v1 for trunk.diff hbase-4862v1 for 0.90.diff

Grant license to ASF for the attached patch.

Splitting hlog and opening region concurrently may cause data loss
Key: HBASE-4862 URL: https://issues.apache.org/jira/browse/HBASE-4862 Project: HBase Issue Type: Bug Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4862.patch, 4862.txt, hbase-4862v1 for 0.90.diff, hbase-4862v1 for 0.90.diff, hbase-4862v1 for trunk.diff, hbase-4862v1 for trunk.diff

Case description:
1. The split-hlog thread creates a writer for the file region A/recovered.edits/123456 and is appending log entries.
2. A regionserver is opening region A at the same time; in replayRecoveredEditsIfAny() it deletes the file region A/recovered.edits/123456.
3. The split-hlog thread catches the IO exception and stops parsing this log file. If skipErrors = true, it adds the file to the corrupt logs; however, data for other regions in this log file will be lost.
4. If skipErrors = false, it checks the filesystem. The filesystem is of course fine, so it only prints an error log and continues assigning regions; therefore data in other log files will be lost as well!

The case may happen as follows:
1. Move a region from server A to server B.
2. Kill server A and server B.
3. Restart server A and server B.

We could prevent this by forbidding deletion of a recovered.edits file that the split-hlog thread is still appending to.
[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4862: Attachment: hbase-4862v2fortrunk.diff hbase-4862v2for0.90.diff

@Ted I have amended the patch. Please check.

Splitting hlog and opening region concurrently may cause data loss
Key: HBASE-4862 URL: https://issues.apache.org/jira/browse/HBASE-4862 Project: HBase Issue Type: Bug Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4862.patch, 4862.txt, hbase-4862v1 for 0.90.diff, hbase-4862v1 for 0.90.diff, hbase-4862v1 for trunk.diff, hbase-4862v1 for trunk.diff, hbase-4862v2for0.90.diff, hbase-4862v2fortrunk.diff

Case description:
1. The split-hlog thread creates a writer for the file region A/recovered.edits/123456 and is appending log entries.
2. A regionserver is opening region A at the same time; in replayRecoveredEditsIfAny() it deletes the file region A/recovered.edits/123456.
3. The split-hlog thread catches the IO exception and stops parsing this log file. If skipErrors = true, it adds the file to the corrupt logs; however, data for other regions in this log file will be lost.
4. If skipErrors = false, it checks the filesystem. The filesystem is of course fine, so it only prints an error log and continues assigning regions; therefore data in other log files will be lost as well!

The case may happen as follows:
1. Move a region from server A to server B.
2. Kill server A and server B.
3. Restart server A and server B.

We could prevent this by forbidding deletion of a recovered.edits file that the split-hlog thread is still appending to.
[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4862: Attachment: hbase-4862v3fortrunk.diff hbase-4862v3for0.90.diff

Splitting hlog and opening region concurrently may cause data loss
Key: HBASE-4862 URL: https://issues.apache.org/jira/browse/HBASE-4862 Project: HBase Issue Type: Bug Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4862.patch, 4862.txt, hbase-4862v1 for 0.90.diff, hbase-4862v1 for 0.90.diff, hbase-4862v1 for trunk.diff, hbase-4862v1 for trunk.diff, hbase-4862v2for0.90.diff, hbase-4862v2fortrunk.diff, hbase-4862v3for0.90.diff, hbase-4862v3fortrunk.diff

Case description:
1. The split-hlog thread creates a writer for the file region A/recovered.edits/123456 and is appending log entries.
2. A regionserver is opening region A at the same time; in replayRecoveredEditsIfAny() it deletes the file region A/recovered.edits/123456.
3. The split-hlog thread catches the IO exception and stops parsing this log file. If skipErrors = true, it adds the file to the corrupt logs; however, data for other regions in this log file will be lost.
4. If skipErrors = false, it checks the filesystem. The filesystem is of course fine, so it only prints an error log and continues assigning regions; therefore data in other log files will be lost as well!

The case may happen as follows:
1. Move a region from server A to server B.
2. Kill server A and server B.
3. Restart server A and server B.

We could prevent this by forbidding deletion of a recovered.edits file that the split-hlog thread is still appending to.
[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4862: Attachment: hbase-4862v1 for 0.90.diff

Splitting hlog and opening region concurrently may cause data loss
Key: HBASE-4862 URL: https://issues.apache.org/jira/browse/HBASE-4862 Project: HBase Issue Type: Bug Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4862.patch, hbase-4862v1 for 0.90.diff, hbase-4862v1 for trunk.diff

Case description:
1. The split-hlog thread creates a writer for the file region A/recovered.edits/123456 and is appending log entries.
2. A regionserver is opening region A at the same time; in replayRecoveredEditsIfAny() it deletes the file region A/recovered.edits/123456.
3. The split-hlog thread catches the IO exception and stops parsing this log file. If skipErrors = true, it adds the file to the corrupt logs; however, data for other regions in this log file will be lost.
4. If skipErrors = false, it checks the filesystem. The filesystem is of course fine, so it only prints an error log and continues assigning regions; therefore data in other log files will be lost as well!

The case may happen as follows:
1. Move a region from server A to server B.
2. Kill server A and server B.
3. Restart server A and server B.

We could prevent this by forbidding deletion of a recovered.edits file that the split-hlog thread is still appending to.
[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4862: Attachment: hbase-4862v1 for trunk.diff

Splitting hlog and opening region concurrently may cause data loss
Key: HBASE-4862 URL: https://issues.apache.org/jira/browse/HBASE-4862 Project: HBase Issue Type: Bug Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4862.patch, hbase-4862v1 for 0.90.diff, hbase-4862v1 for trunk.diff

Case description:
1. The split-hlog thread creates a writer for the file region A/recovered.edits/123456 and is appending log entries.
2. A regionserver is opening region A at the same time; in replayRecoveredEditsIfAny() it deletes the file region A/recovered.edits/123456.
3. The split-hlog thread catches the IO exception and stops parsing this log file. If skipErrors = true, it adds the file to the corrupt logs; however, data for other regions in this log file will be lost.
4. If skipErrors = false, it checks the filesystem. The filesystem is of course fine, so it only prints an error log and continues assigning regions; therefore data in other log files will be lost as well!

The case may happen as follows:
1. Move a region from server A to server B.
2. Kill server A and server B.
3. Restart server A and server B.

We could prevent this by forbidding deletion of a recovered.edits file that the split-hlog thread is still appending to.
[jira] [Updated] (HBASE-4862) Split hlog and open region currently happend may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4862: Attachment: 4862.patch

Split hlog: add the suffix .tmp to a file in the recovered.edits directory when creating it, and rename it without the suffix after close. replayRecoveredEditsIfAny: skip files whose names end with .tmp.

Split hlog and open region currently happend may cause data loss
Key: HBASE-4862 URL: https://issues.apache.org/jira/browse/HBASE-4862 Project: HBase Issue Type: Bug Affects Versions: 0.90.2 Reporter: chunhui shen Attachments: 4862.patch

Case description:
1. The split-hlog thread creates a writer for the file region A/recovered.edits/123456 and is appending log entries.
2. A regionserver is opening region A at the same time; in replayRecoveredEditsIfAny() it deletes the file region A/recovered.edits/123456.
3. The split-hlog thread catches the IO exception and stops parsing this log file. If skipErrors = true, it adds the file to the corrupt logs; however, data for other regions in this log file will be lost.
4. If skipErrors = false, it checks the filesystem. The filesystem is of course fine, so it only prints an error log and continues assigning regions; therefore data in other log files will be lost as well!

The case may happen as follows:
1. Move a region from server A to server B.
2. Kill server A and server B.
3. Restart server A and server B.

We could prevent this by forbidding deletion of a recovered.edits file that the split-hlog thread is still appending to.
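A minimal sketch of that scheme follows; createWriter and appendEntries are hypothetical stand-ins for the split thread's real writer plumbing, so see the patch for the actual implementation:
{code}
// Hedged sketch of the .tmp write-then-rename scheme described above.
Path finalPath = new Path(regionEditsDir, "123456");
Path tmpPath = new Path(regionEditsDir, finalPath.getName() + ".tmp");
HLog.Writer writer = createWriter(fs, tmpPath, conf); // split thread writes here
try {
  appendEntries(writer);                              // append recovered edits
} finally {
  writer.close();
}
fs.rename(tmpPath, finalPath);                        // published only when complete

// On the open-region side, replayRecoveredEditsIfAny() skips in-flight files:
if (edits.getName().endsWith(".tmp")) continue;
{code}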
[jira] [Updated] (HBASE-4862) Split hlog and open region concurrently happend may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chunhui shen updated HBASE-4862: Summary: Split hlog and open region concurrently happend may cause data loss (was: Split hlog and open region currently happend may cause data loss)

Split hlog and open region concurrently happend may cause data loss
Key: HBASE-4862 URL: https://issues.apache.org/jira/browse/HBASE-4862 Project: HBase Issue Type: Bug Affects Versions: 0.90.2 Reporter: chunhui shen Attachments: 4862.patch

Case description:
1. The split-hlog thread creates a writer for the file region A/recovered.edits/123456 and is appending log entries.
2. A regionserver is opening region A at the same time; in replayRecoveredEditsIfAny() it deletes the file region A/recovered.edits/123456.
3. The split-hlog thread catches the IO exception and stops parsing this log file. If skipErrors = true, it adds the file to the corrupt logs; however, data for other regions in this log file will be lost.
4. If skipErrors = false, it checks the filesystem. The filesystem is of course fine, so it only prints an error log and continues assigning regions; therefore data in other log files will be lost as well!

The case may happen as follows:
1. Move a region from server A to server B.
2. Kill server A and server B.
3. Restart server A and server B.

We could prevent this by forbidding deletion of a recovered.edits file that the split-hlog thread is still appending to.