[jira] [Created] (HBASE-5797) Add the active master address on the web of regionserver
Add the active master address on the web of regionserver
Key: HBASE-5797
URL: https://issues.apache.org/jira/browse/HBASE-5797
Project: HBase
Issue Type: Improvement
Reporter: chunhui shen
Assignee: chunhui shen

Add the active master's address to the regionserver web UI, so we can easily see the master's server name and jump from the regionserver page to the master page.
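A minimal sketch of what the page could render, assuming the regionserver already tracks the active master znode; the getMasterAddressTracker() accessor, the info port, and the master.jsp link target are illustrative assumptions, not the committed patch:
{code}
// Hedged sketch: render a link to the active master on the RS status page.
ServerName master = regionServer.getMasterAddressTracker().getMasterAddress();
if (master != null) {
  out.write("Master: <a href=\"http://" + master.getHostname()
      + ":60010/master.jsp\">" + master.getServerName() + "</a>");
}
{code}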
[jira] [Created] (HBASE-5689) Skipping RecoveredEdits may cause data loss
Skipping RecoveredEdits may cause data loss
Key: HBASE-5689
URL: https://issues.apache.org/jira/browse/HBASE-5689
Project: HBase
Issue Type: Bug
Components: regionserver
Reporter: chunhui shen
Assignee: chunhui shen

Consider the following scenario:
1. The region is on server A.
2. Put KV (r1->v1) to the region.
3. Move the region from server A to server B.
4. Put KV (r2->v2) to the region.
5. Move the region from server B to server A.
6. Put KV (r3->v3) to the region.
7. kill -9 server B and restart it.
8. kill -9 server A and restart it.
9. Scan the region: we get only two KVs (r1->v1, r2->v2); the third KV (r3->v3) is lost.

Let's analyse the above scenario in the code:
1. The edit logs of KV (r1->v1) and KV (r3->v3) are both recorded in the same hlog file on server A.
2. When we split server B's hlog file in the process of ServerShutdownHandler, we create one RecoveredEdits file f1 for the region.
3. When we split server A's hlog file in the process of ServerShutdownHandler, we create another RecoveredEdits file f2 for the region.
4. However, RecoveredEdits file f2 is skipped when the region is initialized, in HRegion#replayRecoveredEditsIfAny:
{code}
for (Path edits: files) {
  if (edits == null || !this.fs.exists(edits)) {
    LOG.warn("Null or non-existent edits file: " + edits);
    continue;
  }
  if (isZeroLengthThenDelete(this.fs, edits)) continue;

  if (checkSafeToSkip) {
    Path higher = files.higher(edits);
    long maxSeqId = Long.MAX_VALUE;
    if (higher != null) {
      // Edit file name pattern, HLog.EDITFILES_NAME_PATTERN: "-?[0-9]+"
      String fileName = higher.getName();
      maxSeqId = Math.abs(Long.parseLong(fileName));
    }

    if (maxSeqId <= minSeqId) {
      String msg = "Maximum possible sequenceid for this log is " + maxSeqId
          + ", skipped the whole file, path=" + edits;
      LOG.debug(msg);
      continue;
    } else {
      checkSafeToSkip = false;
    }
  }
{code}
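A hedged sketch of the safer direction: filter edit by edit on the sequence id instead of skipping a whole file based on its neighbour's name (reader API as in the 0.90-era HLog; an illustration, not the committed patch):
{code}
for (Path edits : files) {
  if (edits == null || !this.fs.exists(edits)) continue;
  if (isZeroLengthThenDelete(this.fs, edits)) continue;
  // Never skip a whole file based on files.higher(edits): a file created by
  // a later split can hold edits newer than the next file's starting seqid.
  HLog.Reader reader = HLog.getReader(this.fs, edits, conf);
  try {
    HLog.Entry entry;
    while ((entry = reader.next()) != null) {
      if (entry.getKey().getLogSeqNum() <= minSeqId) {
        continue; // already persisted by a flush; safe to drop this edit only
      }
      // replay the edit into the memstore here ...
    }
  } finally {
    reader.close();
  }
}
{code}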
[jira] [Created] (HBASE-5672) TestLruBlockCache#testBackgroundEvictionThread fails occasionally
TestLruBlockCache#testBackgroundEvictionThread fails occasionally
Key: HBASE-5672
URL: https://issues.apache.org/jira/browse/HBASE-5672
Project: HBase
Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen

We find that TestLruBlockCache#testBackgroundEvictionThread fails occasionally. I think it's a problem in the test case itself, because runEviction() only does evictionThread.evict():
{code}
public void evict() {
  synchronized (this) {
    this.notify(); // FindBugs NN_NAKED_NOTIFY
  }
}
{code}
However, when we call evictionThread.evict(), the eviction thread may not yet have reached the wait() inside run(), so the notify is lost and the eviction the test waits for never happens.
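One way to de-flake the test, sketched under the assumption that the eviction thread can expose an "entered run()" signal; the enteringRun flag and accessor are illustrative, not necessarily the committed patch:
{code}
static class EvictionThread extends Thread {
  private volatile boolean enteringRun = false;

  @Override
  public void run() {
    while (true) {
      synchronized (this) {
        enteringRun = true; // set while holding the monitor: once the test
        try {               // observes it, this wait() cannot miss the notify
          this.wait();
        } catch (InterruptedException e) {
          return;
        }
      }
      // do the actual eviction work here
    }
  }

  boolean isEnteringRun() {
    return enteringRun;
  }
}

// In testBackgroundEvictionThread, before calling runEviction():
while (!cache.getEvictionThread().isEnteringRun()) {
  Thread.sleep(1); // wait until the eviction thread can receive the notify
}
{code}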
[jira] [Created] (HBASE-5624) Aborting regionserver when splitting region may cause daughter region not assigned by ServerShutdownHandler
Aborting regionserver when splitting region may cause daughter region not assigned by ServerShutdownHandler
Key: HBASE-5624
URL: https://issues.apache.org/jira/browse/HBASE-5624
Project: HBase
Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen

If a region is splitting while the regionserver is stopping, the following code may be executed in SplitTransaction#openDaughters:
{code}
// TODO: Is this check needed here?
if (stopped || stopping) {
  // add 2nd daughter first (see HBASE-4335)
  MetaEditor.addDaughter(server.getCatalogTracker(), b.getRegionInfo(), null);
  MetaEditor.addDaughter(server.getCatalogTracker(), a.getRegionInfo(), null);
  LOG.info("Not opening daughters " +
      b.getRegionInfo().getRegionNameAsString() +
      " and " +
      a.getRegionInfo().getRegionNameAsString() +
      " because stopping=" + stopping + ", stopped=" + stopped);
}
{code}
So for the two daughter regions, the location is null in .META. When ServerShutdownHandler processes the dead server, it will not assign these two daughter regions, because their location (info:server) is null in .META. and MetaReader.getServerUserRegions() therefore never returns them.
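A hedged illustration (paraphrasing, not quoting, the 0.92-era .META. scan) of why the daughters are invisible to the shutdown handler: the server-name check can never match a row whose info:server was written as null:
{code}
// Inside the .META. scan that collects the dead server's regions:
byte[] server = result.getValue(HConstants.CATALOG_FAMILY,
    HConstants.SERVER_QUALIFIER);
if (server == null || server.length == 0) {
  // a daughter row written by MetaEditor.addDaughter(..., null) lands here:
  // it matches no dead server, so ServerShutdownHandler never assigns it
  return;
}
if (!deadServerName.equals(Bytes.toString(server))) {
  return; // hosted elsewhere; not ours to assign
}
{code}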
[jira] [Created] (HBASE-5571) Table will stay in disabling state forever
Table will stay in disabling state forever
Key: HBASE-5571
URL: https://issues.apache.org/jira/browse/HBASE-5571
Project: HBase
Issue Type: Bug
Components: master, regionserver
Reporter: chunhui shen
Assignee: chunhui shen

If we restart the master while it is disabling a table, the table will stay in the disabling state forever. In the current logic, the region CLOSE RPC will always return NotServingRegionException, because the RS already closed the region before we restarted the master. So the table is stuck disabling, because the region stays in RIT all along.

In another case, AssignmentManager#rebuildUserRegions() puts parent (pre-split) regions into AssignmentManager.regions, so when we disable the table we can't close these parent regions until CatalogJanitor purges them.
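A hedged sketch of the fix direction for the first case (method names follow the 0.92-era AssignmentManager but are illustrative, not the committed patch):
{code}
try {
  serverManager.sendRegionClose(server, region.getRegionInfo());
} catch (NotServingRegionException e) {
  // The RS already closed this region before the master restarted. Treat
  // the region as CLOSED so the disable can make progress instead of
  // leaving the region in RIT forever.
  LOG.debug("Region already closed on " + server + "; proceeding", e);
  // handle as a CLOSED event: delete the RIT node / mark the region offline
}
{code}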
[jira] [Created] (HBASE-5568) Multiple concurrent flushcache() calls for one region could cause data loss
Multiple concurrent flushcache() calls for one region could cause data loss
Key: HBASE-5568
URL: https://issues.apache.org/jira/browse/HBASE-5568
Project: HBase
Issue Type: Bug
Components: regionserver
Reporter: chunhui shen
Assignee: chunhui shen

HRegion#flushcache() can now be called concurrently, through HRegionServer#splitRegion or HRegionServer#flushRegion via HBaseAdmin. However, we find that if HRegion#internalFlushcache() is called concurrently by multiple threads, HRegion.memstoreSize is calculated wrongly. At the end of HRegion#internalFlushcache() we do this.addAndGetGlobalMemstoreSize(-flushsize), but flushsize may not be the actual memstore size that was flushed to HDFS. This can make HRegion.memstoreSize negative, which prevents the final flush when we close the region.

Logs in the RS for region e9d827913a056e696c39bc569ea3f99f:

2012-03-11 16:31:36,690 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 128.0m
2012-03-11 16:31:37,999 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/8162481165586107427, entries=153106, sequenceid=619316544, memsize=59.6m, filesize=31.2m
2012-03-11 16:31:38,830 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 134.8m
2012-03-11 16:31:39,458 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/3425971951499794221, entries=230183, sequenceid=619316544, memsize=68.5m, filesize=26.6m
2012-03-11 16:31:39,459 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~128.1m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 2769ms, sequenceid=619316544, compaction requested=false
2012-03-11 16:31:39,459 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 6.8m
2012-03-11 16:31:39,529 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/1811012969998104626, entries=8002, sequenceid=619332759, memsize=3.1m, filesize=1.6m
2012-03-11 16:31:39,640 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/770333473623552048, entries=12231, sequenceid=619332759, memsize=3.6m, filesize=1.4m
2012-03-11 16:31:39,641 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~134.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 811ms, sequenceid=619332759, compaction requested=true
2012-03-11 16:31:39,707 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/5656568849587368557, entries=119, sequenceid=619332979, memsize=47.4k, filesize=25.6k
2012-03-11 16:31:39,775 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/794343845650987521, entries=157, sequenceid=619332979, memsize=47.8k, filesize=19.3k
2012-03-11 16:31:39,777 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~6.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 318ms, sequenceid=619332979, compaction requested=true
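A minimal sketch of one way to keep the accounting consistent, assuming a dedicated per-region lock; flushLock is a hypothetical field, and the committed change may instead fix the size bookkeeping itself:
{code}
// Hypothetical per-HRegion field, for illustration only.
private final Object flushLock = new Object();

protected boolean flushcache() throws IOException {
  // With flushes serialized per region, the memstore size snapshot taken at
  // the start of internalFlushcache() is exactly what gets flushed, so
  // addAndGetGlobalMemstoreSize(-flushsize) can no longer double-subtract.
  synchronized (flushLock) {
    return internalFlushcache();
  }
}
{code}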
[jira] [Created] (HBASE-5563) HRegionInfo#compareTo should add the comparison of regionId
HRegionInfo#compareTo should add the comparison of regionId
Key: HBASE-5563
URL: https://issues.apache.org/jira/browse/HBASE-5563
Project: HBase
Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen

In the case where one region gets multiply assigned, we can find two regions with the same table name, the same startKey and the same endKey, but different regionIds. Because compareTo ignores the regionId, these two regions are the same key in a TreeMap but different keys in a HashMap.
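A self-contained demonstration (plain Java, not HBase code) of why a comparator that ignores a field used by hashCode() makes TreeMap and HashMap disagree, and how a regionId tiebreaker fixes it:
{code}
import java.util.HashMap;
import java.util.TreeMap;

public class CompareToDemo {
  static class Region implements Comparable<Region> {
    final String table, startKey, endKey;
    final long regionId;
    Region(String t, String s, String e, long id) {
      table = t; startKey = s; endKey = e; regionId = id;
    }
    public int compareTo(Region o) {
      int c = table.compareTo(o.table);
      if (c != 0) return c;
      c = startKey.compareTo(o.startKey);
      if (c != 0) return c;
      c = endKey.compareTo(o.endKey);
      if (c != 0) return c;
      // the missing tiebreaker: without it, the two regions built in main()
      // compare as equal and a TreeMap keeps only one of them
      return regionId < o.regionId ? -1 : (regionId > o.regionId ? 1 : 0);
    }
    public boolean equals(Object o) {
      return o instanceof Region && compareTo((Region) o) == 0;
    }
    public int hashCode() {
      return (table + startKey + endKey + regionId).hashCode();
    }
  }

  public static void main(String[] args) {
    Region a = new Region("t", "a", "b", 1L);
    Region b = new Region("t", "a", "b", 2L); // same range, new regionId
    TreeMap<Region, Integer> tree = new TreeMap<Region, Integer>();
    HashMap<Region, Integer> hash = new HashMap<Region, Integer>();
    tree.put(a, 1); tree.put(b, 2);
    hash.put(a, 1); hash.put(b, 2);
    // With the regionId tiebreaker both print 2. Drop it from compareTo()
    // and the TreeMap silently collapses the two regions to one entry,
    // while the HashMap (whose hashCode uses regionId) still keeps both.
    System.out.println(tree.size() + " vs " + hash.size());
  }
}
{code}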
[jira] [Created] (HBASE-5528) Retry splitting log if failed in the process of ServerShutdownHandler, and abort master when retries exhausted
Retry splitting log if failed in the process of ServerShutdownHandler, and abort master when retries exhausted
Key: HBASE-5528
URL: https://issues.apache.org/jira/browse/HBASE-5528
Project: HBase
Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
Attachments: hbase-5528.patch

We retry splitting the log in splitLogAfterStartup if it fails while the master is starting. However, there is no retry for failed log splitting in the process of ServerShutdownHandler. Also, if we finally fail to split the log, we should abort the master even if the filesystem is OK, to prevent data loss.
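A hedged sketch of the retry policy inside ServerShutdownHandler#process(); the retry count and the configuration key are illustrative, not committed defaults:
{code}
int maxRetries = conf.getInt("hbase.master.splitlog.retries", 3); // hypothetical key
for (int i = 1; ; i++) {
  try {
    this.services.getMasterFileSystem().splitLog(serverName);
    break;
  } catch (IOException e) {
    if (i >= maxRetries) {
      // The filesystem may still look healthy, but giving up silently would
      // assign regions without their recovered edits: abort instead.
      this.server.abort("Failed " + i + " attempts to split log for "
          + serverName, e);
      throw e;
    }
    LOG.warn("Split log attempt " + i + " failed for " + serverName
        + "; retrying", e);
  }
}
{code}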
[jira] [Created] (HBASE-5501) Handle potential data loss due to concurrent processing of processFailover and ServerShutdownHandler
Handle potential data loss due to concurrent processing of processFailover and ServerShutdownHandler
Key: HBASE-5501
URL: https://issues.apache.org/jira/browse/HBASE-5501
Project: HBase
Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen

In a live cluster, we do the following steps:
1. Kill the master.
2. Start the master; the master is initializing.
3. The master completes splitLog.
4. Kill the META server.
5. The master starts assigning ROOT and META.
6. Now the meta region's data can be lost, since we may assign the meta region before the ServerShutdownHandler finishes splitting the dead META server's log.
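A hedged sketch of the ordering guard; areDeadServersInProgress() and the polling loop illustrate the synchronization point, they are not necessarily the committed patch:
{code}
while (this.serverManager.areDeadServersInProgress()) {
  // A ServerShutdownHandler is still splitting a dead server's logs; the
  // dead server may be the one that carried ROOT/META, so don't assign yet.
  Thread.sleep(100);
}
assignRootAndMeta(status);
{code}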
[jira] [Created] (HBASE-5454) Refuse operations from Admin before master is initialized
Refuse operations from Admin before master is initialized
Key: HBASE-5454
URL: https://issues.apache.org/jira/browse/HBASE-5454
Project: HBase
Issue Type: Improvement
Reporter: chunhui shen

In our testing environment, when the master is initializing, we found conflicts between master#assignAllUserRegions and an EnableTable event, causing region assignment to throw an exception so that the master aborts itself. We think we had better refuse operations from Admin, such as CreateTable, EnableTable, etc., until initialization completes; it would reduce errors.
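A hedged sketch of the proposed guard; the checkInitialized() helper is illustrative, while PleaseHoldException already exists for "master is starting" situations:
{code}
private void checkInitialized() throws PleaseHoldException {
  if (!this.initialized) {
    throw new PleaseHoldException("Master is initializing");
  }
}

public void createTable(HTableDescriptor desc, byte[][] splitKeys)
    throws IOException {
  checkInitialized(); // refuse CreateTable/EnableTable/... until ready
  // ... existing createTable logic ...
}
{code}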
[jira] [Created] (HBASE-5423) Regionserver may block on waitOnAllRegionsToClose when aborting
Regionserver may block on waitOnAllRegionsToClose when aborting
Key: HBASE-5423
URL: https://issues.apache.org/jira/browse/HBASE-5423
Project: HBase
Issue Type: Bug
Components: regionserver
Reporter: chunhui shen
Assignee: chunhui shen

If closeRegion throws any exception (it could be caused by the FS) while the RS is aborting, the RS will block forever in waitOnAllRegionsToClose().
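A hedged sketch: bound the wait during an abort so that a region whose close already failed cannot hang the shutdown; the timeout value is illustrative:
{code}
private void waitOnAllRegionsToClose(final boolean abort) {
  long deadline = System.currentTimeMillis() + 60 * 1000; // hypothetical bound
  while (!this.onlineRegions.isEmpty()) {
    if (abort && System.currentTimeMillis() > deadline) {
      LOG.warn("Aborting with " + this.onlineRegions.size()
          + " regions still online; a close must have failed");
      break; // don't hang the abort on regions whose close threw (e.g. FS gone)
    }
    Threads.sleep(200);
  }
}
{code}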
[jira] [Created] (HBASE-5422) StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins)
StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins)
Key: HBASE-5422
URL: https://issues.apache.org/jira/browse/HBASE-5422
Project: HBase
Issue Type: Bug
Components: master
Reporter: chunhui shen

In our production environment we find a lot of RIT timeouts when the cluster comes up; there are about 70,000 regions in the cluster (25 regionservers). First, we see the following logs (follow region 33cf229845b1009aa8a3f7b0f85c9bd0).

Master's log:

2012-02-13 18:07:41,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x348f4a94723da5 Async create of unassigned node for 33cf229845b1009aa8a3f7b0f85c9bd0 with OFFLINE state
2012-02-13 18:07:42,560 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback: rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=OFFLINE, ts=1329127661409, server=r03f11025.yh.aliyun.com,60020,1329127549907
2012-02-13 18:07:42,996 DEBUG org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback: rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=OFFLINE, ts=1329127661409
2012-02-13 18:10:48,072 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=PENDING_OPEN, ts=1329127662996
2012-02-13 18:10:48,072 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
2012-02-13 18:11:16,744 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=r03f11025.yh.aliyun.com,60020,1329127549907, region=33cf229845b1009aa8a3f7b0f85c9bd0
2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 33cf229845b1009aa8a3f7b0f85c9bd0; deleting unassigned node
2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x348f4a94723da5 Deleting existing unassigned node for 33cf229845b1009aa8a3f7b0f85c9bd0 that is in expected state RS_ZK_REGION_OPENED
2012-02-13 18:38:07,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x348f4a94723da5 Successfully deleted unassigned node for region 33cf229845b1009aa8a3f7b0f85c9bd0 in expected state RS_ZK_REGION_OPENED
2012-02-13 18:38:07,573 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. on r03f11025.yh.aliyun.com,60020,1329127549907
2012-02-13 18:50:54,428 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. so generated a random one; hri=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0., src=, dest=r01b05043.yh.aliyun.com,60020,1329127549041; 29 (online=29, exclude=null) available servers
2012-02-13 18:50:54,428 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. to r01b05043.yh.aliyun.com,60020,1329127549041
2012-02-13 19:31:50,514 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. state=PENDING_OPEN, ts=1329132528086
2012-02-13 19:31:50,514 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.

Regionserver's log:

2012-02-13 18:07:43,537 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.
2012-02-13 18:11:16,560 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0.

From the RS's log we can see that more than 3 minutes pass between receiving the openRegion request and starting to process it, which causes the RIT timeout for this region on the master.

Looking at the code of StartupBulkAssigner, we find that regionPlans are not added when assigning the regions; therefore, when one region is opened, it does not updateTimers for the other regions whose destination is the same server.
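A hedged sketch of the fix direction (0.92-era names; illustrative, not the committed patch): record a plan for every region handed to the bulk assigner, so that each OPENED event can refresh the timers of the regions still pending on the same server:
{code}
synchronized (this.regionPlans) {
  for (HRegionInfo region : regions) {
    this.regionPlans.put(region.getEncodedName(),
        new RegionPlan(region, null, destination));
  }
}
// AssignmentManager#updateTimers(destination) then finds these plans and
// bumps the RIT timestamps instead of letting the 3-minute timeout fire.
{code}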
[jira] [Created] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss
Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss
Key: HBASE-5179
URL: https://issues.apache.org/jira/browse/HBASE-5179
Project: HBase
Issue Type: Bug
Components: master
Reporter: chunhui shen
Assignee: chunhui shen

If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may appear:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and a ServerShutdownHandler is processing it.
3. The master starts rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA, because it is a dead server per step 3.
However, while step 4 (assigning regions) runs, the ServerShutdownHandler may still be splitting the log; therefore data may be lost.
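A hedged sketch of the fix direction inside the failover rebuild loop; the deadServers set and the isLogSplittingDoneFor() helper are illustrative assumptions:
{code}
// For each region row read back from .META. during rebuildUserRegions():
ServerName location = region.getSecond(); // last known hosting server
if (location != null && deadServers.contains(location)
    && !isLogSplittingDoneFor(location)) {
  // The ServerShutdownHandler is still splitting this server's logs; let it
  // assign the region afterwards instead of racing it here (step 4 above).
  continue;
}
{code}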
[jira] [Created] (HBASE-5165) Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause a deleted region to be assigned again
Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause a deleted region to be assigned again
Key: HBASE-5165
URL: https://issues.apache.org/jira/browse/HBASE-5165
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.4
Reporter: chunhui shen

Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause the following situation:
1. The table has already been disabled.
2. ServerShutdownHandler is doing MetaReader.getServerUserRegions.
3. While step 2 is processing, or just after it completes, DeleteTableHandler starts to delete the regions (removing them from .META. and deleting them from the FS).
4. DeleteTableHandler sets the table enabled.
5. ServerShutdownHandler starts to assign a region that has already been deleted by DeleteTableHandler.
The result of the above operations is an invalid record in .META. that can't be fixed by hbck.
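A hedged sketch of a guard in the ServerShutdownHandler assign loop (the surrounding loop over the dead server's regions is assumed; only the re-check against .META. is the point):
{code}
Pair<HRegionInfo, ServerName> p =
    MetaReader.getRegion(this.server.getCatalogTracker(), hri.getRegionName());
if (p == null) {
  LOG.info("Region " + hri.getRegionNameAsString()
      + " was deleted from .META. while this server was being processed;"
      + " skipping assignment");
  continue;
}
this.services.getAssignmentManager().assign(hri, true);
{code}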
[jira] [Created] (HBASE-5152) Region is on service before completing initialization when doing rollback of split, and it will affect read correctness
Region is on service before completing initialization when doing rollback of split, and it will affect read correctness
Key: HBASE-5152
URL: https://issues.apache.org/jira/browse/HBASE-5152
Project: HBase
Issue Type: Bug
Reporter: chunhui shen
[jira] [Created] (HBASE-5121) MajorCompaction may affect scan's correctness
MajorCompaction may affect scan's correctness
Key: HBASE-5121
URL: https://issues.apache.org/jira/browse/HBASE-5121
Project: HBase
Issue Type: Bug
Components: regionserver
Reporter: chunhui shen

In our test there are two families' KeyValues for one row, and we find an infrequent problem in scan's next() if a major compaction happens concurrently. In the client's two consecutive calls to scan.next():
1. The first next() returns a result where family A is null.
2. The second next() returns a result where family B is null.
The two results have the same row. If there are more families, the scenario can get even stranger.
We find the reason is that storescanner.peek() changes after a major compaction when there are delete-type KeyValues. This change means the PriorityQueue backing the RegionScanner's heap is no longer guaranteed to be sorted.
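A hedged illustration of why the heap goes stale and what a repair would look like (not the committed patch): a PriorityQueue orders an element only when it is inserted, so a scanner whose peek() changed in place must be re-inserted:
{code}
// When a StoreScanner's peek() changes after a major compaction swapped its
// files, the KeyValueHeap must re-insert that scanner before the region-level
// ordering can be trusted again.
KeyValueScanner scanner = this.heap.poll(); // remove under the stale ordering
// ... the store scanner repositions itself on the compacted files ...
this.heap.add(scanner);                     // re-heapify with the new peek()
{code}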
[jira] [Created] (HBASE-5100) Rollback of split would cause closed region to be reopened
Rollback of split would cause closed region to be reopened
Key: HBASE-5100
URL: https://issues.apache.org/jira/browse/HBASE-5100
Project: HBase
Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen

If the master sends a close-region request to the RS while the region's split transaction is running concurrently, a closed region may be reopened. See the code in SplitTransaction#createDaughters:
{code}
List<StoreFile> hstoreFilesToSplit = null;
try {
  hstoreFilesToSplit = this.parent.close(false);
  if (hstoreFilesToSplit == null) {
    // The region was closed by a concurrent thread. We can't continue
    // with the split, instead we must just abandon the split. If we
    // reopen or split this could cause problems because the region has
    // probably already been moved to a different server, or is in the
    // process of moving to a different server.
    throw new IOException("Failed to close region: already closed by " +
        "another thread");
  }
} finally {
  this.journal.add(JournalEntry.CLOSED_PARENT_REGION);
}
{code}
When rolling back, the JournalEntry.CLOSED_PARENT_REGION entry causes this.parent.initialize() to run. Although the region is not onlined on the regionserver, it may bring potential problems. For example, in our environment the closed parent region was rolled back successfully, and then started compacting and splitting again. The parent region is f892dd6107b6b4130199582abc78e9c1.

Master log:

2011-12-26 00:24:42,693 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1., src=dw87.kgb.sqa.cm4,60020,1324827866085, dest=dw80.kgb.sqa.cm4,60020,1324827865780
2011-12-26 00:24:42,693 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. (offlining)
2011-12-26 00:24:42,694 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=dw87.kgb.sqa.cm4,60020,1324827866085, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1.
2011-12-26 00:24:42,699 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase-tbfs/unassigned/f892dd6107b6b4130199582abc78e9c1 (region=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1., server=dw87.kgb.sqa.cm4,60020,1324827866085, state=RS_ZK_REGION_CLOSING)
2011-12-26 00:24:42,699 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSING, server=dw87.kgb.sqa.cm4,60020,1324827866085, region=f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,348 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=dw87.kgb.sqa.cm4,60020,1324827866085, region=f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. state=CLOSED, ts=1324830285347
2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x13447f283f40e73 Creating (or updating) unassigned node for f892dd6107b6b4130199582abc78e9c1 with OFFLINE state
2011-12-26 00:24:45,354 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=dw75.kgb.sqa.cm4:6, region=f892dd6107b6b4130199582abc78e9c1
2011-12-26 00:24:45,354 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1. destination server is + serverName=dw80.kgb.sqa.cm4,60020,1324827865780, load=(requests=0, regions=1, usedHeap=0, maxHeap=0)
2011-12-26 00:24:45,354 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1.; plan=hri=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883
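A hedged sketch of the fix direction (the parentClosedByMaster flag is hypothetical, for illustration): only re-open the parent on rollback when this split transaction owned the close; if the master closed the region concurrently, re-initializing it resurrects a region that is already being assigned elsewhere:
{code}
case CLOSED_PARENT_REGION:
  if (!this.parentClosedByMaster) {
    this.parent.initialize();
  }
  break;
{code}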
[jira] [Created] (HBASE-5020) MetaReader#fullScan doesn't stop scanning when visitor returns false in 0.90 version
MetaReader#fullScan doesn't stop scanning when visitor returns false in 0.90 version
Key: HBASE-5020
URL: https://issues.apache.org/jira/browse/HBASE-5020
Project: HBase
Issue Type: Bug
Reporter: chunhui shen
Attachments: hbase-5020.patch

In the current 0.90 code:
{code}
public static void fullScan(CatalogTracker catalogTracker,
    final Visitor visitor, final byte [] startrow)
throws IOException {
  HRegionInterface metaServer =
    catalogTracker.waitForMetaServerConnectionDefault();
  Scan scan = new Scan();
  if (startrow != null) scan.setStartRow(startrow);
  scan.addFamily(HConstants.CATALOG_FAMILY);
  long scannerid = metaServer.openScanner(
      HRegionInfo.FIRST_META_REGIONINFO.getRegionName(), scan);
  try {
    Result data;
    while ((data = metaServer.next(scannerid)) != null) {
      if (!data.isEmpty()) visitor.visit(data);
    }
  } finally {
    metaServer.close(scannerid);
  }
  return;
}
{code}
If visitor.visit(data) returns false, the scan does not stop. However, that is not what the Visitor contract describes:
{code}
public interface Visitor {
  /**
   * Visit the catalog table row.
   * @param r A row from catalog table
   * @return True if we are to proceed scanning the table, else false if
   * we are to stop now.
   */
  public boolean visit(final Result r) throws IOException;
}
{code}
I think this is an oversight; trunk does not have this hole.
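The fix is a one-line change to honour the return value, as trunk already does; a sketch of the corrected loop:
{code}
Result data;
while ((data = metaServer.next(scannerid)) != null) {
  if (data.isEmpty()) continue;
  if (!visitor.visit(data)) break; // stop scanning, as the contract promises
}
{code}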
[jira] [Created] (HBASE-4988) MetaServer crash causes all splitting regionservers to abort
MetaServer crash causes all splitting regionservers to abort
Key: HBASE-4988
URL: https://issues.apache.org/jira/browse/HBASE-4988
Project: HBase
Issue Type: Bug
Reporter: chunhui shen

If the meta server crashes right now, all the splitting regionservers will abort themselves, because of this code:
{code}
this.journal.add(JournalEntry.PONR);
MetaEditor.offlineParentInMeta(server.getCatalogTracker(),
  this.parent.getRegionInfo(), a.getRegionInfo(), b.getRegionInfo());
{code}
Once the journal contains the PONR (point of no return) entry, the split's rollback aborts the regionserver itself. That is terrible in a heavy-write environment when the meta server crashes.
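A hedged sketch of one mitigation (retry count illustrative, not a committed fix): ride over a transient meta-server crash instead of failing past the PONR and aborting:
{code}
IOException lastException = null;
for (int i = 0; i < 3; i++) {
  try {
    MetaEditor.offlineParentInMeta(server.getCatalogTracker(),
        this.parent.getRegionInfo(), a.getRegionInfo(), b.getRegionInfo());
    lastException = null;
    break;
  } catch (IOException e) {
    lastException = e;
    LOG.warn("offlineParentInMeta failed; meta server may be recovering", e);
  }
}
if (lastException != null) {
  throw lastException; // only now do we take the PONR consequences
}
{code}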
[jira] [Created] (HBASE-4899) Region would easily be assigned twice with continual server killing and region moving in a testing environment
Region would easily be assigned twice with continual server killing and region moving in a testing environment
Key: HBASE-4899
URL: https://issues.apache.org/jira/browse/HBASE-4899
Project: HBase
Issue Type: Bug
Reporter: chunhui shen

Before assigning a region in ServerShutdownHandler#process, it checks whether the region is in RIT; however, this check doesn't work as expected in the following case:
1. Move region A from server B to server C.
2. Kill server B.
3. Start server B immediately.

Let's see what happens in the code for the above case:
{code}
for step 1:
1.1 server B closes region A
1.2 master setOffline for region A
    (AssignmentManager#setOffline: this.regions.remove(regionInfo))
1.3 server C starts to open region A (not completed)

for step 3:
master ServerShutdownHandler#process() for server B {
  ...
  splitlog()
  ...
  List regionsInTransition =
      this.services.getAssignmentManager().processServerShutdown(this.serverName);
  ...
  Skip regions that were in transition unless CLOSING or PENDING_CLOSE
  ...
  assign region
}
{code}
In fact, when ServerShutdownHandler#process() runs this.services.getAssignmentManager().processServerShutdown(this.serverName), region A is in RIT (step 1.3 is not completed), but the returned regionsInTransition list doesn't contain it, because region A was removed from AssignmentManager.regions by AssignmentManager#setOffline in step 1.2. Therefore region A will be assigned twice.

Actually, one server killed and started twice will also easily cause a region to be assigned twice. Besides the above reason, there is another possibility: when ServerShutdownHandler#process()#MetaReader.getServerUserRegions executes, a region is included which is in RIT at that moment; but after MetaReader.getServerUserRegions completes, the region has been opened on another server and is no longer in RIT.

In our testing environment, where balancing, moving and killing are executed periodically, assigning a region twice happens often, and it is nasty because it affects other test cases.
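A hedged sketch of a stronger check (0.92-era field names; illustrative): consult the live RIT map keyed by encoded region name instead of AssignmentManager.regions, which setOffline() already pruned in step 1.2:
{code}
synchronized (this.regionsInTransition) {
  RegionState state = this.regionsInTransition.get(hri.getEncodedName());
  if (state != null && !state.isClosing() && !state.isPendingClose()) {
    // region A is still opening on server C (step 1.3): skip it here,
    // otherwise ServerShutdownHandler assigns it a second time
    return;
  }
}
{code}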
[jira] [Created] (HBASE-4881) Unhealthy region is on service caused by rollback of region splitting
Unhealthy region is on service caused by rollback of region splitting
Key: HBASE-4881
URL: https://issues.apache.org/jira/browse/HBASE-4881
Project: HBase
Issue Type: Bug
Reporter: chunhui shen

If region splitting fails in the state JournalEntry.CLOSED_PARENT_REGION, it is rolled back with the following steps:
{code}
1. case CLOSED_PARENT_REGION:
     this.parent.initialize();
     break;
2. case CREATE_SPLIT_DIR:
     this.parent.writestate.writesEnabled = true;
     cleanupSplitDir(fs, this.splitdir);
     break;
3. case SET_SPLITTING_IN_ZK:
     if (server != null && server.getZooKeeper() != null) {
       cleanZK(server, this.parent.getRegionInfo());
     }
     break;
{code}
If this.parent.initialize() throws an IOException in step 1, the rollback only checks the filesystem, and if the check is OK it does nothing else. However, the parent region is on service now, half-initialized.
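A hedged sketch (not necessarily the committed fix): if re-opening the parent fails, do not leave a half-initialized region on service; surfacing the failure, here by aborting, is safer:
{code}
case CLOSED_PARENT_REGION:
  try {
    this.parent.initialize();
  } catch (IOException e) {
    LOG.error("Rollback of CLOSED_PARENT_REGION failed", e);
    server.abort("Unrecoverable split rollback of "
        + this.parent.getRegionNameAsString(), e);
    throw e;
  }
  break;
{code}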
[jira] [Created] (HBASE-4880) Region is on service before completing openRegionHandler, which may cause data loss
Region is on service before completing openRegionHandler, which may cause data loss
Key: HBASE-4880
URL: https://issues.apache.org/jira/browse/HBASE-4880
Project: HBase
Issue Type: Bug
Reporter: chunhui shen

OpenRegionHandler in the regionserver is processed in the following steps:
{code}
1. openregion() (through it, closed = false, closing = false)
2. addToOnlineRegions(region)
3. update the .META. table
4. update ZK's node state to RS_ZK_REGION_OPENED
{code}
We can see that the region is on service before step 4, which means clients can put data to this region after step 3. What happens if step 4 fails? OpenRegionHandler#cleanupFailedOpen executes, which closes the region, and the master assigns this region to another regionserver. If closing the region fails, the data put between step 3 and step 4 may be lost, because the region has been opened on another regionserver and received new data there; the edits may then not be recoverable through replayRecoveredEdits(), because their LogSeqId is smaller than the current region SeqId.
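A hedged sketch of the reordering idea (helper names are illustrative): make the region visible to clients only after every external state update has succeeded, so no edit can land in a window that a failed open would roll back:
{code}
HRegion region = openRegion();            // step 1: internal open only
updateMeta(region);                       // step 3: .META. row
if (!transitionToOpenedInZK(region)) {    // step 4: RS_ZK_REGION_OPENED
  cleanupFailedOpen(region);              // nothing was served yet: no loss
  return;
}
addToOnlineRegions(region);               // step 2, moved last: start serving
{code}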
[jira] [Created] (HBASE-4878) Master crash when splitting hlog may cause data loss
Master crash when splitting hlog may cause data loss
Key: HBASE-4878
URL: https://issues.apache.org/jira/browse/HBASE-4878
Project: HBase
Issue Type: Bug
Reporter: chunhui shen

Let's see the code of HLogSplitter#splitLog(final FileStatus[] logfiles):
{code}
private List<Path> splitLog(final FileStatus[] logfiles) throws IOException {
  try {
    for (FileStatus log : logfiles) {
      parseHLog(in, logPath, entryBuffers, fs, conf, skipErrors);
    }
    archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf);
  } finally {
    status.setStatus("Finishing writing output logs and closing down.");
    splits = outputSink.finishWritingAndClose();
  }
}
{code}
If the master is killed after archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf) finishes, but before splits = outputSink.finishWritingAndClose() finishes, log data will be lost!
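A hedged sketch of the safe ordering (same loose fragment style as the snippet above; not the committed patch): persist the recovered-edits output before archiving the source logs, so a crash in between leaves the originals in place for a re-split:
{code}
for (FileStatus log : logfiles) {
  parseHLog(in, logPath, entryBuffers, fs, conf, skipErrors);
}
status.setStatus("Finishing writing output logs and closing down.");
splits = outputSink.finishWritingAndClose();  // 1. flush and close the splits
archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf); // 2. then archive
{code}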
[jira] [Created] (HBASE-4862) Split hlog and open region happening concurrently may cause data loss
Split hlog and open region happening concurrently may cause data loss
Key: HBASE-4862
URL: https://issues.apache.org/jira/browse/HBASE-4862
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.2
Reporter: chunhui shen

Case description:
1. The split-hlog thread creates a writer for the file regionA/recovered.edits/123456 and is appending log entries to it.
2. The regionserver is opening region A now; in the process of replayRecoveredEditsIfAny() it deletes the file regionA/recovered.edits/123456.
3. The split-hlog thread catches the IOException and stops parsing this log file; if skipErrors = true, it adds the file to the corrupt logs. However, the data for the other regions in this log file is then lost.
4. Or, if skipErrors = false, it checks the filesystem. Of course the filesystem is OK, so it only prints an error log and continues assigning regions. Therefore the data in the other log files is also lost!

This case can happen as follows:
1. Move a region from server A to server B.
2. Kill server A and server B.
3. Restart server A and server B.

We could prevent this by forbidding deletion of a recovered.edits file that the split-hlog thread is still appending to.
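A hedged sketch of that guard (the ".temp" suffix is illustrative): write recovered edits under a temporary name and rename on close, so the region-open path never sees, or deletes, a half-written file:
{code}
Path tmp = new Path(regionEditsDir, fileName + ".temp");
HLog.Writer writer = HLog.createWriter(fs, tmp, conf);
try {
  // append the split entries for this region here, e.g.:
  // writer.append(entry);
} finally {
  writer.close();
}
if (!fs.rename(tmp, new Path(regionEditsDir, fileName))) {
  throw new IOException("Failed rename of " + tmp);
}
// replayRecoveredEditsIfAny() in turn ignores files still ending in ".temp"
{code}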