[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan updated HBASE-5120: -- Attachment: HBASE-5120_4.patch Timeout monitor races with table disable handler Key: HBASE-5120 URL: https://issues.apache.org/jira/browse/HBASE-5120 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Zhihong Yu Priority: Blocker Fix For: 0.94.0, 0.92.1 Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch Here is what J-D described here: https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176 I think I will retract from my statement that it used to be extremely racy and caused more troubles than it fixed, on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it. First the region gets closed {quote} 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. {quote} 2 minutes later it times out: {quote} 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
(offlining) {quote} 100ms later the master finally gets the event: {quote} 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED {quote} At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign? {quote} 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791 {quote} Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on. 
The late ZK notification that the znode was deleted (but it got recreated after): {quote} 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted. {quote} Then it prints this, and much later tries to unassign it again: {quote} 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null ... 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328,
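The log sequence above reads as a check-then-act race: the timeout monitor decides to force an unassign while the region is still PENDING_CLOSE, the CLOSED event is then handled and the region leaves regions in transition, and only afterwards does the monitor act on its now stale decision. Below is a minimal Java sketch of one way to guard against that, assuming a hypothetical in-memory map of regions in transition. The class and method names are illustrative only; they are not taken from AssignmentManager or from the attached patches.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch of a guarded timeout action; not HBase's AssignmentManager. */
class RegionsInTransitionSketch {
  enum State { PENDING_CLOSE, CLOSING }

  // Assumed stand-in for the master's regions-in-transition bookkeeping.
  private final ConcurrentMap<String, State> regionsInTransition =
      new ConcurrentHashMap<>();

  /** The master sent CLOSE to the region server. */
  void sentClose(String region) {
    regionsInTransition.put(region, State.PENDING_CLOSE);
  }

  /** The CLOSED event was handled for a disabling table: the region leaves RIT. */
  void handleClosed(String region) {
    regionsInTransition.remove(region);
  }

  /**
   * Timeout monitor action. replace() atomically re-checks that the region is
   * still PENDING_CLOSE, so a region that was closed in the meantime is skipped
   * instead of being put back into transition.
   *
   * @return true if the forced unassign went ahead
   */
  boolean forceUnassignIfStillPending(String region) {
    return regionsInTransition.replace(region, State.PENDING_CLOSE, State.CLOSING);
  }

  boolean isInTransition(String region) {
    return regionsInTransition.containsKey(region);
  }

  public static void main(String[] args) {
    RegionsInTransitionSketch rit = new RegionsInTransitionSketch();
    rit.sentClose("1a4b111bcc228043e89f59c4c3f6a791");
    rit.handleClosed("1a4b111bcc228043e89f59c4c3f6a791");
    // The late timeout action finds nothing to do.
    System.out.println(rit.forceUnassignIfStillPending("1a4b111bcc228043e89f59c4c3f6a791"));
  }
}
```

The sketch only covers the in-memory half of the problem; in the real master the ZK node creation would also have to be tied to the same check.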
[jira] [Updated] (HBASE-5168) Backport HBASE-5100 - Rollback of split could cause closed region to be opened again
[ https://issues.apache.org/jira/browse/HBASE-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5168:

Attachment: HBASE-5100_0.90.patch

Backport HBASE-5100 - Rollback of split could cause closed region to be opened again

Key: HBASE-5168
URL: https://issues.apache.org/jira/browse/HBASE-5168
Project: HBase
Issue Type: Bug
Reporter: ramkrishna.s.vasudevan
Attachments: HBASE-5100_0.90.patch

Considering the importance of the defect, merging it to 0.90.6.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5152) Region is on service before completing initialization when doing rollback of split, it will affect read correctness
[ https://issues.apache.org/jira/browse/HBASE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183969#comment-13183969 ]

Hudson commented on HBASE-5152:

Integrated in HBase-TRUNK #2617 (See [https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5152 Region is on service before completing initialization when doing rollback of split, it will affect read correctness (Chunhui)

tedyu : Files :
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java

Region is on service before completing initialization when doing rollback of split, it will affect read correctness

Key: HBASE-5152
URL: https://issues.apache.org/jira/browse/HBASE-5152
Project: HBase
Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
Fix For: 0.92.0, 0.94.0
Attachments: 5152-v2.txt, hbase-5152.patch
[jira] [Commented] (HBASE-5052) The path where a dynamically loaded coprocessor jar is copied on the local file system depends on the region name (and implicitly, the start key)
[ https://issues.apache.org/jira/browse/HBASE-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183971#comment-13183971 ]

Hudson commented on HBASE-5052:

Integrated in HBase-TRUNK #2617 (See [https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5052 The path where a dynamically loaded coprocessor jar is copied on the local file system depends on the region name (and implicitly, the start key)

stack : Files :
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/RegionCoprocessorHost.java

The path where a dynamically loaded coprocessor jar is copied on the local file system depends on the region name (and implicitly, the start key)

Key: HBASE-5052
URL: https://issues.apache.org/jira/browse/HBASE-5052
Project: HBase
Issue Type: Bug
Components: coprocessors
Affects Versions: 0.92.0
Reporter: Andrei Dragomir
Assignee: Andrei Dragomir
Fix For: 0.92.0
Attachments: HBASE-5052.patch

When loading a coprocessor from HDFS, the jar file gets copied to a path on the local filesystem that depends on the region name and the region start key. The name is cleaned, but not enough: when it contains filesystem-unfriendly characters (/, ?, :, etc.), the coprocessor is not loaded and an error is thrown.
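To illustrate the kind of cleaning the description says is missing, here is a minimal Java sketch that maps a region-name-like string to a filesystem-friendly one. The class, the method, and the whitelist of characters are assumptions made for this example; this is not the code in RegionCoprocessorHost, and the committed patch may avoid the region name in the path altogether.

```java
/** Hypothetical sketch of filename sanitizing; not the HBASE-5052 patch. */
class JarPathSketch {

  /** Replaces every character outside [A-Za-z0-9._-] with an underscore. */
  static String sanitize(String name) {
    return name.replaceAll("[^A-Za-z0-9._-]", "_");
  }

  public static void main(String[] args) {
    // A made-up region name whose start key contains '/', '?' and ':'.
    System.out.println(sanitize("test1,a/b?c:d,1325635015491"));
  }
}
```

Note that a whitelist like this can map two different names to the same result, so a real implementation would also need something unique in the path.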
[jira] [Commented] (HBASE-5121) MajorCompaction may affect scan's correctness
[ https://issues.apache.org/jira/browse/HBASE-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183972#comment-13183972 ]

Hudson commented on HBASE-5121:

Integrated in HBase-TRUNK #2617 (See [https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5121 MajorCompaction may affect scan's correctness (chunhui shen and Lars H)

larsh : Files :
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/KeyValueHeap.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestScanner.java

MajorCompaction may affect scan's correctness

Key: HBASE-5121
URL: https://issues.apache.org/jira/browse/HBASE-5121
Project: HBase
Issue Type: Bug
Components: regionserver
Affects Versions: 0.90.4
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
Fix For: 0.94.0, 0.92.1
Attachments: 5121-0.92.txt, 5121-suggest.txt, 5121-trunk-combined.txt, 5121.90, hbase-5121-testcase.patch, hbase-5121.patch, hbase-5121v2.patch

In our test, each row has KeyValues in two families. We found an infrequent problem in the scan's next() when a major compaction happens concurrently. In two consecutive scan.next() calls on the client:
1. The first time, next() returns a result where family A is null.
2. The second time, next() returns a result where family B is null.
The two next() results have the same row. If there are more families, I think the scenario will be even stranger...

We found the reason is that StoreScanner.peek() changes after a major compaction if there are delete-type KeyValues. This change means the PriorityQueue<KeyValueScanner> of RegionScanner's heap is no longer guaranteed to be sorted.
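The root cause described above, a heap whose elements change their sort key while they sit in the queue, can be reproduced with plain java.util.PriorityQueue. The sketch below uses a made-up Scanner class with an integer key as a stand-in for a scanner's peek() value; it is an illustration of the general hazard, not the HBase KeyValueHeap code or the committed fix.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical sketch: a priority queue is not re-sorted when a key changes in place. */
class StaleHeapSketch {

  /** Stand-in for a scanner whose peek() key can change underneath the heap. */
  static final class Scanner {
    final String name;
    int peekKey;

    Scanner(String name, int peekKey) {
      this.name = name;
      this.peekKey = peekKey;
    }
  }

  static PriorityQueue<Scanner> newHeap() {
    return new PriorityQueue<>(Comparator.comparingInt((Scanner s) -> s.peekKey));
  }

  /** The key of the head changes without the heap being told: the head is stale. */
  static String headAfterSilentChange() {
    PriorityQueue<Scanner> heap = newHeap();
    Scanner a = new Scanner("A", 1);
    Scanner b = new Scanner("B", 2);
    heap.add(a);
    heap.add(b);
    a.peekKey = 10; // A should now sort after B, but the heap still has A on top
    return heap.peek().name;
  }

  /** Removing the element before the change and re-adding it restores the order. */
  static String headAfterReinsert() {
    PriorityQueue<Scanner> heap = newHeap();
    Scanner a = new Scanner("A", 1);
    Scanner b = new Scanner("B", 2);
    heap.add(a);
    heap.add(b);
    heap.remove(a);
    a.peekKey = 10;
    heap.add(a);
    return heap.peek().name;
  }

  public static void main(String[] args) {
    System.out.println(headAfterSilentChange()); // A
    System.out.println(headAfterReinsert());     // B
  }
}
```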
[jira] [Commented] (HBASE-5141) Memory leak in MonitoredRPCHandlerImpl
[ https://issues.apache.org/jira/browse/HBASE-5141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13183970#comment-13183970 ] Hudson commented on HBASE-5141: --- Integrated in HBase-TRUNK #2617 (See [https://builds.apache.org/job/HBase-TRUNK/2617/]) HBASE-5141 Memory leak in MonitoredRPCHandlerImpl -- REDO HBASE-5141 Memory leak in MonitoredRPCHandlerImpl -- REVERT. OVER-COMMITTED. REVERTING ALL SO CAN REDO COMMIT HBASE-5141 Memory leak in MonitoredRPCHandlerImpl stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HBaseServer.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/monitoring/MonitoredRPCHandlerImpl.java stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HBaseServer.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ClosedRegionHandler.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/monitoring/MonitoredRPCHandlerImpl.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/HBaseServer.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ClosedRegionHandler.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/monitoring/MonitoredRPCHandlerImpl.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java Memory leak in MonitoredRPCHandlerImpl -- Key: HBASE-5141 URL: https://issues.apache.org/jira/browse/HBASE-5141 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans 
Priority: Blocker
Fix For: 0.92.0, 0.94.0
Attachments: HBASE-5141-v2.patch, HBASE-5141.patch, Screen Shot 2012-01-06 at 3.03.09 PM.png

I have a pretty reliable way of OOME'ing my region servers. Using a big payload (64MB in my case), a default heap, and the default number of handlers, it doesn't take long before every MonitoredRPCHandlerImpl holds on to a 64MB reference, and once a compaction kicks in it kills everything. The issue is that even after the RPC call is done, the packet still lives in MonitoredRPCHandlerImpl. Will attach a screenshot of JProfiler's analysis in a moment. This is a blocker for 0.92.0: anyone using a high number of handlers and biggish values will kill themselves.
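The leak pattern described above, a long-lived status object that keeps a reference to the request after the call has finished, can be sketched as follows. The class below is a hypothetical stand-in written for this example; its fields and method names are assumptions and do not mirror MonitoredRPCHandlerImpl or the attached patches.

```java
/** Hypothetical sketch of a per-handler status object; not MonitoredRPCHandlerImpl. */
class RpcStatusSketch {
  private Object[] params;            // may reference a very large payload
  private String state = "WAITING";

  /** Called when a handler starts working on a call. */
  void startCall(Object[] params) {
    this.params = params;
    this.state = "RUNNING";
  }

  /** Called when the call is done: drop the payload so it can be collected. */
  void markComplete() {
    this.params = null;
    this.state = "COMPLETE";
  }

  boolean holdsPayload() {
    return params != null;
  }

  String state() {
    return state;
  }

  public static void main(String[] args) {
    RpcStatusSketch status = new RpcStatusSketch();
    status.startCall(new Object[] { new byte[1024] });
    status.markComplete();
    System.out.println(status.holdsPayload()); // false
  }
}
```

Because one such object lives per handler for the lifetime of the server, the retained memory in the leaking version scales with the number of handlers times the payload size.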
[jira] [Commented] (HBASE-5041) Major compaction on non existing table does not throw error
[ https://issues.apache.org/jira/browse/HBASE-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183973#comment-13183973 ]

Hudson commented on HBASE-5041:

Integrated in HBase-TRUNK #2617 (See [https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5041 Major compaction on non existing table does not throw error (Shrijeet)

tedyu : Files :
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HBaseAdmin.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestAdmin.java

Major compaction on non existing table does not throw error

Key: HBASE-5041
URL: https://issues.apache.org/jira/browse/HBASE-5041
Project: HBase
Issue Type: Bug
Components: regionserver, shell
Affects Versions: 0.90.3
Reporter: Shrijeet Paliwal
Assignee: Shrijeet Paliwal
Fix For: 0.92.0, 0.94.0, 0.90.6
Attachments: 0002-HBASE-5041-Throw-error-if-table-does-not-exist.patch, 0002-HBASE-5041-Throw-error-if-table-does-not-exist.patch, 0003-HBASE-5041-Throw-error-if-table-does-not-exist.0.90.patch

The following will not complain even if fubar does not exist:
{code}
echo "major_compact 'fubar'" | $HBASE_HOME/bin/hbase shell
{code}
The downside of this defect is that a major compaction may be skipped due to a typo by Ops.
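The behaviour asked for here is a fail-fast existence check before the compaction request is queued. A minimal Java sketch follows, assuming a hypothetical in-memory admin class; the names and the exception type are illustrative and are not the HBaseAdmin API or the attached patches.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of a fail-fast admin call; not HBaseAdmin. */
class AdminSketch {
  private final Set<String> tables = new HashSet<>();
  private final List<String> compactionRequests = new ArrayList<>();

  void createTable(String name) {
    tables.add(name);
  }

  /** Rejects unknown tables instead of silently doing nothing. */
  void majorCompact(String tableName) {
    if (!tables.contains(tableName)) {
      // Exception type chosen for the sketch only.
      throw new IllegalArgumentException("Table not found: " + tableName);
    }
    compactionRequests.add(tableName);
  }

  int pendingCompactions() {
    return compactionRequests.size();
  }

  public static void main(String[] args) {
    AdminSketch admin = new AdminSketch();
    admin.createTable("test1");
    admin.majorCompact("test1");
    try {
      admin.majorCompact("fubar");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage()); // Table not found: fubar
    }
  }
}
```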
[jira] [Commented] (HBASE-5134) Remove getRegionServerWithoutRetries and getRegionServerWithRetries from HConnection Interface
[ https://issues.apache.org/jira/browse/HBASE-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13183974#comment-13183974 ] Hudson commented on HBASE-5134: --- Integrated in HBase-TRUNK #2617 (See [https://builds.apache.org/job/HBase-TRUNK/2617/]) HBASE-5134 Remove getRegionServerWithoutRetries and getRegionServerWithRetries from HConnection Interface stack : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HBaseConfiguration.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/ClientScanner.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/ConnectionUtils.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnection.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/MetaScanner.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/ServerCallable.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/ipc/ExecRPCInvoker.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ClosedRegionHandler.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/HConnectionTestingUtility.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFilesSplitRecovery.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestAssignmentManager.java * 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/master/TestCatalogJanitor.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegionServerBulkLoad.java

Remove getRegionServerWithoutRetries and getRegionServerWithRetries from HConnection Interface

Key: HBASE-5134
URL: https://issues.apache.org/jira/browse/HBASE-5134
Project: HBase
Issue Type: Improvement
Reporter: stack
Assignee: stack
Fix For: 0.94.0
Attachments: 5134-v2.txt, 5134-v3.txt, 5134-v4.txt, 5134-v5.txt, 5134-v6.txt, 5134-v6.txt

It's broke having these meta methods in HConnection. They take ServerCallables, which themselves inevitably have HConnections. It makes for a tangle in the model and frustrates being able to do mocked implementations of HConnection. These methods better belong in something like HConnectionManager, or elsewhere altogether.
[jira] [Commented] (HBASE-5172) HTableInterface should extend java.io.Closeable
[ https://issues.apache.org/jira/browse/HBASE-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183976#comment-13183976 ]

Hudson commented on HBASE-5172:

Integrated in HBase-TRUNK #2617 (See [https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5172 HTableInterface should extend java.io.Closeable

stack : Files :
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HTable.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HTableInterface.java

HTableInterface should extend java.io.Closeable

Key: HBASE-5172
URL: https://issues.apache.org/jira/browse/HBASE-5172
Project: HBase
Issue Type: Bug
Reporter: Zhihong Yu
Assignee: stack
Fix For: 0.94.0
Attachments: 5172.txt

Ioan Eugen Stan found this issue.
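One practical consequence of an interface extending java.io.Closeable is that its implementations can be used with try-with-resources (Java 7 and later) and with generic close helpers. The sketch below uses a made-up TableLike interface and an in-memory implementation to show that; it is not HTableInterface, and the names are assumptions for illustration.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical sketch of a table interface that extends Closeable. */
class CloseableTableSketch {

  interface TableLike extends Closeable {
    void put(String row, String value) throws IOException;
  }

  static final class InMemoryTable implements TableLike {
    int puts;
    boolean closed;

    @Override
    public void put(String row, String value) {
      puts++;
    }

    @Override
    public void close() {
      closed = true;
    }
  }

  /** Uses the table in try-with-resources, which calls close() automatically. */
  static InMemoryTable useAndClose() {
    InMemoryTable table = new InMemoryTable();
    try (TableLike t = table) {
      t.put("row1", "value1");
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return table;
  }

  public static void main(String[] args) {
    System.out.println(useAndClose().closed); // true
  }
}
```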
[jira] [Commented] (HBASE-5173) Commit hbase-4480 findHangingTest.sh script under dev-support
[ https://issues.apache.org/jira/browse/HBASE-5173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183979#comment-13183979 ]

Hudson commented on HBASE-5173:

Integrated in HBase-TRUNK #2617 (See [https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5173 Commit hbase-4480 findHangingTest.sh script under dev-support

stack : Files :
* /hbase/trunk/dev-support/findHangingTest.sh

Commit hbase-4480 findHangingTest.sh script under dev-support

Key: HBASE-5173
URL: https://issues.apache.org/jira/browse/HBASE-5173
Project: HBase
Issue Type: Task
Reporter: stack
Fix For: 0.94.0
Attachments: 5173.txt

See hbase-4480 for the script from Ted
[jira] [Commented] (HBASE-5088) A concurrency issue on SoftValueSortedMap
[ https://issues.apache.org/jira/browse/HBASE-5088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183980#comment-13183980 ]

Hudson commented on HBASE-5088:

Integrated in HBase-TRUNK #2617 (See [https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5088 addendum
HBASE-5088 A concurrency issue on SoftValueSortedMap (Jieshan Bean and Lars H)

larsh : Files :
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/SoftValueSortedMap.java

larsh : Files :
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/HConnectionManager.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/SoftValueSortedMap.java

A concurrency issue on SoftValueSortedMap

Key: HBASE-5088
URL: https://issues.apache.org/jira/browse/HBASE-5088
Project: HBase
Issue Type: Bug
Components: client
Affects Versions: 0.90.4, 0.94.0
Reporter: Jieshan Bean
Assignee: Lars Hofhansl
Priority: Critical
Fix For: 0.92.0, 0.94.0, 0.90.6
Attachments: 5088-0.90.txt, 5088-0.92-trunk-addendum.txt, 5088-final3.txt, HBase-5088-90.patch, HBase-5088-trunk.patch, HBase5088-90-replaceSoftValueSortedMap.patch, HBase5088-90-replaceTreeMap.patch, HBase5088-trunk-replaceTreeMap.patch, HBase5088Reproduce.java, PerformanceTestResults.png

SoftValueSortedMap is backed by a TreeMap. All the methods in this class are synchronized, so adding or deleting elements through these methods is OK. But HConnectionManager#getCachedLocation uses headMap to get a view of SoftValueSortedMap#internalMap. Once this view map is operated on (e.g. add/delete) in other threads, a concurrency issue may occur.
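The core of the problem is that TreeMap.headMap returns a live view backed by the original map, so handing that view out of a synchronized method lets callers touch the TreeMap without holding the lock. The following Java sketch shows the difference between returning the view and returning a copy made under the lock. The wrapper class is hypothetical; it is not SoftValueSortedMap, and copying is only one possible remedy, not necessarily what the attached patches do.

```java
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical sketch of a synchronized wrapper around a TreeMap. */
class HeadMapViewSketch {
  private final TreeMap<Integer, String> internalMap = new TreeMap<>();

  synchronized void put(int key, String value) {
    internalMap.put(key, value);
  }

  /** Unsafe: the returned view is backed by internalMap and used outside the lock. */
  synchronized SortedMap<Integer, String> headMapView(int toKey) {
    return internalMap.headMap(toKey);
  }

  /** Safer: the copy is taken while the lock is held and is independent afterwards. */
  synchronized SortedMap<Integer, String> headMapSnapshot(int toKey) {
    return new TreeMap<>(internalMap.headMap(toKey));
  }

  public static void main(String[] args) {
    HeadMapViewSketch map = new HeadMapViewSketch();
    map.put(1, "a");
    SortedMap<Integer, String> view = map.headMapView(5);
    SortedMap<Integer, String> snapshot = map.headMapSnapshot(5);
    map.put(3, "c");
    System.out.println(view.containsKey(3));     // true: the view sees later writes
    System.out.println(snapshot.containsKey(3)); // false: the snapshot does not
  }
}
```

The copy costs time proportional to the size of the head map on every call, which is the usual trade-off of this approach.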
[jira] [Commented] (HBASE-4480) Testing script to simplify local testing
[ https://issues.apache.org/jira/browse/HBASE-4480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183977#comment-13183977 ]

Hudson commented on HBASE-4480:

Integrated in HBase-TRUNK #2617 (See [https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5173 Commit hbase-4480 findHangingTest.sh script under dev-support

Testing script to simplify local testing

Key: HBASE-4480
URL: https://issues.apache.org/jira/browse/HBASE-4480
Project: HBase
Issue Type: Improvement
Affects Versions: 0.90.4
Reporter: Jesse Yates
Priority: Minor
Labels: test
Fix For: 0.94.0
Attachments: HBASE-4480.patch, HBASE-4480_v2.patch, HBASE-4480_v3.patch, HBASE-4480_v4.patch, findHangingTest.sh, runtest-no-npe-check.sh, runtest.sh, runtest2.sh

As mentioned by http://search-hadoop.com/m/r2Ab624ES3e and http://search-hadoop.com/m/cZjDH1ykGIA it would be nice if we could have a script that would handle more of the finer points of running/checking our test suite. This script should:
(1) Allow people to determine which tests are hanging/taking a long time to run
(2) Allow rerunning of particular tests to make sure it wasn't an artifact of running the whole suite that caused the failure
(3) Allow people to specify to run just unit tests or also integration tests (essentially wrapping calls to 'maven test' and 'maven verify').

This script should just be a convenience script - running tests directly from maven should not be impacted.
[jira] [Commented] (HBASE-5137) MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException
[ https://issues.apache.org/jira/browse/HBASE-5137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183978#comment-13183978 ]

Hudson commented on HBASE-5137:

Integrated in HBase-TRUNK #2617 (See [https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-5137 MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException (Ram & Ted)

ramkrishna : Files :
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java

MasterFileSystem.splitLog() should abort even if waitOnSafeMode() throws IOException

Key: HBASE-5137
URL: https://issues.apache.org/jira/browse/HBASE-5137
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.4
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
Fix For: 0.92.0, 0.90.6
Attachments: 5137-trunk.txt, HBASE-5137.patch, HBASE-5137.patch

I am not sure if this bug was already raised in JIRA. In our test cluster we had a scenario where the RS had gone down and ServerShutdownHandler started splitLog. But as HDFS was down, the waitOnSafeMode check threw an IOException.
{code}
try {
  // If FS is in safe mode, just wait till out of it.
  FSUtils.waitOnSafeMode(conf,
    conf.getInt(HConstants.THREAD_WAKE_FREQUENCY, 1000));
  splitter.splitLog();
} catch (OrphanHLogAfterSplitException e) {
{code}
We catch the exception:
{code}
} catch (IOException e) {
  checkFileSystem();
  LOG.error("Failed splitting " + logDir.toString(), e);
}
{code}
So the HLog split itself did not happen. We encountered about 4 regions, recently split on the crashed RS, that were lost. Can we abort the Master in such scenarios? Please suggest.
[jira] [Commented] (HBASE-3949) Add Master link to RegionServer pages
[ https://issues.apache.org/jira/browse/HBASE-3949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13183975#comment-13183975 ]

Hudson commented on HBASE-3949:

Integrated in HBase-TRUNK #2617 (See [https://builds.apache.org/job/HBase-TRUNK/2617/])
HBASE-3949. Add Master link to RegionServer pages. Contributed by Gregory Chanan.

todd : Files :
* /hbase/trunk/src/main/jamon/org/apache/hbase/tmpl/regionserver/RSStatusTmpl.jamon
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestRSStatusServlet.java

Add Master link to RegionServer pages

Key: HBASE-3949
URL: https://issues.apache.org/jira/browse/HBASE-3949
Project: HBase
Issue Type: Improvement
Components: regionserver
Affects Versions: 0.90.3, 0.92.0
Reporter: Lars George
Assignee: Gregory Chanan
Priority: Minor
Labels: noob
Fix For: 0.94.0

Use the ZK info where the master is to add a UI link on the top of each RegionServer page. Currently you cannot navigate directly to the Master UI once you are on a RS page. Not sure if the info port is exposed OTTOMH, but we could either use the RS local config setting for that or add it to ZK to enable lookup.
[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13183989#comment-13183989 ] Hadoop QA commented on HBASE-5120: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12510165/HBASE-5120_4.patch against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 javadoc. The javadoc tool appears to have generated -147 warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 79 new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these unit tests: org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat org.apache.hadoop.hbase.io.hfile.TestLruBlockCache org.apache.hadoop.hbase.mapred.TestTableMapReduce org.apache.hadoop.hbase.mapreduce.TestImportTsv Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/726//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/726//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/726//console This message is automatically generated. 
[jira] [Updated] (HBASE-5153) HConnection re-creation in HTable after HConnection abort
[ https://issues.apache.org/jira/browse/HBASE-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jieshan Bean updated HBASE-5153: Attachment: HBASE-5153-V3.patch
HConnection re-creation in HTable after HConnection abort Key: HBASE-5153 URL: https://issues.apache.org/jira/browse/HBASE-5153 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.4 Reporter: Jieshan Bean Assignee: Jieshan Bean Fix For: 0.90.6 Attachments: HBASE-5153-V2.patch, HBASE-5153-V3.patch, HBASE-5153.patch
HBASE-4893 is related to this issue. From that issue we know that if multiple threads share the same connection and the connection is aborted in one thread, the other threads will get a "HConnectionManager$HConnectionImplementation@18fb1f7 closed" exception. That fix solves the problem of the stale connection not being removed, but the original HTable instance can no longer be used. The connection in HTable should be re-created. There are two approaches to solve this:
1. In user code, on catching an IOE, close the connection and re-create the HTable instance. We can use this as a workaround.
2. On the HBase client side, catch this exception and re-create the connection.
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
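The first approach in the description above (re-create the handle in user code after an IOException) could look roughly like the sketch below. It deliberately does not use the HBase client classes: `RecreatingHolder`, `Factory` and `Op` are hypothetical names, and the assumption is that the factory returns a handle backed by a fresh connection, as re-creating an HTable instance is meant to do.

```java
import java.io.Closeable;
import java.io.IOException;

/**
 * Hypothetical helper, not part of HBase: holds a closeable handle and
 * re-creates it once when an operation fails with an IOException.
 */
final class RecreatingHolder<T extends Closeable> {

  /** Creates a fresh handle; assumed to open a new underlying connection. */
  interface Factory<H> {
    H create() throws IOException;
  }

  /** One operation against the handle. */
  interface Op<H, R> {
    R apply(H handle) throws IOException;
  }

  private final Factory<T> factory;
  private T current;

  RecreatingHolder(Factory<T> factory) throws IOException {
    this.factory = factory;
    this.current = factory.create();
  }

  /** Runs op; on IOException closes the handle, re-creates it, retries once. */
  synchronized <R> R call(Op<T, R> op) throws IOException {
    try {
      return op.apply(current);
    } catch (IOException first) {
      try {
        current.close();
      } catch (IOException ignored) {
        // The old handle is already unusable, so closing is best effort.
      }
      current = factory.create();
      return op.apply(current); // a second failure propagates to the caller
    }
  }
}
```

A caller would wrap each table operation in `call(...)`. Retrying on every IOException is a simplification made for this sketch; real code would probably want to retry only on the "closed" connection error described above.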
[jira] [Commented] (HBASE-5128) [uber hbck] Enable hbck to automatically repair table integrity problems as well as region consistency problems while online.
[ https://issues.apache.org/jira/browse/HBASE-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184031#comment-13184031 ] jirapos...@reviews.apache.org commented on HBASE-5128:
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3435/
(Updated 2012-01-11 12:46:37.524636)
Review request for hbase, Todd Lipcon, Ted Yu, Michael Stack, and Jean-Daniel Cryans.

Changes
-------
Fixed bug link. Added JD.
JD -- the code that is similar to merging is:
- #handleOverlapGroup
- inMeta && !inHdfs && isDeployed (in another rev I've added an unassign and believe I still have the disable/delete problem).

Summary
-------
I'm posting a preliminary version that I'm currently testing on real clusters. The tests are flaky on the 0.90 branch (so there is something async that I didn't synchronize properly), and there are a few more TODOs I want to knock out before this is ready for full review and consideration for commit. It has some problems I need advice on.

Problem 1: In the unit tests, I have a few cases where I fabricate new regions and try to force the overlapping regions to be closed. For some of these, I cannot delete a table after it is repaired without causing subsequent tests to fail. I think this is due to a few things:
1) The disable table handler uses in-memory assignment manager state, while delete uses the assignment information in META.
2) Currently I'm using the sneaky closeRegion that purposely doesn't go through the master and in turn doesn't modify in-memory state, so disable uses out-of-date in-memory region assignments. If I use the unassign method instead, it sends RIT transitions to the master, which then attempts to assign the region again, causing timing issues and transient states.
What is a good way to clear the HMaster's assignment manager's assignment data for particular regions, or to force it to re-read from META (without modifying the 0.90 HBase it is meant to repair)?

Problem 2: Sometimes tests fail, reporting HOLE_IN_REGION_CHAIN and SERVER_DOES_NOT_MATCH_META. This means the old and new regions are confused with each other, and basically something is still happening asynchronously. I think this is because the new region is being assigned and is still transitioning. Sound about right? To make the unit test deterministic, should hbck wait for these to settle, or should just the unit test wait?

This addresses bug HBASE-5128. https://issues.apache.org/jira/browse/HBASE-5128

Diffs
-----
src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java 6d3401d
src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java a3d8b8b
src/main/java/org/apache/hadoop/hbase/util/hbck/OfflineMetaRepair.java 29e8bb2
src/main/java/org/apache/hadoop/hbase/util/hbck/TableIntegrityErrorHandler.java PRE-CREATION
src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java a640d57
src/test/java/org/apache/hadoop/hbase/util/hbck/HbckTestingUtil.java dbb97f8
src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildBase.java 3e8729d
src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildHole.java 11a1151
src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildOverlap.java 4a09ce2
Diff: https://reviews.apache.org/r/3435/diff

Testing
-------
All unit tests pass sometimes. Some fail sometimes (generally the cases that fabricate new regions). Not ready for commit.
Thanks, jmhsieh

[uber hbck] Enable hbck to automatically repair table integrity problems as well as region consistency problems while online. Key: HBASE-5128 URL: https://issues.apache.org/jira/browse/HBASE-5128 Project: HBase Issue Type: New Feature Components: hbck Affects Versions: 0.92.0, 0.90.5 Reporter: Jonathan Hsieh Assignee: Jonathan Hsieh
The current (0.90.5, 0.92.0rc2) versions of hbck detect most region consistency and table integrity invariant violations. However, with '-fix' it can only automatically repair region consistency cases having to do with deployment problems. This updated version should be able to handle all cases (including a new orphan regiondir case). When complete, it will likely deprecate the OfflineMetaRepair tool and subsume several open META-hole related issues. Here's the approach (from the comment at the top of the new version of the file).
{code}
/**
 * HBaseFsck (hbck) is a tool for checking and repairing region consistency and
 * table integrity.
 *
 * Region consistency checks
[jira] [Commented] (HBASE-5153) HConnection re-creation in HTable after HConnection abort
[ https://issues.apache.org/jira/browse/HBASE-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184033#comment-13184033 ] Hadoop QA commented on HBASE-5153:
-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12510179/HBASE-5153-V3.patch against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/727//console
This message is automatically generated.
[jira] [Commented] (HBASE-5153) HConnection re-creation in HTable after HConnection abort
[ https://issues.apache.org/jira/browse/HBASE-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184092#comment-13184092 ] Zhihong Yu commented on HBASE-5153: @Jieshan: Can you prepare a patch for trunk?
[jira] [Updated] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA (The directory is already locked.)
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5163: Summary: TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA (The directory is already locked.) (was: TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.))
[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA (The directory is already locked.)
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184113#comment-13184113 ] Zhihong Yu commented on HBASE-5163: Integrated to TRUNK. Thanks for the patch, N. Thanks for the review, Stack.
TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA (The directory is already locked.) Key: HBASE-5163 URL: https://issues.apache.org/jira/browse/HBASE-5163 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5163.patch
The stack is typically:
{noformat}
error message=Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked. type=java.io.IOException
java.io.IOException: Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked.
  at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
  at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
  at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
  at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
  at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
  at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
  at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
  // ...
{noformat}
It can be reproduced without parallelization and without executing the other tests in the class. It seems to fail about 5% of the time. This comes from the naming policy for the directories in MiniDFSCluster#startDataNodes. It depends on the number of nodes *currently* in the cluster, and does not take into account previous starts/stops:
{noformat}
for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
  if (manageDfsDirs) {
    File dir1 = new File(data_dir, "data"+(2*i+1));
    File dir2 = new File(data_dir, "data"+(2*i+2));
    dir1.mkdirs();
    dir2.mkdirs();
    // [...]
{noformat}
This means that if we want to stop and then start a datanode, we should always stop the last one; otherwise the names will conflict.
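To make the naming policy above concrete, here is a toy model of it. It is only a sketch: `DataDirNaming` and its methods are hypothetical names, and the model assumes each live datanode keeps the two directories it was given for the loop index it was started with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the naming policy quoted above; it is not MiniDFSCluster. */
final class DataDirNaming {

  /** Directories used by the datanode started with loop index i. */
  static List<String> dirsForIndex(int i) {
    return Arrays.asList("data" + (2 * i + 1), "data" + (2 * i + 2));
  }

  /**
   * Returns the directories that the next started datanode would share with
   * datanodes that are still alive. liveIndexes holds the loop index each
   * live datanode was started with.
   */
  static List<String> conflictsOnNextStart(List<Integer> liveIndexes) {
    int next = liveIndexes.size(); // depends only on the current node count
    List<String> conflicts = new ArrayList<String>();
    for (int live : liveIndexes) {
      for (String dir : dirsForIndex(live)) {
        if (dirsForIndex(next).contains(dir)) {
          conflicts.add(dir);
        }
      }
    }
    return conflicts;
  }

  public static void main(String[] args) {
    // Two datanodes (indexes 0 and 1); stopping the last one is safe.
    System.out.println(conflictsOnNextStart(Arrays.asList(0))); // prints []
    // Stopping the first one leaves index 1 alive, which clashes.
    System.out.println(conflictsOnNextStart(Arrays.asList(1))); // prints [data3, data4]
  }
}
```

The second case reproduces the `data3` directory named in the stack trace: with only the datanode of index 1 alive, the next start computes index 1 again.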
This test exhibits the behavior:
{noformat}
@Test
public void testMiniDFSCluster_startDataNode() throws Exception {
  assertTrue( dfsCluster.getDataNodes().size() == 2 );

  // Works, as we kill the last datanode, we can now start a datanode
  dfsCluster.stopDataNode(1);
  dfsCluster
    .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);

  // Fails, as it's not the last datanode, the directory will conflict on
  // creation
  dfsCluster.stopDataNode(0);
  try {
    dfsCluster
      .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
    fail("There should be an exception because the directory already exists");
  } catch (IOException e) {
    assertTrue( e.getMessage().contains("The directory is already locked."));
    LOG.info("Expected (!) exception caught " + e.getMessage());
  }

  // Works, as we kill the last datanode, we can now restart 2 datanodes
  // This makes us back with 2 nodes
  dfsCluster.stopDataNode(0);
  dfsCluster
    .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
}
{noformat}
This behavior is then randomly triggered in testLogRollOnDatanodeDeath, because when we do
{noformat}
DatanodeInfo[] pipeline = getPipeline(log);
assertTrue(pipeline.length == fs.getDefaultReplication());
{noformat}
and then kill the datanodes in the pipeline, we will have:
- most of the time: pipeline = 1 & 2, so after killing 1 & 2 we can start a new datanode that will reuse the available 2's directory.
- sometimes: pipeline = 1 & 3. In this case, when we try to launch the new datanode, it fails because it wants to use the same directory as the still alive '2'. There are two
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184120#comment-13184120 ] Zhihong Yu commented on HBASE-5179:
{code}
+  private final Set<ServerName> processingDeadServers = new HashSet<ServerName>();
{code}
The field name above sounds like a method name. How about naming it deadServersUnderProcessing? Related method names should be changed as well.
{code}
+   * Called on startup. Figures whether a fresh cluster start of we are joining
{code}
should read 'start or we are'.
For ServerManager.java and DeadServer.java:
{code}
+  public Set<ServerName> getProcessingDeadServers() {
+    return this.deadservers.cloneProcessingDeadServers();
+  }
{code}
The method should be called cloneDeadServersUnderProcessing().
Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: hbase-5179.patch
If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler is processing it.
3. The master starts rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA because it is a dead server as of step 3.
However, while step 4 (assigning regions) is running, ServerShutdownHandler may still be splitting the log, so data loss may result.
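The idea behind the field discussed in the review (a set of dead servers that are still being processed) can be sketched as follows. This is an illustration under stated assumptions, not the patch itself: `DeadServerTracker`, `startProcessing`, `finishProcessing` and `safeToAssignRegionsOf` are hypothetical names, and server names are plain strings rather than ServerName. Only `deadServersUnderProcessing` and `cloneDeadServersUnderProcessing()` follow the names suggested in the comment above.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative only: tracks dead servers whose logs are still being split.
 * Server names are plain strings here; the patch under review uses ServerName.
 */
final class DeadServerTracker {
  private final Set<String> deadServersUnderProcessing = new HashSet<String>();

  /** Called when shutdown handling (including log splitting) starts. */
  synchronized void startProcessing(String serverName) {
    deadServersUnderProcessing.add(serverName);
  }

  /** Called once log splitting for the server has completed. */
  synchronized void finishProcessing(String serverName) {
    deadServersUnderProcessing.remove(serverName);
  }

  /** Returns a copy so callers can iterate without holding the lock. */
  synchronized Set<String> cloneDeadServersUnderProcessing() {
    return new HashSet<String>(deadServersUnderProcessing);
  }

  /** Failover code would skip assignment while this returns false. */
  synchronized boolean safeToAssignRegionsOf(String serverName) {
    return !deadServersUnderProcessing.contains(serverName);
  }
}
```

In terms of the four steps above, the failover path at step 4 would consult the tracker and leave RegionserverA's regions to ServerShutdownHandler for as long as the server is still listed.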
[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5179: Status: Patch Available (was: Open)
[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA (The directory is already locked.)
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184147#comment-13184147 ] Hudson commented on HBASE-5163: Integrated in HBase-TRUNK #2618 (See [https://builds.apache.org/job/HBase-TRUNK/2618/]) HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA (The directory is already locked.) (N Keywal) tedyu : Files : * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184155#comment-13184155 ] Hadoop QA commented on HBASE-5179:
-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12510164/hbase-5179.patch against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
-1 javadoc. The javadoc tool appears to have generated -147 warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 78 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.mapreduce.TestImportTsv
org.apache.hadoop.hbase.mapred.TestTableMapReduce
org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/728//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/728//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/728//console
This message is automatically generated.
[jira] [Commented] (HBASE-5155) ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
[ https://issues.apache.org/jira/browse/HBASE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184164#comment-13184164 ]

ramkrishna.s.vasudevan commented on HBASE-5155:
---

I could not upload the patch today because a test case is still failing. I will upload it tomorrow.

ServerShutDownHandler And Disable/Delete should not happen parallely leading to recreation of regions that were deleted
---
Key: HBASE-5155
URL: https://issues.apache.org/jira/browse/HBASE-5155
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.4
Reporter: ramkrishna.s.vasudevan
Priority: Blocker

ServerShutdownHandler races with the disable/delete table handlers. This issue is not caused by the TM.
- A regionserver goes down. In our cluster the regionserver holds a lot of regions.
- A region R1 has two daughters, D1 and D2.
- The ServerShutdownHandler gets called, scans META, and gets all the user regions.
- In parallel, a table is disabled. (No problem in this step.)
- Delete table is done.
- The table and its regions are deleted, including R1, D1 and D2. (So META is cleaned.)
- Now ServerShutdownHandler starts to process the dead region:
{code}
if (hri.isOffline() && hri.isSplit()) {
  LOG.debug("Offlined and split region " + hri.getRegionNameAsString() +
    "; checking daughter presence");
  fixupDaughters(result, assignmentManager, catalogTracker);
{code}
As part of fixupDaughters, because the daughters D1 and D2 are missing for R1
{code}
if (isDaughterMissing(catalogTracker, daughter)) {
  LOG.info("Fixup; missing daughter " + daughter.getRegionNameAsString());
  MetaEditor.addDaughter(catalogTracker, daughter, null);
  // TODO: Log WARN if the regiondir does not exist in the fs. If its not
  // there then something wonky about the split -- things will keep going
  // but could be missing references to parent region.
  // And assign it.
  assignmentManager.assign(daughter, true);
{code}
we call assign on the daughters. After this, we continue with the code below.
{code}
if (processDeadRegion(e.getKey(), e.getValue(),
    this.services.getAssignmentManager(),
    this.server.getCatalogTracker())) {
  this.services.getAssignmentManager().assign(e.getKey(), true);
{code}
When the SSH scanned META, it had R1, D1 and D2. So, as part of the above code, D1 and D2, which were already assigned by fixupDaughters, are assigned again by
{code}
this.services.getAssignmentManager().assign(e.getKey(), true);
{code}
This leads to a ZooKeeper bad-version error that kills the master. The important part here is that the regions that were deleted are recreated, which I think is more critical.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
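The race above comes from acting on a META scan that has gone stale. As a rough illustration only (this is not the patch attached to the issue, and every name below is hypothetical rather than an HBase API), a handler could re-check each region against the current META contents before assigning it, and assign each region at most once:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory model of the race; none of these names are HBase APIs.
final class StaleScanGuardSketch {

    /**
     * Returns the regions that are still safe to assign: present in the
     * current META view and not already selected earlier in this pass.
     */
    static List<String> regionsToAssign(List<String> staleScan, Set<String> currentMeta) {
        Set<String> selected = new LinkedHashSet<>();
        for (String region : staleScan) {
            // Skip regions deleted after the scan (table disabled, then dropped).
            if (!currentMeta.contains(region)) {
                continue;
            }
            // A set keeps each region once, so a daughter is not assigned twice.
            selected.add(region);
        }
        return new ArrayList<>(selected);
    }

    public static void main(String[] args) {
        // The scan saw R1, D1 and D2, but the table was deleted afterwards.
        List<String> staleScan = Arrays.asList("R1", "D1", "D2", "D1", "D2");
        Set<String> currentMeta = new HashSet<>();
        System.out.println(regionsToAssign(staleScan, currentMeta));
    }
}
```

With an empty current META view, nothing is selected for assignment, so the deleted regions are not recreated. A real fix would also have to make the check and the assignment atomic with respect to the delete handler, which this sketch does not attempt.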
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184170#comment-13184170 ]

ramkrishna.s.vasudevan commented on HBASE-5179:
---

@Chunhui
Is this issue applicable to 0.90.6? If so, can you prepare a patch for 0.90 also?

Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
---
Key: HBASE-5179
URL: https://issues.apache.org/jira/browse/HBASE-5179
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Attachments: hbase-5179.patch

If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. The master completes splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts rebuildUserRegions, and RegionserverA is considered a dead server.
4. The master starts to assign the regions of RegionserverA because it is a dead server according to step 3.
However, during step 4 (assigning regions), ServerShutdownHandler may still be splitting the log. Therefore, data loss may result.
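The ordering problem in steps 1 to 4 can be pictured as a missing gate between log splitting and assignment. The following is a minimal sketch under that assumption; it is not the approach taken in hbase-5179.patch, and the class and method names are invented for illustration:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical gate: regions of a server may be assigned only after its log split finishes.
final class SplitLogGateSketch {
    // Servers whose logs are currently being split by the shutdown handler.
    private final Set<String> serversBeingSplit = ConcurrentHashMap.newKeySet();

    void startLogSplit(String server) {
        serversBeingSplit.add(server);
    }

    void finishLogSplit(String server) {
        serversBeingSplit.remove(server);
    }

    /** The failover path would consult this before assigning a dead server's regions. */
    boolean mayAssignRegionsOf(String server) {
        return !serversBeingSplit.contains(server);
    }

    public static void main(String[] args) {
        SplitLogGateSketch gate = new SplitLogGateSketch();
        gate.startLogSplit("RegionserverA");
        System.out.println(gate.mayAssignRegionsOf("RegionserverA")); // split still running
        gate.finishLogSplit("RegionserverA");
        System.out.println(gate.mayAssignRegionsOf("RegionserverA")); // split done
    }
}
```

The point of the sketch is only the ordering constraint: assignment in step 4 waits until the split started in step 2 has been marked finished.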
[jira] [Updated] (HBASE-5115) Change HBase color from purple to International Orange (Engineering)
[ https://issues.apache.org/jira/browse/HBASE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5115:
-
Attachment: 01_orange.svg
            01_orange.png

Change HBase color from purple to International Orange (Engineering)
Key: HBASE-5115
URL: https://issues.apache.org/jira/browse/HBASE-5115
Project: HBase
Issue Type: Task
Reporter: stack
Assignee: stack
Attachments: 01_orange.png, 01_orange.svg

See http://en.wikipedia.org/wiki/International_orange
See the bit about the color of the golden gate bridge.
[jira] [Assigned] (HBASE-5115) Change HBase color from purple to International Orange (Engineering)
[ https://issues.apache.org/jira/browse/HBASE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack reassigned HBASE-5115:

Assignee: stack

Change HBase color from purple to International Orange (Engineering)
Key: HBASE-5115
URL: https://issues.apache.org/jira/browse/HBASE-5115
Project: HBase
Issue Type: Task
Reporter: stack
Assignee: stack
Attachments: 01_orange.png, 01_orange.svg
[jira] [Commented] (HBASE-5115) Change HBase color from purple to International Orange (Engineering)
[ https://issues.apache.org/jira/browse/HBASE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184171#comment-13184171 ]

stack commented on HBASE-5115:
--

Here is logo done in IA(Engineering).

Change HBase color from purple to International Orange (Engineering)
Key: HBASE-5115
URL: https://issues.apache.org/jira/browse/HBASE-5115
Project: HBase
Issue Type: Task
Reporter: stack
Assignee: stack
Attachments: 01_orange.png, 01_orange.svg, H_orange.png, H_orange.svg
[jira] [Updated] (HBASE-5115) Change HBase color from purple to International Orange (Engineering)
[ https://issues.apache.org/jira/browse/HBASE-5115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5115:
-
Attachment: H_orange.svg
            H_orange.png

Change HBase color from purple to International Orange (Engineering)
Key: HBASE-5115
URL: https://issues.apache.org/jira/browse/HBASE-5115
Project: HBase
Issue Type: Task
Reporter: stack
Assignee: stack
Attachments: 01_orange.png, 01_orange.svg, H_orange.png, H_orange.svg
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184181#comment-13184181 ]

ramkrishna.s.vasudevan commented on HBASE-5179:
---

@Chunhui
Can you take a look at HBASE-4748? It is similar to this issue, but there the data loss was w.r.t. META, which makes it more critical. It is quite rare but still possible. Do you have any suggestions for that?

Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
---
Key: HBASE-5179
URL: https://issues.apache.org/jira/browse/HBASE-5179
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Attachments: hbase-5179.patch
[jira] [Updated] (HBASE-3565) Add a metric to keep track of slow HLog appends
[ https://issues.apache.org/jira/browse/HBASE-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-3565:
--
Status: Patch Available (was: Open)

Add a metric to keep track of slow HLog appends
---
Key: HBASE-3565
URL: https://issues.apache.org/jira/browse/HBASE-3565
Project: HBase
Issue Type: Improvement
Components: metrics, regionserver
Reporter: Benoit Sigoure
Assignee: Mubarak Seyed
Labels: monitoring
Fix For: 0.94.0
Attachments: HBASE-3565.trunk.v1.patch

Whenever an edit takes too long to be written to an HLog, HBase logs a warning such as this one:
{code}
2011-02-23 20:03:14,703 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 21 on 60020 took 15065ms appending an edit to hlog; editcount=126050
{code}
I would like to have a counter incremented each time this happens and this counter exposed via the metrics stuff in HBase so I can collect it in my monitoring system.
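The counter requested above could look roughly like the sketch below. It is an assumption-laden illustration, not the content of HBASE-3565.trunk.v1.patch: the class name, the threshold parameter, and the way the count is read are all hypothetical, and wiring it into HBase's metrics framework is left out.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical counter for appends that exceed a latency threshold.
final class SlowAppendCounterSketch {
    private final long thresholdMs;
    private final AtomicLong slowAppends = new AtomicLong();

    SlowAppendCounterSketch(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    /** Called once per append with the measured duration in milliseconds. */
    void recordAppend(long elapsedMs) {
        if (elapsedMs > thresholdMs) {
            // Same condition that would trigger the WARN line; count it as well.
            slowAppends.incrementAndGet();
        }
    }

    /** Value a metrics reporter would publish. */
    long slowAppendCount() {
        return slowAppends.get();
    }

    public static void main(String[] args) {
        SlowAppendCounterSketch counter = new SlowAppendCounterSketch(1000L);
        counter.recordAppend(15065L); // the slow append from the log line above
        counter.recordAppend(3L);     // a normal append
        System.out.println("slow appends: " + counter.slowAppendCount());
    }
}
```

An AtomicLong is used because appends arrive from many IPC handler threads; the 1000 ms threshold in main is an arbitrary value chosen for the example.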
[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5179:
--
Attachment: 5179-v2.txt

Chunhui's patch for TRUNK with minor renaming.

Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
---
Key: HBASE-5179
URL: https://issues.apache.org/jira/browse/HBASE-5179
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Attachments: 5179-v2.txt, hbase-5179.patch
[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184203#comment-13184203 ]

ramkrishna.s.vasudevan commented on HBASE-5120:
---

Latest patch available.

Timeout monitor races with table disable handler

Key: HBASE-5120
URL: https://issues.apache.org/jira/browse/HBASE-5120
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Priority: Blocker
Fix For: 0.94.0, 0.92.1
Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch

Here is what J-D described here:
https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176

I think I will retract from my statement that it used to be extremely racy and caused more troubles than it fixed, on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.

First the region gets closed
{quote}
2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
{quote}
2 minutes later it times out:
{quote}
2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
{quote}
100ms later the master finally gets the event:
{quote}
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
{quote}
At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
{quote}
2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
{quote}
Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.

The late ZK notification that the znode was deleted (but it got recreated after):
{quote}
2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
{quote}
Then it prints this, and much later tries to unassign it again:
{quote}
2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
...
2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition;
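The log sequence above is a check-then-act race: the timeout monitor decides to force an unassign while the region is PENDING_CLOSE, the closed-region handler then removes it from regions in transition, and the monitor still goes on to recreate the znode. As a hedged illustration only (this is not what the HBASE-5120 patches do, and the class below is a made-up in-memory stand-in for the master's state), the forced unassign can be made conditional on the region still being in the expected state at the moment it acts:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical model of regions in transition (RIT); not the AssignmentManager API.
final class RitRecheckSketch {
    enum State { PENDING_CLOSE, CLOSING }

    private final ConcurrentMap<String, State> regionsInTransition = new ConcurrentHashMap<>();

    /** Master sent CLOSE to the regionserver. */
    void sentClose(String region) {
        regionsInTransition.put(region, State.PENDING_CLOSE);
    }

    /** Closed handler for a table being disabled: drop the region from RIT. */
    void handleClosed(String region) {
        regionsInTransition.remove(region);
    }

    /**
     * Timeout monitor path. The transition succeeds only if the region is
     * still PENDING_CLOSE right now, so a concurrent handleClosed wins cleanly.
     */
    boolean forceUnassign(String region) {
        return regionsInTransition.replace(region, State.PENDING_CLOSE, State.CLOSING);
    }

    public static void main(String[] args) {
        RitRecheckSketch master = new RitRecheckSketch();
        master.sentClose("1a4b111bcc228043e89f59c4c3f6a791");
        master.handleClosed("1a4b111bcc228043e89f59c4c3f6a791");
        // The late timeout no longer recreates state for a region that is gone.
        System.out.println(master.forceUnassign("1a4b111bcc228043e89f59c4c3f6a791"));
    }
}
```

ConcurrentMap.replace(key, oldValue, newValue) acts as a compare-and-set here. The real master also has to keep ZooKeeper in step with its in-memory state, which this sketch ignores.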
[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184205#comment-13184205 ]

Zhihong Yu commented on HBASE-5120:
---

Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode() ?

Timeout monitor races with table disable handler

Key: HBASE-5120
URL: https://issues.apache.org/jira/browse/HBASE-5120
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Priority: Blocker
Fix For: 0.94.0, 0.92.1
Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch
[jira] [Issue Comment Edited] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184205#comment-13184205 ]

Zhihong Yu edited comment on HBASE-5120 at 1/11/12 5:17 PM:

Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode() ?
{code}
+LOG.debug("The deletion of the CLOSED node for the region "
++ region.getEncodedName() + " returned " + deleteNode);
{code}

was (Author: zhi...@ebaysf.com):
Can you change LOG.debug() to LOG.error() in deleteClosingOrClosedNode() ?

Timeout monitor races with table disable handler

Key: HBASE-5120
URL: https://issues.apache.org/jira/browse/HBASE-5120
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Priority: Blocker
Fix For: 0.94.0, 0.92.1
Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch
[jira] [Commented] (HBASE-5153) HConnection re-creation in HTable after HConnection abort
[ https://issues.apache.org/jira/browse/HBASE-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184211#comment-13184211 ]

stack commented on HBASE-5153:
--

Patch looks good. I like your addition of a specific Exception for closed state.

Does this have to be public, Jieshan?
{code}
getRegionServerWithRetries
{code}
Same for processBatch and getRegionLocation. If public, they should be in HTableInterface, but they seem to be implementation methods rather than something that should be part of the public interface.

A style nit -- i.e. not important, but if you are going to redo the patch you might want to address it -- is that you do this in handleConnectionClosedException
{code}
+if (ioe instanceof ConnectionClosedException) {
{code}
and the whole method is dealing with the case where the above is true. I'd suggest that you might do:
{code}
if (!(ioe instanceof ConnectionClosedException)) return;
{code}
... then you save a whole indent and it's clear that the method is all about dealing with ConnectionClosedException.

Is it right including this in HTable?
{code}
getPauseTime
{code}
In trunk that is in a new ConnectionUtils class. Maybe you have to do it for 0.90?

I'm wondering if the class ConnectionClosedException needs to be public also? It's only used in this package, right?

HConnection re-creation in HTable after HConnection abort
-
Key: HBASE-5153
URL: https://issues.apache.org/jira/browse/HBASE-5153
Project: HBase
Issue Type: Bug
Components: client
Affects Versions: 0.90.4
Reporter: Jieshan Bean
Assignee: Jieshan Bean
Fix For: 0.90.6
Attachments: HBASE-5153-V2.patch, HBASE-5153-V3.patch, HBASE-5153.patch

HBASE-4893 is related to this issue. In that issue, we learned that if multiple threads share the same connection, once this connection is aborted in one thread, the other threads get an "HConnectionManager$HConnectionImplementation@18fb1f7 closed" exception.
That issue solved the problem of the stale connection not being removed. But the original HTable instance cannot continue to be used. The connection in HTable should be recreated.
Actually, there are two approaches to solve this:
1. In user code, once an IOE is caught, close the connection and re-create the HTable instance. We can use this as a workaround.
2. On the HBase client side, catch this exception and re-create the connection.
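The early-return style suggested in the review, combined with approach 2 from the description, might be sketched as follows. Everything here is a stand-in: the nested ConnectionClosedException only mimics the class the patch is said to add, and the reconnect counter is a placeholder for whatever re-creating the connection actually involves.

```java
import java.io.IOException;

// Hypothetical client-side handler; not the code from the HBASE-5153 patches.
final class GuardClauseSketch {

    /** Stand-in for the patch's exception type, kept package-private as the review suggests. */
    static class ConnectionClosedException extends IOException {
        private static final long serialVersionUID = 1L;

        ConnectionClosedException(String message) {
            super(message);
        }
    }

    private int reconnects;

    void handleConnectionClosedException(IOException ioe) {
        // Guard clause: everything below deals only with a closed connection.
        if (!(ioe instanceof ConnectionClosedException)) return;
        // Placeholder for closing the stale connection and creating a new one.
        reconnects++;
    }

    int reconnects() {
        return reconnects;
    }

    public static void main(String[] args) {
        GuardClauseSketch table = new GuardClauseSketch();
        table.handleConnectionClosedException(new IOException("unrelated failure"));
        table.handleConnectionClosedException(new ConnectionClosedException("closed"));
        System.out.println("reconnects: " + table.reconnects());
    }
}
```

The guard clause removes one level of nesting, which is the whole of the style nit; whether other IOExceptions should be rethrown instead of ignored is a separate design question.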
[jira] [Commented] (HBASE-3565) Add a metric to keep track of slow HLog appends
[ https://issues.apache.org/jira/browse/HBASE-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184220#comment-13184220 ]

Hadoop QA commented on HBASE-3565:
--

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12510132/HBASE-3565.trunk.v1.patch
against trunk revision .

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

-1 javadoc. The javadoc tool appears to have generated -147 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 78 new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.replication.TestReplicationPeer
org.apache.hadoop.hbase.mapreduce.TestImportTsv
org.apache.hadoop.hbase.mapred.TestTableMapReduce
org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/729//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/729//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/729//console

This message is automatically generated.

Add a metric to keep track of slow HLog appends
---
Key: HBASE-3565
URL: https://issues.apache.org/jira/browse/HBASE-3565
Project: HBase
Issue Type: Improvement
Components: metrics, regionserver
Reporter: Benoit Sigoure
Assignee: Mubarak Seyed
Labels: monitoring
Fix For: 0.94.0
Attachments: HBASE-3565.trunk.v1.patch
[jira] [Commented] (HBASE-5150) Fail in a thread may not fail a test, clean up log splitting test
[ https://issues.apache.org/jira/browse/HBASE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184221#comment-13184221 ]

Jimmy Xiang commented on HBASE-5150:

Those failed tests passed on my local box.

Fail in a thread may not fail a test, clean up log splitting test
-
Key: HBASE-5150
URL: https://issues.apache.org/jira/browse/HBASE-5150
Project: HBase
Issue Type: Test
Affects Versions: 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
Attachments: hbase-5150.txt, hbase_5150_v3.patch

This is to clean up some tests for HBASE-5081. The Assert.fail method in a separate thread will terminate the thread, but may not fail the test.
We can use a Callable, so that we can get the error when getting the result.
Some documentation to explain the test will be helpful too.
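The Callable idea from the description can be shown with plain java.util.concurrent, without any test framework. This is a generic sketch, not the code in hbase_5150_v3.patch; the helper name is made up. A failure thrown on the worker thread is captured by the Future and resurfaces on the calling thread when the result is fetched:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper showing how a worker-thread failure reaches the test thread.
final class CallableFailureSketch {

    /** Runs the check on a worker thread and returns the failure it raised, or null. */
    static Throwable runAndCollectFailure(Callable<Void> check) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<Void> result = pool.submit(check);
            result.get(); // rethrows worker failures wrapped in ExecutionException
            return null;
        } catch (ExecutionException e) {
            return e.getCause(); // the original AssertionError or exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return e;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        Throwable failure = runAndCollectFailure(() -> {
            throw new AssertionError("assertion failed inside worker thread");
        });
        // A test would now fail on the main thread if failure is non-null.
        System.out.println("captured: " + failure);
    }
}
```

In a real test the caller would rethrow or assert on the returned failure, so that a failed assertion in the worker fails the test instead of only ending the thread.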
[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
Status: Patch Available (was: Open)

Timeout monitor races with table disable handler

Key: HBASE-5120
URL: https://issues.apache.org/jira/browse/HBASE-5120
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Assignee: ramkrishna.s.vasudevan
Priority: Blocker
Fix For: 0.94.0, 0.92.1
Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch

Here is what J-D described here:
https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176

I think I will retract from my statement that it used to be extremely racy and caused more troubles than it fixed, on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.

First the region gets closed
{quote}
2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
{quote}

2 minutes later it times out:
{quote}
2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
{quote}

100ms later the master finally gets the event:
{quote}
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
{quote}

At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
{quote}
2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
{quote}

Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.

The late ZK notification that the znode was deleted (but it got recreated after):
{quote}
2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
{quote}

Then it prints this, and much later tries to unassign it again:
{quote}
2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
...
2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition;
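The interleaving in the logs above can be replayed in a small, single-threaded model. This is only a sketch under stated assumptions: `RitRaceSketch`, its fields, and its method names are hypothetical stand-ins for the master's regions-in-transition map and the ZK unassigned znodes, and the guarded variant is one possible way to avoid the stuck region, not necessarily what the attached patches do.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class RitRaceSketch {
    // Hypothetical stand-ins: regions in transition (region -> state) and unassigned znodes.
    private final Map<String, String> regionsInTransition = new HashMap<>();
    private final Set<String> znodes = new HashSet<>();

    /** Master sends CLOSE; the region enters PENDING_CLOSE. */
    synchronized void sendClose(String region) {
        znodes.add(region);
        regionsInTransition.put(region, "PENDING_CLOSE");
    }

    /** CLOSED event for a table being disabled: delete the znode and drop the region from RIT. */
    synchronized void handleClosed(String region) {
        znodes.remove(region);
        regionsInTransition.remove(region);
    }

    /** Forced unassign as seen in the logs: recreates the znode regardless of what happened meanwhile. */
    synchronized void forceUnassignUnguarded(String region) {
        znodes.add(region);
        regionsInTransition.put(region, "PENDING_CLOSE");
    }

    /** Assumed fix: re-check under the lock that the region is still in transition before acting. */
    synchronized boolean forceUnassignGuarded(String region) {
        if (!regionsInTransition.containsKey(region)) {
            return false; // the CLOSED event already cleaned up; nothing to do
        }
        znodes.add(region);
        return true;
    }

    synchronized boolean isStuck(String region) {
        return regionsInTransition.containsKey(region);
    }

    /** Replays the logged order: CLOSE sent, timeout fires, CLOSED handled, forced unassign runs. */
    static boolean replay(boolean guarded) {
        RitRaceSketch master = new RitRaceSketch();
        master.sendClose("region-1");
        // The timeout monitor decides here to force an unassign, before the CLOSED event arrives.
        master.handleClosed("region-1");
        if (guarded) {
            master.forceUnassignGuarded("region-1");
        } else {
            master.forceUnassignUnguarded("region-1");
        }
        return master.isStuck("region-1");
    }
}
```

In this model `replay(false)` leaves the region in transition with nobody left to close it, which mirrors the DeleteTableHandler wait in the logs, while `replay(true)` does not.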
[jira] [Assigned] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan reassigned HBASE-5120:
Assignee: ramkrishna.s.vasudevan
[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
Attachment: HBASE-5120_5.patch

Changed debug to error.
[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ramkrishna.s.vasudevan updated HBASE-5120:
Status: Open (was: Patch Available)
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184227#comment-13184227 ]

ramkrishna.s.vasudevan commented on HBASE-5179:

Patch looks good to me. I will try it out on the cluster tomorrow.

Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

Key: HBASE-5179
URL: https://issues.apache.org/jira/browse/HBASE-5179
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Attachments: 5179-v2.txt, hbase-5179.patch

If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. master completed splitLogAfterStartup()
2. RegionserverA restarts, and ServerShutdownHandler is processing.
3. master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. master starts to assign the regions of RegionserverA because it is a dead server by step 3.
However, while step 4 (assigning regions) is running, ServerShutdownHandler may still be splitting the log. Therefore, it may cause data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
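The ordering problem in the HBASE-5179 description above (regions of RegionserverA assigned while its log is still being split) can be illustrated with a small model. This is a sketch under assumptions: `FailoverAssignSketch` and all of its methods are hypothetical, and it shows one way to hold back assignment until log splitting finishes, not the approach taken in hbase-5179.patch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class FailoverAssignSketch {
    // Hypothetical bookkeeping: dead servers whose logs ServerShutdownHandler is still splitting.
    private final Set<String> serversBeingSplit = new HashSet<>();

    synchronized void shutdownHandlerStarted(String server) {
        serversBeingSplit.add(server);
    }

    synchronized void logSplitFinished(String server) {
        serversBeingSplit.remove(server);
    }

    /** Returns the regions failover may assign now: those whose dead server is not still being split. */
    synchronized List<String> regionsSafeToAssign(Map<String, String> regionToDeadServer) {
        List<String> safe = new ArrayList<>();
        for (Map.Entry<String, String> entry : regionToDeadServer.entrySet()) {
            if (!serversBeingSplit.contains(entry.getValue())) {
                safe.add(entry.getKey());
            }
        }
        return safe;
    }

    /** Replays steps 2 to 4 and reports the assignable regions before and after the split completes. */
    static String demo() {
        FailoverAssignSketch master = new FailoverAssignSketch();
        Map<String, String> regionToDeadServer = new TreeMap<>();
        regionToDeadServer.put("region-1", "RegionserverA");
        regionToDeadServer.put("region-2", "RegionserverB");

        master.shutdownHandlerStarted("RegionserverA"); // step 2
        List<String> during = master.regionsSafeToAssign(regionToDeadServer); // steps 3 and 4
        master.logSplitFinished("RegionserverA");
        List<String> after = master.regionsSafeToAssign(regionToDeadServer);
        return during + " then " + after;
    }
}
```

Here `demo()` holds back region-1 while RegionserverA's log split is in flight and releases it afterwards, which is the ordering the description says is missing.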
[jira] [Commented] (HBASE-5150) Fail in a thread may not fail a test, clean up log splitting test
[ https://issues.apache.org/jira/browse/HBASE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184234#comment-13184234 ]

Jimmy Xiang commented on HBASE-5150:

@Prakash and Ted, are you ok with this patch? I changed the 3sec wait time to 2sec.

Fail in a thread may not fail a test, clean up log splitting test

Key: HBASE-5150
URL: https://issues.apache.org/jira/browse/HBASE-5150
Project: HBase
Issue Type: Test
Affects Versions: 0.94.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Minor
Attachments: hbase-5150.txt, hbase_5150_v3.patch

This is to clean up some tests for HBASE-5081. The Assert.fail method in a separate thread will terminate the thread, but may not fail the test. We can use a Callable instead, so that the error surfaces when the result is fetched. Some documentation to explain the test will be helpful too.
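The point about Assert.fail in a separate thread can be shown with plain JDK classes. This is a minimal sketch, not code from the attached patch: `ThreadFailureSketch` and its method names are hypothetical, and a bare `AssertionError` stands in for what JUnit's Assert.fail throws.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ThreadFailureSketch {

    /** Fails a check inside a plain Thread and reports whether the caller noticed. */
    static boolean callerSeesFailureFromThread() {
        Thread worker = new Thread(() -> {
            throw new AssertionError("check failed in worker thread");
        });
        // Suppress the default stack trace; the failure still stays inside the worker.
        worker.setUncaughtExceptionHandler((thread, error) -> { });
        try {
            worker.start();
            worker.join();
            return false; // join() returns normally, so nothing reaches the caller
        } catch (AssertionError e) {
            return true; // not reached: the error never crosses threads
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Fails the same check inside a Callable; Future.get() hands the failure to the caller. */
    static boolean callerSeesFailureFromCallable() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Void> check = () -> {
                throw new AssertionError("check failed in worker thread");
            };
            Future<Void> result = pool.submit(check);
            result.get(); // rethrows the worker's failure wrapped in ExecutionException
            return false;
        } catch (ExecutionException e) {
            return e.getCause() instanceof AssertionError;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In a real test the `ExecutionException` cause would be rethrown (or the test would call `fail`) on the test thread, so the test runner records the failure.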
[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5179:
Attachment: 5179-90.txt

Chunhui's patch rebased for 0.90
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184239#comment-13184239 ]

Hadoop QA commented on HBASE-5179:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12510206/5179-v2.txt
against trunk revision .

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

-1 javadoc. The javadoc tool appears to have generated -147 warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 78 new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.master.TestSplitLogManager
org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
org.apache.hadoop.hbase.client.TestAdmin
org.apache.hadoop.hbase.mapred.TestTableMapReduce
org.apache.hadoop.hbase.mapreduce.TestImportTsv

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/730//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/730//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/730//console

This message is automatically generated.
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184240#comment-13184240 ]

Hadoop QA commented on HBASE-5179:

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12510215/5179-90.txt
against trunk revision .

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.

-1 patch. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/732//console

This message is automatically generated.
[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhihong Yu updated HBASE-5179:
Comment: was deleted

(was: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12510215/5179-90.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/732//console This message is automatically generated.)
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184244#comment-13184244 ]

Zhihong Yu commented on HBASE-5179:

I ran the following on MacBook and they passed:
{code}
mt -Dtest=TestSplitLogManager
mt -Dtest=TestAdmin#testShouldCloseTheRegionBasedOnTheEncodedRegionName
{code}
[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184251#comment-13184251 ]

Zhihong Yu commented on HBASE-5120:

+1 on patch v5.

Timeout monitor races with table disable handler

Key: HBASE-5120
URL: https://issues.apache.org/jira/browse/HBASE-5120
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Assignee: ramkrishna.s.vasudevan
Priority: Blocker
Fix For: 0.94.0, 0.92.1
Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch

Here is what J-D described here: https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176

I think I will retract from my statement that it used to be extremely racy and caused more troubles than it fixed, on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it.

First the region gets closed
{quote}
2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
{quote}

2 minutes later it times out:
{quote}
2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null
2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining)
{quote}

100ms later the master finally gets the event:
{quote}
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
{quote}

At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
{quote}
2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791
{quote}

Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on.

The late ZK notification that the znode was deleted (but it got recreated after):
{quote}
2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted.
{quote}

Then it prints this, and much later tries to unassign it again:
{quote}
2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null
...
2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition;
[jira] [Commented] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184258#comment-13184258 ]

Hadoop QA commented on HBASE-5120:

-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12510211/HBASE-5120_5.patch against trunk revision .

+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
-1 javadoc. The javadoc tool appears to have generated -147 warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 79 new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.hbase.mapreduce.TestImportTsv
org.apache.hadoop.hbase.mapred.TestTableMapReduce
org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/731//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/731//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/731//console

This message is automatically generated.
[jira] [Commented] (HBASE-5139) Compute (weighted) median using AggregateProtocol
[ https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184264#comment-13184264 ]

Zhihong Yu commented on HBASE-5139:

I am going to integrate patch v2 if there is no objection.

Compute (weighted) median using AggregateProtocol

Key: HBASE-5139
URL: https://issues.apache.org/jira/browse/HBASE-5139
Project: HBase
Issue Type: Sub-task
Reporter: Zhihong Yu
Assignee: Zhihong Yu
Attachments: 5139-v2.txt

Suppose cf:cq1 stores numeric values and optionally cf:cq2 stores weights. This task finds the median value among the values of cf:cq1 (see http://www.stat.ucl.ac.be/ISdidactique/Rhelp/library/R.basic/html/weighted.median.html).

This can be done in two passes.

The first pass utilizes AggregateProtocol, where the following tuple is returned from each region: (partial-sum-of-values, partial-sum-of-weights). The start rowkey (supplied by the coprocessor framework) would be used to sort the tuples. This way we can determine which region (called R) contains the (weighted) median. partial-sum-of-weights can be 0 if the unweighted median is sought.

The second pass involves scanning the table, beginning with the startrow of region R, and computing the partial (weighted) sum until the threshold of S/2 is crossed, where S is the total (weighted) sum over all regions. The (weighted) median is returned.

However, this approach wouldn't work if there is mutation in the underlying table between pass one and pass two. In that case, sequential scanning seems to be the solution, which is slower than the above approach.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
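The two-pass idea in the HBASE-5139 description can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in 5139-v2.txt: regions are modeled as in-memory lists already ordered by start rowkey, every row carries an explicit weight, rows are assumed to be stored in value order, and the names `WeightedMedianSketch`, `Row` and `weightedMedian` are hypothetical. No HBase or AggregateProtocol API is used.

```java
import java.util.List;

final class WeightedMedianSketch {
    /** One row: the numeric value (cf:cq1) and its weight (cf:cq2). Hypothetical type. */
    static final class Row {
        final double value;
        final double weight;
        Row(double value, double weight) { this.value = value; this.weight = weight; }
    }

    /** Regions are assumed to be listed in start-rowkey order, rows in value order. */
    static double weightedMedian(List<List<Row>> regions) {
        if (regions.isEmpty()) throw new IllegalArgumentException("no regions");

        // Pass one: one partial sum of weights per region, plus the total S.
        double[] partial = new double[regions.size()];
        double total = 0;
        for (int i = 0; i < partial.length; i++) {
            for (Row row : regions.get(i)) partial[i] += row.weight;
            total += partial[i];
        }
        double half = total / 2;

        // Locate region R: the first region where the running sum reaches S/2.
        int r = 0;
        double before = 0;
        while (r < partial.length - 1 && before + partial[r] < half) {
            before += partial[r];
            r++;
        }

        // Pass two: scan only region R, continuing the running sum.
        double running = before;
        for (Row row : regions.get(r)) {
            running += row.weight;
            if (running >= half) return row.value;
        }
        // Reached only if region R holds no qualifying row, e.g. the data changed between passes.
        throw new IllegalStateException("threshold S/2 was never reached");
    }
}
```

Because pass two trusts the sums computed in pass one, a row inserted or deleted in between can shift the threshold into a different region, which is the mutation caveat raised in the description.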
[jira] [Commented] (HBASE-4224) Need a flush by regionserver rather than by table option
[ https://issues.apache.org/jira/browse/HBASE-4224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184281#comment-13184281 ]

Harsh J commented on HBASE-4224:

[Dropping by from the dev lists…, have not followed otherwise]

I'd certainly like reading flushAllRegions() over flushRegions(null). Can we not also have it as a utility function in HRServer instead of HRI/f, if the interface changing is much to be worried about?

Need a flush by regionserver rather than by table option

Key: HBASE-4224
URL: https://issues.apache.org/jira/browse/HBASE-4224
Project: HBase
Issue Type: Bug
Components: shell
Reporter: stack
Assignee: Akash Ashok
Attachments: HBase-4224-v2.patch, HBase-4224.patch

This evening needed to clean out logs on the cluster. Logs are by regionserver. To let go of logs, we need to have all edits emptied from memory. The only flush is by table or region. We need to be able to flush the regionserver. Need to add this.
[jira] [Commented] (HBASE-4440) add an option to presplit table to PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184284#comment-13184284 ]

Sujee Maniyam commented on HBASE-4440:

So you are proposing that
1) whether we use presplit option or not, table has to be recreated for all write-mode tests. This changes the behavior for all write-tests. Currently the table is only created if it doesn't exist.
2) or pre-split should try to split the table without re-creating it.

add an option to presplit table to PerformanceEvaluation

Key: HBASE-4440
URL: https://issues.apache.org/jira/browse/HBASE-4440
Project: HBase
Issue Type: Improvement
Components: util
Reporter: Sujee Maniyam
Assignee: Sujee Maniyam
Priority: Minor
Labels: benchmark
Fix For: 0.94.0
Attachments: PerformanceEvaluation.java, PerformanceEvaluation_HBASE_4440.patch, PerformanceEvaluation_HBASE_4440_2.patch

PerformanceEvaluation is a quick way to 'benchmark' an HBase cluster. The current 'write*' operations do not pre-split the table. Pre-splitting the table will really boost the insert performance.

It would be nice to have an option to enable pre-splitting the table before the inserts begin. It would look something like:
(a) hbase ...PerformanceEvaluation --presplit=10 other options
(b) hbase ...PerformanceEvaluation --presplit other options

(b) will try to presplit the table on some default value (say number of region servers)
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184287#comment-13184287 ]

stack commented on HBASE-5179:

It's hard to do a test for this?

Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

Key: HBASE-5179
URL: https://issues.apache.org/jira/browse/HBASE-5179
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch

If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur:
1. master completed splitLogAfterStartup()
2. RegionserverA restarts, and ServerShutdownHandler is processing.
3. master starts to rebuildUserRegions, and RegionserverA is considered a dead server.
4. master starts to assign the regions of RegionserverA because it is a dead server by step 3.

However, while step 4 (assigning regions) is running, ServerShutdownHandler may still be splitting the log. Therefore, it may cause data loss.
[jira] [Commented] (HBASE-3565) Add metrics to keep track of slow HLog appends
[ https://issues.apache.org/jira/browse/HBASE-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184288#comment-13184288 ]

Zhihong Yu commented on HBASE-3565:

Integrated to TRUNK. Thanks for the patch Mubarak. Thanks for the review, Stack.

Add metrics to keep track of slow HLog appends

Key: HBASE-3565
URL: https://issues.apache.org/jira/browse/HBASE-3565
Project: HBase
Issue Type: Improvement
Components: metrics, regionserver
Reporter: Benoit Sigoure
Assignee: Mubarak Seyed
Labels: monitoring
Fix For: 0.94.0
Attachments: HBASE-3565.trunk.v1.patch

Whenever an edit takes too long to be written to an HLog, HBase logs a warning such as this one:
{code}
2011-02-23 20:03:14,703 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 21 on 60020 took 15065ms appending an edit to hlog; editcount=126050
{code}

I would like to have a counter incremented each time this happens and this counter exposed via the metrics stuff in HBase so I can collect it in my monitoring system.
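The counter requested in HBASE-3565 could look roughly like the sketch below. It is not the code from HBASE-3565.trunk.v1.patch and does not touch HLog or RegionServerMetrics: the class name `SlowAppendCounter`, its methods, and the 1000 ms threshold used in the usage note are all assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-alone counter; not part of HBase. */
final class SlowAppendCounter {
    private final long thresholdMs;
    // AtomicLong because many IPC handler threads may append concurrently.
    private final AtomicLong slowAppends = new AtomicLong();

    SlowAppendCounter(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    /** Records one append; returns true if it exceeded the threshold and was counted. */
    boolean record(long elapsedMs) {
        if (elapsedMs > thresholdMs) {
            slowAppends.incrementAndGet();
            return true;
        }
        return false;
    }

    /** Value a metrics system would poll. */
    long getSlowAppendCount() {
        return slowAppends.get();
    }
}
```

With an assumed threshold of 1000 ms, the 15065 ms append from the warning above would be counted, while a 12 ms append would not.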
[jira] [Updated] (HBASE-3565) Add metrics to keep track of slow HLog appends
[ https://issues.apache.org/jira/browse/HBASE-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-3565: -- Summary: Add metrics to keep track of slow HLog appends (was: Add a metric to keep track of slow HLog appends)
[jira] [Issue Comment Edited] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184286#comment-13184286 ]

Zhihong Yu edited comment on HBASE-5179 at 1/11/12 7:07 PM (comment author: stack):

I agree with the spirit of this class. Good stuff Chunhui.

This is an awkward name for a method, getDeadServersUnderProcessing. Should it be getDeadServers? Does it need to be a public method? Seems fine that it be package private.

Is serversWithoutSplitLog a good name for a local variable? Should it be deadServers with a comment saying that deadServers are processed by servershutdownhandler and it will be taking care of the log splitting?

Is this right -- for trunk?
{code}
- } else if (!serverManager.isServerOnline(regionLocation.getServerName())) {
+ } else if (!onlineServers.contains(regionLocation.getHostname())) {
{code}
Online servers is keyed by a ServerName, not a hostname.

What is a deadServersUnderProcessing? Does DeadServers keep a list of all servers that ever died? Is that a good idea? Shouldn't finish remove the item from deadservers rather than just from deadServersUnderProcessing?

Change the name of this method, cloneProcessingDeadServers. Just call it getDeadServers? That it's a clone is an internal implementation detail.
[jira] [Commented] (HBASE-4440) add an option to presplit table to PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184294#comment-13184294 ]

Jean-Daniel Cryans commented on HBASE-4440:

bq. whether we use presplit option or not, table has to be recreated for all write-mode tests.

No, it shouldn't be different from the default behavior of not recreating the table.

bq. or pre-split should try to split the table without re-creating it.

It should not. Code speaks more than words, here's what I'm using for testing 0.92:
{code}
private boolean checkTable(HBaseAdmin admin) throws IOException {
  HTableDescriptor tableDescriptor = getTableDescriptor();
  boolean tableExists = admin.tableExists(tableDescriptor.getName());
  // Only create (and optionally pre-split) the table when it does not exist yet.
  if (!tableExists) {
    if (this.presplitRegions > 0) {
      byte[][] splits = getSplits();
      for (int i = 0; i < splits.length; i++) {
        LOG.debug(" split " + i + ": " + Bytes.toStringBinary(splits[i]));
      }
      admin.createTable(tableDescriptor, splits);
      LOG.info("Table created with " + this.presplitRegions + " splits");
    } else {
      admin.createTable(tableDescriptor);
      LOG.info("Table " + tableDescriptor + " created");
    }
  }
  // True when this call created the table.
  return !tableExists;
}
{code}
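The thread does not show the body of getSplits(), so the following is only a guess at how evenly spaced split keys might be produced for a pre-split table. It assumes row keys are zero-padded 10-digit decimal strings and that the total row count is known up front; `PresplitKeysSketch` and `splitKeys` are hypothetical names, and nothing here is taken from the attached patches.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class PresplitKeysSketch {
    /**
     * Returns regions - 1 split keys, since N regions need N - 1 boundaries.
     * Keys are zero-padded 10-digit row numbers (an assumed row-key format).
     */
    static byte[][] splitKeys(int totalRows, int regions) {
        if (regions <= 1) return new byte[0][];
        if (regions > totalRows) {
            throw new IllegalArgumentException("more regions than rows");
        }
        int jump = totalRows / regions;
        byte[][] splits = new byte[regions - 1][];
        for (int i = 0; i < splits.length; i++) {
            // Start at jump, not 0, so the first region is not left empty.
            String key = String.format(Locale.ROOT, "%010d", jump * (i + 1));
            splits[i] = key.getBytes(StandardCharsets.UTF_8);
        }
        return splits;
    }
}
```

An array like this is what `admin.createTable(tableDescriptor, splits)` in the snippet above would receive; starting the first boundary at `jump` rather than at row 0 keeps every region inside the written key range.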
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184296#comment-13184296 ]

Zhihong Yu commented on HBASE-5179:

@Stack: The following code is for 0.90 branch:
{code}
- } else if (!serverManager.isServerOnline(regionLocation.getServerName())) {
+ } else if (!onlineServers.contains(regionLocation.getHostname())) {
{code}

I agree that serversWithoutSplitLog isn't a very good name. It holds both online servers and dead servers. How about naming it knownServers?

ServerManager.java already has:
{code}
public Set<ServerName> getDeadServers() {
  return this.deadservers.clone();
}
{code}
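One way to read the knownServers suggestion is sketched below. This is an interpretation, not the logic of any attached patch: server names are modeled as plain strings, the rule "the master skips regions whose last server is online or is already being handled by ServerShutdownHandler" is an assumption drawn from the issue description, and `FailoverAssignmentSketch` and its methods are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class FailoverAssignmentSketch {
    /** Union of online servers and dead servers still being processed. */
    static Set<String> knownServers(Set<String> online, Set<String> deadUnderProcessing) {
        Set<String> known = new HashSet<>(online);
        known.addAll(deadUnderProcessing);
        return known;
    }

    /**
     * During failover the master assigns a region itself only when its last
     * location is unknown; otherwise the region is either still served or is
     * left to ServerShutdownHandler, which splits the log before reassigning.
     */
    static boolean masterShouldAssign(String lastLocation, Set<String> known) {
        return lastLocation == null || !known.contains(lastLocation);
    }
}
```

Under this reading, a region last hosted on a server that is mid-shutdown-processing is not assigned by the failover path, so it cannot come online before its log has been split.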
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184303#comment-13184303 ] Zhihong Yu commented on HBASE-5179: --- TestRollingRestart fails in 0.90 with patch.
[jira] [Commented] (HBASE-4440) add an option to presplit table to PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184309#comment-13184309 ] Sujee Maniyam commented on HBASE-4440: -- I see. Looks good. If the table exists and the presplit option is supplied, it will have no effect. It might mislead the user into believing the pre-split option took effect, while in fact it didn't. Maybe a WARN would suffice to notify the user?
[jira] [Commented] (HBASE-4440) add an option to presplit table to PerformanceEvaluation
[ https://issues.apache.org/jira/browse/HBASE-4440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184311#comment-13184311 ] Jean-Daniel Cryans commented on HBASE-4440: --- We could show a WARN, but I don't think we would need more than that. In fact, we could always show a message when the table exists saying something like: Using the existing ${tablename} which has ${X} regions. About the pre-splitting itself, it seems that it creates N+1 regions and the first one has the end key 00 so it never gets data. Not a biggie, but could be fixed in another jira.
[jira] [Commented] (HBASE-5139) Compute (weighted) median using AggregateProtocol
[ https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184320#comment-13184320 ] Zhihong Yu commented on HBASE-5139: --- Integrated to TRUNK.
[jira] [Commented] (HBASE-3565) Add metrics to keep track of slow HLog appends
[ https://issues.apache.org/jira/browse/HBASE-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184325#comment-13184325 ] Hudson commented on HBASE-3565: --- Integrated in HBase-TRUNK #2619 (See [https://builds.apache.org/job/HBase-TRUNK/2619/]) HBASE-3565 Add metrics to keep track of slow HLog appends (Mubarak) tedyu : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/metrics/RegionServerMetrics.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/wal/HLog.java Add metrics to keep track of slow HLog appends -- Key: HBASE-3565 URL: https://issues.apache.org/jira/browse/HBASE-3565 Project: HBase Issue Type: Improvement Components: metrics, regionserver Reporter: Benoit Sigoure Assignee: Mubarak Seyed Labels: monitoring Fix For: 0.94.0 Attachments: HBASE-3565.trunk.v1.patch Whenever an edit takes too long to be written to an HLog, HBase logs a warning such as this one: {code} 2011-02-23 20:03:14,703 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 21 on 60020 took 15065ms appending an edit to hlog; editcount=126050 {code} I would like to have a counter incremented each time this happens and this counter exposed via the metrics stuff in HBase so I can collect it in my monitoring system. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
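A minimal sketch of the requested counter, independent of the HBase metrics classes: each append reports its elapsed time, and appends above a threshold bump a thread-safe counter that a metrics exporter could read. The class name, method names and the threshold value are assumptions, not what the committed patch uses.

```java
import java.util.concurrent.atomic.AtomicLong;

final class SlowAppendCounter {
    private final long thresholdMillis;
    private final AtomicLong slowAppends = new AtomicLong();

    SlowAppendCounter(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    /** Records one append; returns true if it was counted as slow. */
    boolean record(long elapsedMillis) {
        if (elapsedMillis > thresholdMillis) {
            slowAppends.incrementAndGet();
            return true;
        }
        return false;
    }

    /** Value a monitoring system would poll. */
    long getSlowAppendCount() {
        return slowAppends.get();
    }
}
```

An AtomicLong is used because several IPC handler threads may append concurrently. The 15065ms append from the log line above would count as slow under any threshold below that value.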
[jira] [Updated] (HBASE-3565) Add metrics to keep track of slow HLog appends
[ https://issues.apache.org/jira/browse/HBASE-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-3565: -- Resolution: Fixed Status: Resolved (was: Patch Available) Add metrics to keep track of slow HLog appends -- Key: HBASE-3565 URL: https://issues.apache.org/jira/browse/HBASE-3565 Project: HBase Issue Type: Improvement Components: metrics, regionserver Reporter: Benoit Sigoure Assignee: Mubarak Seyed Labels: monitoring Fix For: 0.94.0 Attachments: HBASE-3565.trunk.v1.patch
[jira] [Commented] (HBASE-5139) Compute (weighted) median using AggregateProtocol
[ https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184367#comment-13184367 ] Hudson commented on HBASE-5139: --- Integrated in HBase-TRUNK-security #73 (See [https://builds.apache.org/job/HBase-TRUNK-security/73/]) HBASE-5139 Compute (weighted) median using AggregateProtocol tedyu : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateProtocol.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/coprocessor/TestAggregateProtocol.java Compute (weighted) median using AggregateProtocol - Key: HBASE-5139 URL: https://issues.apache.org/jira/browse/HBASE-5139 Project: HBase Issue Type: Sub-task Reporter: Zhihong Yu Assignee: Zhihong Yu Attachments: 5139-v2.txt
[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA (The directory is already locked.)
[ https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184366#comment-13184366 ] Hudson commented on HBASE-5163: --- Integrated in HBase-TRUNK-security #73 (See [https://builds.apache.org/job/HBase-TRUNK-security/73/]) HBASE-5163 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA (The directory is already locked.) (N Keywal) tedyu : Files : * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/wal/TestLogRolling.java TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA (The directory is already locked.) -- Key: HBASE-5163 URL: https://issues.apache.org/jira/browse/HBASE-5163 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.94.0 Environment: all Reporter: nkeywal Assignee: nkeywal Priority: Minor Attachments: 5163.patch The stack is typically: {noformat} error message=Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked. type=java.io.IOExceptionjava.io.IOException: Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked. 
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:290)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
// ...
{noformat}
It can be reproduced without parallelization or without executing the other tests in the class. It seems to fail about 5% of the time. This comes from the naming policy for the directories in MiniDFSCluster#startDataNodes. It depends on the number of nodes *currently* in the cluster, and does not take into account previous starts/stops:
{noformat}
for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
  if (manageDfsDirs) {
    File dir1 = new File(data_dir, "data"+(2*i+1));
    File dir2 = new File(data_dir, "data"+(2*i+2));
    dir1.mkdirs();
    dir2.mkdirs();
    // [...]
{noformat}
This means that if we want to stop/start a datanode, we should always stop the last one; otherwise the names will conflict.
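To make the conflict concrete, here is a small stand-alone sketch that only reproduces the naming arithmetic of the loop quoted above. It does not touch MiniDFSCluster; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class DataDirNaming {
    /** Directory names the quoted loop would pick for numDataNodes newly started nodes. */
    static List<String> dirsForNewNodes(int curDatanodesNum, int numDataNodes) {
        List<String> dirs = new ArrayList<>();
        for (int i = curDatanodesNum; i < curDatanodesNum + numDataNodes; i++) {
            dirs.add("data" + (2 * i + 1));
            dirs.add("data" + (2 * i + 2));
        }
        return dirs;
    }
}
```

Starting two nodes from an empty cluster gives data1/data2 to node 0 and data3/data4 to node 1. If node 0 is then stopped, one node remains, so the next start computes data3/data4 again, which node 1 still has locked. That matches the data3 path in the stack trace.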
This test exhibits the behavior:
{noformat}
@Test
public void testMiniDFSCluster_startDataNode() throws Exception {
  assertTrue(dfsCluster.getDataNodes().size() == 2);

  // Works, as we kill the last datanode, we can now start a datanode
  dfsCluster.stopDataNode(1);
  dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);

  // Fails, as it's not the last datanode, the directory will conflict on
  // creation
  dfsCluster.stopDataNode(0);
  try {
    dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
    fail("There should be an exception because the directory already exists");
  } catch (IOException e) {
    assertTrue(e.getMessage().contains("The directory is already locked."));
    LOG.info("Expected (!) exception caught " + e.getMessage());
  }

  // Works, as we kill the last datanode, we can now restart 2 datanodes
  // This makes us back with 2 nodes
  dfsCluster.stopDataNode(0);
  dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
}
{noformat}
And then this behavior is randomly triggered in testLogRollOnDatanodeDeath because when we do
{noformat}
DatanodeInfo[] pipeline = getPipeline(log);
assertTrue(pipeline.length == fs.getDefaultReplication());
{noformat}
and then kill the datanodes in the pipeline, we will have: - most of the time: pipeline = 1 2, so
[jira] [Commented] (HBASE-5136) Redundant MonitoredTask instances in case of distributed log splitting retry
[ https://issues.apache.org/jira/browse/HBASE-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184377#comment-13184377 ] Zhihong Yu commented on HBASE-5136: --- Can someone review the patch? Thanks. Redundant MonitoredTask instances in case of distributed log splitting retry Key: HBASE-5136 URL: https://issues.apache.org/jira/browse/HBASE-5136 Project: HBase Issue Type: Task Reporter: Zhihong Yu Assignee: Zhihong Yu Attachments: 5136.txt In case of a log splitting retry, the following code would be executed multiple times:
{code}
public long splitLogDistributed(final List<Path> logDirs) throws IOException {
  MonitoredTask status = TaskMonitor.get().createStatus(
      "Doing distributed log split in " + logDirs);
{code}
leading to multiple MonitoredTask instances. Users may get confused by multiple distributed log splitting entries for the same region server on the master UI.
[jira] [Commented] (HBASE-5139) Compute (weighted) median using AggregateProtocol
[ https://issues.apache.org/jira/browse/HBASE-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184393#comment-13184393 ] Hudson commented on HBASE-5139: --- Integrated in HBase-TRUNK #2620 (See [https://builds.apache.org/job/HBase-TRUNK/2620/]) HBASE-5139 Compute (weighted) median using AggregateProtocol tedyu : Files : * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/client/coprocessor/AggregationClient.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateImplementation.java * /hbase/trunk/src/main/java/org/apache/hadoop/hbase/coprocessor/AggregateProtocol.java * /hbase/trunk/src/test/java/org/apache/hadoop/hbase/coprocessor/TestAggregateProtocol.java Compute (weighted) median using AggregateProtocol - Key: HBASE-5139 URL: https://issues.apache.org/jira/browse/HBASE-5139 Project: HBase Issue Type: Sub-task Reporter: Zhihong Yu Assignee: Zhihong Yu Attachments: 5139-v2.txt
[jira] [Commented] (HBASE-5128) [uber hbck] Enable hbck to automatically repair table integrity problems as well as region consistency problems while online.
[ https://issues.apache.org/jira/browse/HBASE-5128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184400#comment-13184400 ] jirapos...@reviews.apache.org commented on HBASE-5128: -- --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/3435/#review4317 --- src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java https://reviews.apache.org/r/3435/#comment9714 Should be 'to end key'. src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java https://reviews.apache.org/r/3435/#comment9715 Should insert some text between newRegion and region. src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java https://reviews.apache.org/r/3435/#comment9716 This should be outside the for loop. src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java https://reviews.apache.org/r/3435/#comment9717 Space between and 0. - Ted On 2012-01-11 12:46:37, jmhsieh wrote: bq. bq. --- bq. This is an automatically generated e-mail. To reply, visit: bq. https://reviews.apache.org/r/3435/ bq. --- bq. bq. (Updated 2012-01-11 12:46:37) bq. bq. bq. Review request for hbase, Todd Lipcon, Ted Yu, Michael Stack, and Jean-Daniel Cryans. bq. bq. bq. Summary bq. --- bq. bq. I'm posting a preliminary version that I'm currently testing on real clusters. The tests are flakey on the 0.90 branch (so there is something async that I didn't synchronize properly), and there are a few more TODO's I want to knock out before this is ready for full review to be considered for committing. It's got some problems I need some advice figuring out. bq. bq. Problem 1: bq. bq. In the unit tests, I have a few cases where I fabricate new regions and try to force the overlapping regions to be closed. For some of these, I cannot delete a table after it is repaired without causing subsequent tests to fail. I think this is due to a few things: bq. bq. 
1) The disable table handler uses in-memory assignment manager state while delete uses the assignment information in META. bq. 2) Currently I'm using the sneaky closeRegion that purposely doesn't go through the master and in turn doesn't modify in-memory state, so disable uses out-of-date in-memory region assignments. If I use the unassign method, it sends RIT transitions to the master, which ends up attempting to assign the region again, causing timing/transient states. bq. bq. What is a good way to clear the HMaster's assignment manager's assignment data for particular regions or to force it to re-read from META? (without modifying the 0.90 HBase's it is meant to repair). bq. bq. Problem 2: bq. bq. Sometimes tests fail reporting HOLE_IN_REGION_CHAIN and SERVER_DOES_NOT_MATCH_META. This means the old and new regions are confused with each other and basically something is still happening asynchronously. I think this is because the new region is being assigned and is still transitioning. Sound about right? To make the unit test deterministic, should hbck wait for these to settle or should just the unit test wait? bq. bq. bq. This addresses bug HBASE-5128. bq. https://issues.apache.org/jira/browse/HBASE-5128 bq. bq. bq. Diffs bq. - bq. bq. src/main/java/org/apache/hadoop/hbase/util/HBaseFsck.java 6d3401d bq. src/main/java/org/apache/hadoop/hbase/util/HBaseFsckRepair.java a3d8b8b bq. src/main/java/org/apache/hadoop/hbase/util/hbck/OfflineMetaRepair.java 29e8bb2 bq. src/main/java/org/apache/hadoop/hbase/util/hbck/TableIntegrityErrorHandler.java PRE-CREATION bq. src/test/java/org/apache/hadoop/hbase/util/TestHBaseFsck.java a640d57 bq. src/test/java/org/apache/hadoop/hbase/util/hbck/HbckTestingUtil.java dbb97f8 bq. src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildBase.java 3e8729d bq. src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildHole.java 11a1151 bq. src/test/java/org/apache/hadoop/hbase/util/hbck/TestOfflineMetaRebuildOverlap.java 4a09ce2 bq. bq. 
Diff: https://reviews.apache.org/r/3435/diff bq. bq. bq. Testing bq. --- bq. bq. All unit tests pass sometimes. Some fail sometimes (generally the cases that fabricate new regions). bq. bq. Not ready for commit. bq. bq. bq. Thanks, bq. bq. jmhsieh bq. bq. [uber hbck] Enable hbck to automatically repair table integrity problems as well as region consistency problems while online. - Key: HBASE-5128 URL:
[jira] [Updated] (HBASE-5167) We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing that.
[ https://issues.apache.org/jira/browse/HBASE-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5167: - Resolution: Fixed Assignee: Harsh J Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) Committed trunk. Thanks Harsh. We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing that. -- Key: HBASE-5167 URL: https://issues.apache.org/jira/browse/HBASE-5167 Project: HBase Issue Type: Improvement Components: scripts Affects Versions: 0.92.0 Reporter: Harsh J Assignee: Harsh J Priority: Trivial Fix For: 0.94.0 Attachments: HBASE-5167.patch HBASE-4209 changed the behavior of the scripts such that we do not kill the daemons away anymore. We should have also changed the message shown in the logs.
[jira] [Commented] (HBASE-5168) Backport HBASE-5100 - Rollback of split could cause closed region to be opened again
[ https://issues.apache.org/jira/browse/HBASE-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184424#comment-13184424 ] stack commented on HBASE-5168: -- +1 Backport HBASE-5100 - Rollback of split could cause closed region to be opened again Key: HBASE-5168 URL: https://issues.apache.org/jira/browse/HBASE-5168 Project: HBase Issue Type: Bug Reporter: ramkrishna.s.vasudevan Attachments: HBASE-5100_0.90.patch Considering the importance of the defect merging it to 0.90.6
[jira] [Created] (HBASE-5180) [book] book.xml - fixed scanner example
[book] book.xml - fixed scanner example --- Key: HBASE-5180 URL: https://issues.apache.org/jira/browse/HBASE-5180 Project: HBase Issue Type: Bug Reporter: Doug Meil Assignee: Doug Meil Attachments: book_HBASE_5180.xml.patch book.xml - the scanner example wasn't closing the scanner! that's bad practice.
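The practice at issue can be shown without the HBase client on the classpath. The sketch below uses a hypothetical FakeScanner as a stand-in for a scanner that holds server-side resources. The point is only the try/finally shape, which guarantees close() even if iteration throws. It is not the text of the book patch.

```java
import java.io.Closeable;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class ScannerCloseSketch {
    /** Hypothetical stand-in for a scanner; remembers whether it was closed. */
    static final class FakeScanner implements Closeable, Iterable<String> {
        private final List<String> rows;
        private boolean closed = false;

        FakeScanner(String... rows) {
            this.rows = Arrays.asList(rows);
        }

        @Override
        public Iterator<String> iterator() {
            return rows.iterator();
        }

        @Override
        public void close() {
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Iterates all rows and always closes the scanner afterwards. */
    static int countRows(FakeScanner scanner) {
        int count = 0;
        try {
            for (String row : scanner) {
                count++; // process the row here
            }
        } finally {
            scanner.close(); // runs on normal completion and on exceptions
        }
        return count;
    }
}
```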
[jira] [Updated] (HBASE-5180) [book] book.xml - fixed scanner example
[ https://issues.apache.org/jira/browse/HBASE-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Meil updated HBASE-5180: - Status: Patch Available (was: Open) [book] book.xml - fixed scanner example --- Key: HBASE-5180 URL: https://issues.apache.org/jira/browse/HBASE-5180 Project: HBase Issue Type: Bug Reporter: Doug Meil Assignee: Doug Meil Attachments: book_HBASE_5180.xml.patch
[jira] [Updated] (HBASE-5180) [book] book.xml - fixed scanner example
[ https://issues.apache.org/jira/browse/HBASE-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Meil updated HBASE-5180: - Description: book.xml - the scanner example wasn't closing the ResultScanner! that's bad practice. (was: book.xml - the scanner example wasn't closing the scanner! that's bad practice.) [book] book.xml - fixed scanner example --- Key: HBASE-5180 URL: https://issues.apache.org/jira/browse/HBASE-5180 Project: HBase Issue Type: Bug Reporter: Doug Meil Assignee: Doug Meil Attachments: book_HBASE_5180.xml.patch
[jira] [Updated] (HBASE-5180) [book] book.xml - fixed scanner example
[ https://issues.apache.org/jira/browse/HBASE-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Meil updated HBASE-5180: - Resolution: Fixed Status: Resolved (was: Patch Available) [book] book.xml - fixed scanner example --- Key: HBASE-5180 URL: https://issues.apache.org/jira/browse/HBASE-5180 Project: HBase Issue Type: Bug Reporter: Doug Meil Assignee: Doug Meil Attachments: book_HBASE_5180.xml.patch
[jira] [Updated] (HBASE-5129) book is inconsistent regarding disabling - major compaction
[ https://issues.apache.org/jira/browse/HBASE-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Meil updated HBASE-5129: - Assignee: Doug Meil book is inconsistent regarding disabling - major compaction --- Key: HBASE-5129 URL: https://issues.apache.org/jira/browse/HBASE-5129 Project: HBase Issue Type: Bug Components: documentation Affects Versions: 0.90.1 Reporter: Mikael Sitruk Assignee: Doug Meil Priority: Minor It seems that the book has some inconsistencies regarding the way to disable major compactions According to the book in chapter 2.6.1.1. HBase Default Configuration hbase.hregion.majorcompaction - The time (in miliseconds) between 'major' compactions of all HStoreFiles in a region. Default: 1 day. Set to 0 to disable automated major compactions. Default: 8640 (http://hbase.apache.org/book.html#hbase_default_configurations) According to the book at chapter 2.8.2.8. Managed Compactions A common administrative technique is to manage major compactions manually, rather than letting HBase do it. By default, HConstants.MAJOR_COMPACTION_PERIOD is one day and major compactions may kick in when you least desire it - especially on a busy system. To turn off automatic major compactions set the value to Long.MAX_VALUE. According to the code org.apache.hadoop.hbase.regionserver.Store.java, 0 is the right answer. (affect all documentation from 0.90.1) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
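The difference between the two documented settings can be seen in a simplified sketch. This is not the code from Store.java. It is a hypothetical check that treats a period of 0 as "disabled", which is the behavior the report says the code implements.

```java
final class MajorCompactionPeriod {
    /** Period value that switches automated major compactions off. */
    static final long DISABLED = 0L;

    /** One day in milliseconds, the documented default period. */
    static final long ONE_DAY_MILLIS = 24L * 60 * 60 * 1000;

    /**
     * Hypothetical, simplified check: a major compaction is due once the oldest
     * store file is at least periodMillis old, unless the period is 0.
     */
    static boolean isMajorCompactionDue(long periodMillis, long oldestFileAgeMillis) {
        if (periodMillis == DISABLED) {
            return false; // 0 means "never schedule automatically"
        }
        return oldestFileAgeMillis >= periodMillis;
    }
}
```

In this sketch Long.MAX_VALUE would also never trigger in practice, but only because no file gets that old; 0 states the intent explicitly.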
[jira] [Updated] (HBASE-5129) book is inconsistent regarding disabling - major compaction
[ https://issues.apache.org/jira/browse/HBASE-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Meil updated HBASE-5129: - Attachment: configuration_HBASE_5129.xml.patch book is inconsistent regarding disabling - major compaction --- Key: HBASE-5129 URL: https://issues.apache.org/jira/browse/HBASE-5129 Project: HBase Issue Type: Bug Components: documentation Affects Versions: 0.90.1 Reporter: Mikael Sitruk Assignee: Doug Meil Priority: Minor Attachments: configuration_HBASE_5129.xml.patch
[jira] [Updated] (HBASE-5129) book is inconsistent regarding disabling - major compaction
[ https://issues.apache.org/jira/browse/HBASE-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Meil updated HBASE-5129: - Resolution: Fixed Status: Resolved (was: Patch Available) book is inconsistent regarding disabling - major compaction --- Key: HBASE-5129 URL: https://issues.apache.org/jira/browse/HBASE-5129 Project: HBase Issue Type: Bug Components: documentation Affects Versions: 0.90.1 Reporter: Mikael Sitruk Assignee: Doug Meil Priority: Minor Attachments: configuration_HBASE_5129.xml.patch
[jira] [Commented] (HBASE-5129) book is inconsistent regarding disabling - major compaction
[ https://issues.apache.org/jira/browse/HBASE-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184437#comment-13184437 ] Doug Meil commented on HBASE-5129: -- Thanks for the catch Mikael! book is inconsistent regarding disabling - major compaction --- Key: HBASE-5129 URL: https://issues.apache.org/jira/browse/HBASE-5129 Project: HBase Issue Type: Bug Components: documentation Affects Versions: 0.90.1 Reporter: Mikael Sitruk Assignee: Doug Meil Priority: Minor Attachments: configuration_HBASE_5129.xml.patch
[jira] [Updated] (HBASE-5129) book is inconsistent regarding disabling - major compaction
[ https://issues.apache.org/jira/browse/HBASE-5129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doug Meil updated HBASE-5129: - Status: Patch Available (was: Open) book is inconsistent regarding disabling - major compaction --- Key: HBASE-5129 URL: https://issues.apache.org/jira/browse/HBASE-5129 Project: HBase Issue Type: Bug Components: documentation Affects Versions: 0.90.1 Reporter: Mikael Sitruk Assignee: Doug Meil Priority: Minor Attachments: configuration_HBASE_5129.xml.patch
[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5179: -- Attachment: (was: 5179-90.txt) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: 5179-v2.txt, hbase-5179.patch If the master's failover processing and ServerShutdownHandler's processing happen concurrently, the following case may occur: 1. The master completes splitLogAfterStartup(). 2. RegionserverA restarts, and ServerShutdownHandler is processing it. 3. The master starts rebuildUserRegions, and RegionserverA is considered a dead server. 4. The master starts to assign the regions of RegionserverA because it is a dead server as of step 3. However, while step 4 (assigning regions) is running, ServerShutdownHandler may still be splitting the log. Therefore, it may cause data loss.
[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5179: -- Attachment: 5179-90.txt New patch for 0.90. Now TestRollingRestart passes. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch
[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5179: -- Comment: was deleted (was: TestRollingRestart fails in 0.90 with patch.) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184448#comment-13184448 ] Zhihong Yu commented on HBASE-5179: --- I think the reason Chunhui introduced a new Set for the dead servers being processed is that DeadServer is supposed to remember dead servers: {code} * Set of known dead servers. On znode expiration, servers are added here. {code} DeadServer.cleanPreviousInstance() is called by ServerManager.checkIsDead() when the server becomes live again. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch
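The idea of a separate Set for dead servers that are still being processed can be sketched as follows. This is only an illustration under assumptions: the class and method names are hypothetical and are not taken from the attached patches, and server names are plain strings here rather than HBase server objects.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker; names are illustrative and not taken from the HBASE-5179 patches. */
final class DeadServersInProgressSketch {
    // Servers whose shutdown handling (including log splitting) has not finished yet.
    private final Set<String> inProgress = ConcurrentHashMap.newKeySet();

    /** Called when shutdown handling for a server starts. */
    void startProcessing(String serverName) {
        inProgress.add(serverName);
    }

    /** Called once log splitting for the server has completed. */
    void finishProcessing(String serverName) {
        inProgress.remove(serverName);
    }

    /** Failover code would assign a dead server's regions only when this returns true. */
    boolean safeToAssignRegionsOf(String serverName) {
        return !inProgress.contains(serverName);
    }
}
```

Keeping this set apart from the long-lived set of known dead servers means an entry can be removed as soon as log splitting ends, without losing the record that the server was once dead.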
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184450#comment-13184450 ] Hadoop QA commented on HBASE-5179: -- -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12510261/5179-90.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/733//console This message is automatically generated. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch
[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5179: -- Comment: was deleted (was: -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12510261/5179-90.txt against trunk revision . +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/733//console This message is automatically generated.) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: 5179-90.txt, 5179-v2.txt, hbase-5179.patch
[jira] [Created] (HBASE-5182) TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly
TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly -- Key: HBASE-5182 URL: https://issues.apache.org/jira/browse/HBASE-5182 Project: HBase Issue Type: Bug Components: regionserver Reporter: Scott Chen Priority: Minor TBoundedThreadPoolServer does not take the configured threadKeepAliveTimeSec. It uses the default value instead.
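The kind of bug described in HBASE-5182 is usually fixed by reading the configured value and passing it to the executor instead of a constant. The sketch below shows that pattern with a plain Map standing in for the configuration object; the configuration key, the default of 60 seconds, and the pool sizes are all assumptions made for illustration and are not the real HBase names or values.

```java
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not the TBoundedThreadPoolServer implementation. */
final class KeepAliveConfigSketch {
    // Assumed key and default, chosen only for this example.
    static final String KEEP_ALIVE_KEY = "example.thrift.threadKeepAliveTimeSec";
    static final int DEFAULT_KEEP_ALIVE_SEC = 60;

    private KeepAliveConfigSketch() {}

    /** Builds a pool whose keep-alive time comes from the configuration when present. */
    static ThreadPoolExecutor buildPool(Map<String, String> conf) {
        String raw = conf.get(KEEP_ALIVE_KEY);
        int keepAliveSec = (raw == null) ? DEFAULT_KEEP_ALIVE_SEC : Integer.parseInt(raw);
        // Pass the resolved value on; the reported bug is that the default was always used.
        return new ThreadPoolExecutor(2, 4, keepAliveSec, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>());
    }
}
```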
[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5179: -- Attachment: 5179-v3.txt Patch v3 addresses Stack's comments. Some names are open to suggestion. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: 5179-90.txt, 5179-v2.txt, 5179-v3.txt, hbase-5179.patch
[jira] [Updated] (HBASE-5182) TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly
[ https://issues.apache.org/jira/browse/HBASE-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Scott Chen updated HBASE-5182: -- Attachment: hbase-5182.txt TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly -- Key: HBASE-5182 URL: https://issues.apache.org/jira/browse/HBASE-5182 Project: HBase Issue Type: Bug Components: regionserver Reporter: Scott Chen Priority: Minor Attachments: hbase-5182.txt
[jira] [Commented] (HBASE-5182) TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly
[ https://issues.apache.org/jira/browse/HBASE-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184474#comment-13184474 ] Zhihong Yu commented on HBASE-5182: --- +1 on patch. TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly -- Key: HBASE-5182 URL: https://issues.apache.org/jira/browse/HBASE-5182 Project: HBase Issue Type: Bug Components: regionserver Reporter: Scott Chen Priority: Minor Attachments: hbase-5182.txt
[jira] [Commented] (HBASE-5181) Improve error message when Master fail-over happens and ZK unassigned node contains stale znode(s)
[ https://issues.apache.org/jira/browse/HBASE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184472#comment-13184472 ] Zhihong Yu commented on HBASE-5181: --- Thanks for the suggestion, Mubarak. Do you want to attach a patch? Improve error message when Master fail-over happens and ZK unassigned node contains stale znode(s) -- Key: HBASE-5181 URL: https://issues.apache.org/jira/browse/HBASE-5181 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.92.0, 0.90.5 Reporter: Mubarak Seyed Priority: Minor Labels: noob When master fail-over happens, if there are a number of RITs under /hbase/unassigned and if there are stale znode(s) (encoded region names) under /hbase/unassigned, we get:
{code}
2011-12-30 10:27:35,623 INFO org.apache.hadoop.hbase.master.HMaster: Master startup proceeding: master failover
2011-12-30 10:27:36,002 INFO org.apache.hadoop.hbase.master.AssignmentManager: Failed-over master needs to process 1717 regions in transition
2011-12-30 10:27:36,004 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.lang.ArrayIndexOutOfBoundsException: -256
  at org.apache.hadoop.hbase.executor.RegionTransitionData.readFields(RegionTransitionData.java:148)
  at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:105)
  at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
  at org.apache.hadoop.hbase.executor.RegionTransitionData.fromBytes(RegionTransitionData.java:198)
  at org.apache.hadoop.hbase.zookeeper.ZKAssign.getData(ZKAssign.java:743)
  at org.apache.hadoop.hbase.master.AssignmentManager.processRegionInTransition(AssignmentManager.java:262)
  at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:223)
  at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:401)
  at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
{code}
and there is no clue on how to clean up the stale znode(s) from unassigned using zkCli.sh (del /hbase/unassigned/<bad region name>). It would be good if we included the bad region name in the IOException from RegionTransitionData.readFields().
{code}
@Override
public void readFields(DataInput in) throws IOException {
  // the event type byte
  eventType = EventType.values()[in.readShort()];
  // the timestamp
  stamp = in.readLong();
  // the encoded name of the region being transitioned
  regionName = Bytes.readByteArray(in);
  // remaining fields are optional so prefixed with boolean
  // the name of the regionserver sending the data
  if (in.readBoolean()) {
    byte [] versionedBytes = Bytes.readByteArray(in);
    this.origin = ServerName.parseVersionedServerName(versionedBytes);
  }
  if (in.readBoolean()) {
    this.payload = Bytes.readByteArray(in);
  }
}
{code}
If the code execution has survived until regionName is read, then we can include the regionName in the IOException, with an error message telling the operator to clean up the stale znode(s) under /hbase/unassigned.
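One way the suggestion in HBASE-5181 could look is sketched below. It is a variant rather than the proposal itself: since the stack trace shows the failure while the event type is being read, before regionName is available, this sketch assumes the caller passes in the znode name and the ordinal is range-checked before indexing. The enum, class, and method names are hypothetical stand-ins, not HBase code.

```java
import java.io.DataInput;
import java.io.IOException;

/** Hypothetical sketch; not the RegionTransitionData implementation. */
final class RegionTransitionReadSketch {
    /** Stand-in for the real EventType enum, reduced to three values for the example. */
    enum EventType { CLOSED, OPENING, OPENED }

    private RegionTransitionReadSketch() {}

    /**
     * Reads the event type and reports the offending znode when the data is not parseable,
     * instead of letting an ArrayIndexOutOfBoundsException escape.
     */
    static EventType readEventType(DataInput in, String znodeName) throws IOException {
        short ordinal = in.readShort();
        EventType[] values = EventType.values();
        if (ordinal < 0 || ordinal >= values.length) {
            throw new IOException("Unparseable data in znode /hbase/unassigned/" + znodeName
                    + " (event type ordinal " + ordinal + "); the znode may be stale");
        }
        return values[ordinal];
    }
}
```

With a message of this shape, the operator would know which node to inspect or delete with zkCli.sh.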
[jira] [Assigned] (HBASE-5182) TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly
[ https://issues.apache.org/jira/browse/HBASE-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu reassigned HBASE-5182: - Assignee: Scott Chen TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly -- Key: HBASE-5182 URL: https://issues.apache.org/jira/browse/HBASE-5182 Project: HBase Issue Type: Bug Components: regionserver Reporter: Scott Chen Assignee: Scott Chen Priority: Minor Fix For: 0.94.0 Attachments: hbase-5182.txt
[jira] [Updated] (HBASE-5182) TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly
[ https://issues.apache.org/jira/browse/HBASE-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhihong Yu updated HBASE-5182: -- Status: Patch Available (was: Open) TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly -- Key: HBASE-5182 URL: https://issues.apache.org/jira/browse/HBASE-5182 Project: HBase Issue Type: Bug Components: regionserver Reporter: Scott Chen Assignee: Scott Chen Priority: Minor Fix For: 0.94.0 Attachments: hbase-5182.txt
[jira] [Commented] (HBASE-5181) Improve error message when Master fail-over happens and ZK unassigned node contains stale znode(s)
[ https://issues.apache.org/jira/browse/HBASE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184478#comment-13184478 ] Mubarak Seyed commented on HBASE-5181: -- Working on corporate approval to contribute this patch. Thanks. Improve error message when Master fail-over happens and ZK unassigned node contains stale znode(s) -- Key: HBASE-5181 URL: https://issues.apache.org/jira/browse/HBASE-5181 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.92.0, 0.90.5 Reporter: Mubarak Seyed Priority: Minor Labels: noob
[jira] [Updated] (HBASE-5182) TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly
[ https://issues.apache.org/jira/browse/HBASE-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-5182: - Resolution: Fixed Status: Resolved (was: Patch Available) Committed to trunk. Thanks for the patch Scott. TBoundedThreadPoolServer threadKeepAliveTimeSec is not configured properly -- Key: HBASE-5182 URL: https://issues.apache.org/jira/browse/HBASE-5182 Project: HBase Issue Type: Bug Components: regionserver Reporter: Scott Chen Assignee: Scott Chen Priority: Minor Fix For: 0.94.0 Attachments: hbase-5182.txt
[jira] [Commented] (HBASE-5167) We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing that.
[ https://issues.apache.org/jira/browse/HBASE-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184482#comment-13184482 ] Hudson commented on HBASE-5167: --- Integrated in HBase-TRUNK #2621 (See [https://builds.apache.org/job/HBase-TRUNK/2621/]) HBASE-5167 We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing that. stack : Files : * /hbase/trunk/bin/hbase-daemon.sh We shouldn't be injecting 'Killing [daemon]' into logs, when we aren't doing that. -- Key: HBASE-5167 URL: https://issues.apache.org/jira/browse/HBASE-5167 Project: HBase Issue Type: Improvement Components: scripts Affects Versions: 0.92.0 Reporter: Harsh J Assignee: Harsh J Priority: Trivial Fix For: 0.94.0 Attachments: HBASE-5167.patch HBASE-4209 changed the behavior of the scripts so that we no longer kill the daemons. We should also have changed the message shown in the logs.