[jira] [Commented] (HBASE-5824) HRegion.incrementColumnValue is not used in trunk
[ https://issues.apache.org/jira/browse/HBASE-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258373#comment-13258373 ]

Jimmy Xiang commented on HBASE-5824:

Yes, this patch is for 0.96 only. RetriesExhaustedWithDetailsException applies to batch processing only; for a single action, an individual exception is used. Currently only Put is implicitly batched. Should I change single Put to use RetriesExhaustedWithDetailsException too?

HRegion.incrementColumnValue is not used in trunk
Key: HBASE-5824
URL: https://issues.apache.org/jira/browse/HBASE-5824
Project: HBase
Issue Type: Bug
Reporter: Elliott Clark
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: 5824-addendum-v2.txt, hbase-5824.patch, hbase-5824_v2.patch, hbase_5824.addendum

On 0.94, a call to client.HTable#incrementColumnValue will invoke HRegion#incrementColumnValue. On trunk, all calls to HTable#incrementColumnValue go to HRegion#increment. My guess is that HTable#incrementColumnValue and HTable#increment serialize to the same thing over the wire, so the remote HRegionServer no longer knows which HTable method was called. To reproduce, I checked out trunk, put a breakpoint in HRegion#incrementColumnValue, and ran TestFromClientSide. The breakpoint wasn't hit.
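A minimal sketch of the distinction under discussion, assuming the 0.9x-era HTable client API (table and column names are placeholders): a batched call reports failures as a RetriesExhaustedWithDetailsException carrying per-action detail, rather than a single merged cause.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchFailureExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");   // hypothetical table
    List<Row> actions = new ArrayList<Row>();
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("f1"), Bytes.toBytes("q1"), Bytes.toBytes("v1"));
    actions.add(put);
    try {
      table.batch(actions); // batched path: failures arrive with details
    } catch (RetriesExhaustedWithDetailsException e) {
      // One entry per failed action, not a single merged cause.
      for (int i = 0; i < e.getNumExceptions(); i++) {
        System.err.println("row " + Bytes.toString(e.getRow(i).getRow())
            + " failed on " + e.getHostnamePort(i) + ": " + e.getCause(i));
      }
    } finally {
      table.close();
    }
  }
}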
[jira] [Commented] (HBASE-5621) Convert admin protocol of HRegionInterface to PB
[ https://issues.apache.org/jira/browse/HBASE-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258379#comment-13258379 ]

Jimmy Xiang commented on HBASE-5621:

Looking into the failed unit tests.

Convert admin protocol of HRegionInterface to PB
Key: HBASE-5621
URL: https://issues.apache.org/jira/browse/HBASE-5621
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5621_v3.patch
[jira] [Commented] (HBASE-5824) HRegion.incrementColumnValue is not used in trunk
[ https://issues.apache.org/jira/browse/HBASE-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13258381#comment-13258381 ]

Jimmy Xiang commented on HBASE-5824:

@Ted, I filed HBASE-5845. Thanks for pointing out the issue. Good catch.

HRegion.incrementColumnValue is not used in trunk
Key: HBASE-5824
URL: https://issues.apache.org/jira/browse/HBASE-5824
Project: HBase
Issue Type: Bug
Reporter: Elliott Clark
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: 5824-addendum-v2.txt, hbase-5824.patch, hbase-5824_v2.patch, hbase_5824.addendum

On 0.94, a call to client.HTable#incrementColumnValue will invoke HRegion#incrementColumnValue. On trunk, all calls to HTable#incrementColumnValue go to HRegion#increment. My guess is that HTable#incrementColumnValue and HTable#increment serialize to the same thing over the wire, so the remote HRegionServer no longer knows which HTable method was called. To reproduce, I checked out trunk, put a breakpoint in HRegion#incrementColumnValue, and ran TestFromClientSide. The breakpoint wasn't hit.
[jira] [Commented] (HBASE-5824) HRegion.incrementColumnValue is not used in trunk
[ https://issues.apache.org/jira/browse/HBASE-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257865#comment-13257865 ]

Jimmy Xiang commented on HBASE-5824:

If autoFlush is not enabled, Puts are most likely batched. It is not very efficient to check whether a batch contains only one Put, which would duplicate some of the multi-put logic. So you could say the patch is not strictly for single Puts.

HRegion.incrementColumnValue is not used in trunk
Key: HBASE-5824
URL: https://issues.apache.org/jira/browse/HBASE-5824
Project: HBase
Issue Type: Bug
Reporter: Elliott Clark
Assignee: Jimmy Xiang
Attachments: hbase-5824.patch, hbase-5824_v2.patch

On 0.94, a call to client.HTable#incrementColumnValue will invoke HRegion#incrementColumnValue. On trunk, all calls to HTable#incrementColumnValue go to HRegion#increment. My guess is that HTable#incrementColumnValue and HTable#increment serialize to the same thing over the wire, so the remote HRegionServer no longer knows which HTable method was called. To reproduce, I checked out trunk, put a breakpoint in HRegion#incrementColumnValue, and ran TestFromClientSide. The breakpoint wasn't hit.
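A short sketch of the implicit batching referred to above, assuming the 0.9x-era HTable API (table name and buffer size are placeholders): with autoFlush off, single Puts accumulate in the client write buffer and are shipped through the multi-put path instead of one RPC per Put.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedPutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");      // hypothetical table
    table.setAutoFlush(false);                  // buffer Puts client-side
    table.setWriteBufferSize(2 * 1024 * 1024);  // flush after ~2MB buffered
    for (int i = 0; i < 1000; i++) {
      Put put = new Put(Bytes.toBytes("row" + i));
      put.add(Bytes.toBytes("f1"), Bytes.toBytes("q1"), Bytes.toBytes("v" + i));
      table.put(put);                           // no RPC yet; goes to the buffer
    }
    table.flushCommits();                       // ships buffered Puts as a batch
    table.close();
  }
}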
[jira] [Commented] (HBASE-5824) HRegion.incrementColumnValue is not used in trunk
[ https://issues.apache.org/jira/browse/HBASE-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257904#comment-13257904 ]

Jimmy Xiang commented on HBASE-5824:

I am looking into it.

HRegion.incrementColumnValue is not used in trunk
Key: HBASE-5824
URL: https://issues.apache.org/jira/browse/HBASE-5824
Project: HBase
Issue Type: Bug
Reporter: Elliott Clark
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5824.patch, hbase-5824_v2.patch

On 0.94, a call to client.HTable#incrementColumnValue will invoke HRegion#incrementColumnValue. On trunk, all calls to HTable#incrementColumnValue go to HRegion#increment. My guess is that HTable#incrementColumnValue and HTable#increment serialize to the same thing over the wire, so the remote HRegionServer no longer knows which HTable method was called. To reproduce, I checked out trunk, put a breakpoint in HRegion#incrementColumnValue, and ran TestFromClientSide. The breakpoint wasn't hit.
[jira] [Commented] (HBASE-5824) HRegion.incrementColumnValue is not used in trunk
[ https://issues.apache.org/jira/browse/HBASE-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257083#comment-13257083 ]

Jimmy Xiang commented on HBASE-5824:

I will add a unit test for this and fix it.

HRegion.incrementColumnValue is not used in trunk
Key: HBASE-5824
URL: https://issues.apache.org/jira/browse/HBASE-5824
Project: HBase
Issue Type: Bug
Reporter: Elliott Clark
Assignee: Jimmy Xiang

On 0.94, a call to client.HTable#incrementColumnValue will invoke HRegion#incrementColumnValue. On trunk, all calls to HTable#incrementColumnValue go to HRegion#increment. My guess is that HTable#incrementColumnValue and HTable#increment serialize to the same thing over the wire, so the remote HRegionServer no longer knows which HTable method was called. To reproduce, I checked out trunk, put a breakpoint in HRegion#incrementColumnValue, and ran TestFromClientSide. The breakpoint wasn't hit.
[jira] [Commented] (HBASE-5824) HRegion.incrementColumnValue is not used in trunk
[ https://issues.apache.org/jira/browse/HBASE-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257143#comment-13257143 ]

Jimmy Xiang commented on HBASE-5824:

I looked into it, and it seems it is not a bug. HRegion#incrementColumnValue is a redundant method; HRegion#increment can do the same thing. That's why I used HRegion#increment. Is there anything wrong with that? As for the single Puts, the reason is that the client side tries to use batch processing. This behaves the same as before. Of course, we can enhance it; I will do that in HBASE-5621.

HRegion.incrementColumnValue is not used in trunk
Key: HBASE-5824
URL: https://issues.apache.org/jira/browse/HBASE-5824
Project: HBase
Issue Type: Bug
Reporter: Elliott Clark
Assignee: Jimmy Xiang

On 0.94, a call to client.HTable#incrementColumnValue will invoke HRegion#incrementColumnValue. On trunk, all calls to HTable#incrementColumnValue go to HRegion#increment. My guess is that HTable#incrementColumnValue and HTable#increment serialize to the same thing over the wire, so the remote HRegionServer no longer knows which HTable method was called. To reproduce, I checked out trunk, put a breakpoint in HRegion#incrementColumnValue, and ran TestFromClientSide. The breakpoint wasn't hit.
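To make the equivalence concrete, a sketch of the two client calls (row, family, and qualifier names are placeholders); per this discussion, on trunk both end up being served by HRegion#increment on the server side.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class IncrementPaths {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");   // hypothetical table
    byte[] row = Bytes.toBytes("row1");
    byte[] fam = Bytes.toBytes("f1");
    byte[] qual = Bytes.toBytes("q1");

    // Convenience form: on trunk this is served by HRegion#increment.
    long v = table.incrementColumnValue(row, fam, qual, 1);

    // General form: same server-side path, and can carry several columns.
    Increment incr = new Increment(row);
    incr.addColumn(fam, qual, 1);
    Result r = table.increment(incr);

    System.out.println("counter now " + v + ", increment result " + r);
    table.close();
  }
}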
[jira] [Commented] (HBASE-5620) Convert the client protocol of HRegionInterface to PB
[ https://issues.apache.org/jira/browse/HBASE-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255704#comment-13255704 ]

Jimmy Xiang commented on HBASE-5620:

@Stack, not every invocation will throw an exception. When one is thrown, it should be a ServiceException for PB; it used to be an IOException. Without the change, PB calls won't get a ServiceException when something goes wrong; they get an undeclared exception whose cause is an IOE, and the upper layer doesn't know how to handle it. The Set in Invocation is used to decide whether a protocol is a PB one, so ServiceException should be used. I put it there because it is used by both WritableRpcEngine and SecureRpcEngine.

Convert the client protocol of HRegionInterface to PB
Key: HBASE-5620
URL: https://issues.apache.org/jira/browse/HBASE-5620
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5620-sec.patch, hbase-5620_v3.patch, hbase-5620_v4.patch, hbase-5620_v4.patch
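A minimal sketch of the wrapping described above, with hypothetical method names (get and fetchFromRegion are illustrative, not HBase APIs): PB-generated stubs declare only com.google.protobuf.ServiceException, so a server-side IOException must be wrapped, and the caller recovers the original cause via getCause().

import java.io.IOException;

import com.google.protobuf.ServiceException;

public class PbExceptionWrapping {
  // Hypothetical server-side handler: PB service methods may only throw
  // ServiceException, so any IOException must be wrapped.
  static String get(String row) throws ServiceException {
    try {
      return fetchFromRegion(row);
    } catch (IOException ioe) {
      throw new ServiceException(ioe); // client unwraps via getCause()
    }
  }

  static String fetchFromRegion(String row) throws IOException {
    throw new IOException("region not online: " + row); // stand-in failure
  }

  public static void main(String[] args) {
    try {
      get("row1");
    } catch (ServiceException se) {
      IOException cause = (IOException) se.getCause();
      System.out.println("recovered original cause: " + cause.getMessage());
    }
  }
}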
[jira] [Commented] (HBASE-5620) Convert the client protocol of HRegionInterface to PB
[ https://issues.apache.org/jira/browse/HBASE-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254979#comment-13254979 ]

Jimmy Xiang commented on HBASE-5620:

I did some testing with YCSB (mostly inserts). The patch gave me better performance, which was a surprise to me. I will do some read-only testing with YCSB too.

Convert the client protocol of HRegionInterface to PB
Key: HBASE-5620
URL: https://issues.apache.org/jira/browse/HBASE-5620
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5620-sec.patch, hbase-5620_v3.patch, hbase-5620_v4.patch, hbase-5620_v4.patch
[jira] [Commented] (HBASE-5620) Convert the client protocol of HRegionInterface to PB
[ https://issues.apache.org/jira/browse/HBASE-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254360#comment-13254360 ]

Jimmy Xiang commented on HBASE-5620:

Thanks for reviewing. Both the regular test suite and the security test suite are green for me, and I mean all tests in the suite.

Convert the client protocol of HRegionInterface to PB
Key: HBASE-5620
URL: https://issues.apache.org/jira/browse/HBASE-5620
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5620-sec.patch, hbase-5620_v3.patch, hbase-5620_v4.patch, hbase-5620_v4.patch
[jira] [Commented] (HBASE-5620) Convert the client protocol of HRegionInterface to PB
[ https://issues.apache.org/jira/browse/HBASE-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254153#comment-13254153 ]

Jimmy Xiang commented on HBASE-5620:

@Stack, thanks. @Ted, I am looking into it now.

Convert the client protocol of HRegionInterface to PB
Key: HBASE-5620
URL: https://issues.apache.org/jira/browse/HBASE-5620
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5620_v3.patch, hbase-5620_v4.patch, hbase-5620_v4.patch
[jira] [Commented] (HBASE-5620) Convert the client protocol of HRegionInterface to PB
[ https://issues.apache.org/jira/browse/HBASE-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13254216#comment-13254216 ]

Jimmy Xiang commented on HBASE-5620:

TestForceCacheImportantBlocks is green for me.

Convert the client protocol of HRegionInterface to PB
Key: HBASE-5620
URL: https://issues.apache.org/jira/browse/HBASE-5620
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5620-sec.patch, hbase-5620_v3.patch, hbase-5620_v4.patch, hbase-5620_v4.patch
[jira] [Commented] (HBASE-5620) Convert the client protocol of HRegionInterface to PB
[ https://issues.apache.org/jira/browse/HBASE-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253377#comment-13253377 ]

Jimmy Xiang commented on HBASE-5620:

I will take a look at the test failures.

Convert the client protocol of HRegionInterface to PB
Key: HBASE-5620
URL: https://issues.apache.org/jira/browse/HBASE-5620
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5620_v3.patch, hbase-5620_v4.patch, hbase-5620_v4.patch
[jira] [Commented] (HBASE-5620) Convert the client protocol of HRegionInterface to PB
[ https://issues.apache.org/jira/browse/HBASE-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253538#comment-13253538 ]

Jimmy Xiang commented on HBASE-5620:

TestWALPlayer passed for me. Maybe I didn't have the latest from trunk?

Convert the client protocol of HRegionInterface to PB
Key: HBASE-5620
URL: https://issues.apache.org/jira/browse/HBASE-5620
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5620_v3.patch, hbase-5620_v4.patch, hbase-5620_v4.patch
[jira] [Commented] (HBASE-5620) Convert the client protocol of HRegionInterface to PB
[ https://issues.apache.org/jira/browse/HBASE-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253799#comment-13253799 ]

Jimmy Xiang commented on HBASE-5620:

@Stack, I will move ClientProtocol.java and AdminProtocol.java to the top level in HBASE-5621 since they are common. I added HBASE-5785 to track the unit test issue. @Ted, can I check the licenses without doing a release build?

Convert the client protocol of HRegionInterface to PB
Key: HBASE-5620
URL: https://issues.apache.org/jira/browse/HBASE-5620
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5620_v3.patch, hbase-5620_v4.patch, hbase-5620_v4.patch
[jira] [Commented] (HBASE-5620) Convert the client protocol of HRegionInterface to PB
[ https://issues.apache.org/jira/browse/HBASE-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253840#comment-13253840 ]

Jimmy Xiang commented on HBASE-5620:

I ran the Apache Rat check (mvn apache-rat:check), and it is OK.

Convert the client protocol of HRegionInterface to PB
Key: HBASE-5620
URL: https://issues.apache.org/jira/browse/HBASE-5620
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5620_v3.patch, hbase-5620_v4.patch, hbase-5620_v4.patch
[jira] [Commented] (HBASE-5620) Convert the client protocol of HRegionInterface to PB
[ https://issues.apache.org/jira/browse/HBASE-5620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13253852#comment-13253852 ]

Jimmy Xiang commented on HBASE-5620:

@Stack, thanks a lot! I moved them to the top level in HBASE-5621 and posted a review request. Could you please review? I am also OK with moving them to the client package.

Convert the client protocol of HRegionInterface to PB
Key: HBASE-5620
URL: https://issues.apache.org/jira/browse/HBASE-5620
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5620_v3.patch, hbase-5620_v4.patch, hbase-5620_v4.patch
[jira] [Commented] (HBASE-5777) MiniHBaseCluster cannot start multiple region servers
[ https://issues.apache.org/jira/browse/HBASE-5777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13252899#comment-13252899 ]

Jimmy Xiang commented on HBASE-5777:

I see. When I run unit tests in Eclipse, the hbase-site.xml at src/test is not used. Maybe we can disable the UI in MiniHBaseCluster too; how about that?

MiniHBaseCluster cannot start multiple region servers
Key: HBASE-5777
URL: https://issues.apache.org/jira/browse/HBASE-5777
Project: HBase
Issue Type: Test
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Attachments: hbase-5777.patch

MiniHBaseCluster can try to start multiple region servers, but all of them except one will die while putting up the web UI with a BindException, since HConstants.REGIONSERVER_INFO_PORT_AUTO is set to false by default. This issue makes many unit tests that depend on multiple region servers flaky, such as TestAdmin.
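A sketch of how a test could avoid the BindException described here, assuming the MiniHBaseCluster(Configuration, int) constructor; whether setting the info port to -1 disables the UI entirely is my assumption, not part of the patch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.MiniHBaseCluster;

public class MiniClusterPorts {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Let each region server pick a free info port instead of fighting
    // over the default one, so multiple RS can coexist in one JVM.
    conf.setBoolean(HConstants.REGIONSERVER_INFO_PORT_AUTO, true);
    // Alternative (assumption): disable the info server entirely.
    // conf.setInt("hbase.regionserver.info.port", -1);
    MiniHBaseCluster cluster = new MiniHBaseCluster(conf, 3); // 3 region servers
    // ... run tests against the cluster ...
    cluster.shutdown();
  }
}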
[jira] [Commented] (HBASE-5740) Compaction interruption may be due to balancing
[ https://issues.apache.org/jira/browse/HBASE-5740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249951#comment-13249951 ]

Jimmy Xiang commented on HBASE-5740:

@JD, any comments on the second patch?

Compaction interruption may be due to balancing
Key: HBASE-5740
URL: https://issues.apache.org/jira/browse/HBASE-5740
Project: HBase
Issue Type: Bug
Affects Versions: 0.96.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Trivial
Fix For: 0.96.0
Attachments: hbase-5740.patch, hbase-5740_v2.patch

Currently, the log shows "Aborting compaction of store LOG in region because user requested stop", but it is actually because of balancing. There is currently no way to figure out who closed the region, so it is better to change the message to say it is because of either the user or balancing.
[jira] [Commented] (HBASE-5740) Compaction interruption may be due to balancing
[ https://issues.apache.org/jira/browse/HBASE-5740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13250267#comment-13250267 ]

Jimmy Xiang commented on HBASE-5740:

@Stack, I am fine with the generic message. Please make the change on commit. Thanks a lot. We don't know for sure who interrupted it anyway, for now.

Compaction interruption may be due to balancing
Key: HBASE-5740
URL: https://issues.apache.org/jira/browse/HBASE-5740
Project: HBase
Issue Type: Bug
Affects Versions: 0.96.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Trivial
Fix For: 0.96.0
Attachments: hbase-5740.patch, hbase-5740_v2.patch

Currently, the log shows "Aborting compaction of store LOG in region because user requested stop", but it is actually because of balancing. There is currently no way to figure out who closed the region, so it is better to change the message to say it is because of either the user or balancing.
[jira] [Commented] (HBASE-5734) Change hbck sideline root
[ https://issues.apache.org/jira/browse/HBASE-5734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248511#comment-13248511 ]

Jimmy Xiang commented on HBASE-5734:

It would be nice to expose it as an argument. However, it doesn't offer much value since we don't expect hbck to be run all the time. Users can rename it afterwards if they really want, and we already have lots of arguments.

Change hbck sideline root
Key: HBASE-5734
URL: https://issues.apache.org/jira/browse/HBASE-5734
Project: HBase
Issue Type: Improvement
Components: hbck
Affects Versions: 0.94.0, 0.96.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Trivial
Fix For: 0.96.0
Attachments: hbase-5734.patch

Currently the hbck sideline root is the filesystem root, which can run into permission issues. We can change it to /hbck.
[jira] [Commented] (HBASE-5740) Compaction interruption may be due to balancing
[ https://issues.apache.org/jira/browse/HBASE-5740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249045#comment-13249045 ]

Jimmy Xiang commented on HBASE-5740:

Added a new patch that does not say who triggered it, since we don't know for now.

Compaction interruption may be due to balancing
Key: HBASE-5740
URL: https://issues.apache.org/jira/browse/HBASE-5740
Project: HBase
Issue Type: Bug
Affects Versions: 0.96.0
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Priority: Trivial
Fix For: 0.96.0
Attachments: hbase-5740.patch, hbase-5740_v2.patch

Currently, the log shows "Aborting compaction of store LOG in region because user requested stop", but it is actually because of balancing. There is currently no way to figure out who closed the region, so it is better to change the message to say it is because of either the user or balancing.
[jira] [Commented] (HBASE-5606) SplitLogManager async delete node hangs log splitting when ZK connection is lost
[ https://issues.apache.org/jira/browse/HBASE-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245763#comment-13245763 ]

Jimmy Xiang commented on HBASE-5606:

It is OK with me. Hopefully there is no other such place.

SplitLogManager async delete node hangs log splitting when ZK connection is lost
Key: HBASE-5606
URL: https://issues.apache.org/jira/browse/HBASE-5606
Project: HBase
Issue Type: Bug
Components: wal
Affects Versions: 0.92.0
Reporter: Gopinathan A
Assignee: Prakash Khemani
Priority: Critical
Fix For: 0.92.2
Attachments: 0001-HBASE-5606-SplitLogManger-async-delete-node-hangs-lo.patch, 0001-HBASE-5606-SplitLogManger-async-delete-node-hangs-lo.patch

1. One RS died; the ServerShutdownHandler found this and started distributed log splitting.
2. All tasks failed because the ZK connection was lost, so all the tasks were deleted asynchronously.
3. The ServerShutdownHandler retried the log splitting.
4. The asynchronous deletion from step 2 finally happened, removing the new task.
5. This left the SplitLogManager in a hanging state, so the .META. region was not assigned for a long time.

{noformat}
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(55413,79):2012-03-14 19:28:47,932 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(89303,79):2012-03-14 19:34:32,387 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
{noformat}

{noformat}
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(80417,99):2012-03-14 19:34:31,196 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(89456,99):2012-03-14 19:34:32,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
{noformat}
[jira] [Commented] (HBASE-5443) Add PB-based calls to HRegionInterface
[ https://issues.apache.org/jira/browse/HBASE-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243954#comment-13243954 ]

Jimmy Xiang commented on HBASE-5443:

The main reason is that the HBase Writable RPC already supports PB. Hadoop uses PB too.

Add PB-based calls to HRegionInterface
Key: HBASE-5443
URL: https://issues.apache.org/jira/browse/HBASE-5443
Project: HBase
Issue Type: Task
Components: ipc, master, migration, regionserver
Reporter: Todd Lipcon
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: region_java-proto-mapping.pdf
[jira] [Commented] (HBASE-5619) Create PB protocols for HRegionInterface
[ https://issues.apache.org/jira/browse/HBASE-5619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242720#comment-13242720 ]

Jimmy Xiang commented on HBASE-5619:

@Stack, thanks!

Create PB protocols for HRegionInterface
Key: HBASE-5619
URL: https://issues.apache.org/jira/browse/HBASE-5619
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: 5619v6.txt, 5619v6.txt, hbase-5619.patch, hbase-5619_v3.patch, hbase-5619_v4.patch, hbase-5619_v5.patch

Subtask of HBASE-5443: separate HRegionInterface into an admin protocol and a client protocol, and create the PB protocol buffer files.
[jira] [Commented] (HBASE-5619) Create PB protocols for HRegionInterface
[ https://issues.apache.org/jira/browse/HBASE-5619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241374#comment-13241374 ]

Jimmy Xiang commented on HBASE-5619:

@Stack, could you please commit this patch? I do have some changes to the PB files, but I'd like to address them in HBASE-5620. Thanks.

Create PB protocols for HRegionInterface
Key: HBASE-5619
URL: https://issues.apache.org/jira/browse/HBASE-5619
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5619.patch, hbase-5619_v3.patch, hbase-5619_v4.patch

Subtask of HBASE-5443: separate HRegionInterface into an admin protocol and a client protocol, and create the PB protocol buffer files.
[jira] [Commented] (HBASE-5619) Create PB protocols for HRegionInterface
[ https://issues.apache.org/jira/browse/HBASE-5619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241622#comment-13241622 ]

Jimmy Xiang commented on HBASE-5619:

@Stack, could you please install protoc and give it a try again? From now on, we need protoc to compile. :)

Create PB protocols for HRegionInterface
Key: HBASE-5619
URL: https://issues.apache.org/jira/browse/HBASE-5619
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5619.patch, hbase-5619_v3.patch, hbase-5619_v4.patch

Subtask of HBASE-5443: separate HRegionInterface into an admin protocol and a client protocol, and create the PB protocol buffer files.
[jira] [Commented] (HBASE-5619) Create PB protocols for HRegionInterface
[ https://issues.apache.org/jira/browse/HBASE-5619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241819#comment-13241819 ]

Jimmy Xiang commented on HBASE-5619:

@Stack, so far I could not find a good protoc Maven plugin. I don't remember whether I tried to install it on my Ubuntu box. Here is the download site for the protobuf compiler: http://code.google.com/p/protobuf/downloads/list. For Linux, I think it is easy to install with rpm/apt-get.

Create PB protocols for HRegionInterface
Key: HBASE-5619
URL: https://issues.apache.org/jira/browse/HBASE-5619
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5619.patch, hbase-5619_v3.patch, hbase-5619_v4.patch

Subtask of HBASE-5443: separate HRegionInterface into an admin protocol and a client protocol, and create the PB protocol buffer files.
[jira] [Commented] (HBASE-5667) RegexStringComparator supports java.util.regex.Pattern flags
[ https://issues.apache.org/jira/browse/HBASE-5667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241861#comment-13241861 ]

Jimmy Xiang commented on HBASE-5667:

@Stack, I prefer to change them to PB, so we should not bother making them VersionedWritables for now. We have lots of filters; we need to abstract them out and have a generic way to define them in PB.

RegexStringComparator supports java.util.regex.Pattern flags
Key: HBASE-5667
URL: https://issues.apache.org/jira/browse/HBASE-5667
Project: HBase
Issue Type: Improvement
Components: filters
Reporter: David Arthur
Priority: Minor
Attachments: HBASE-5667.diff

* Add a constructor that takes in a Pattern
* Add the Pattern's flags to the Writable fields, and actually use them when recomposing the Filter
[jira] [Commented] (HBASE-5667) RegexStringComparator supports java.util.regex.Pattern flags
[ https://issues.apache.org/jira/browse/HBASE-5667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241870#comment-13241870 ]

Jimmy Xiang commented on HBASE-5667:

This patch changes the constructor of RegexStringComparator. A Pattern is hard to PB. Can we specify the flags in a different way, for example using a string and/or some primitive parameters?

RegexStringComparator supports java.util.regex.Pattern flags
Key: HBASE-5667
URL: https://issues.apache.org/jira/browse/HBASE-5667
Project: HBase
Issue Type: Improvement
Components: filters
Reporter: David Arthur
Priority: Minor
Attachments: HBASE-5667.diff

* Add a constructor that takes in a Pattern
* Add the Pattern's flags to the Writable fields, and actually use them when recomposing the Filter
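To illustrate the primitive-parameter suggestion: java.util.regex flags are already a plain int bitmask, so a flags-taking constructor (the kind this patch proposes) serializes trivially, while a full Pattern object would not. A sketch assuming such a constructor exists:

import java.util.regex.Pattern;

import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;

public class RegexFlagsExample {
  public static void main(String[] args) {
    // Flags travel as an int bitmask, so no Pattern object crosses the wire.
    RegexStringComparator cmp =
        new RegexStringComparator("^user_.*", Pattern.CASE_INSENSITIVE);
    RowFilter filter = new RowFilter(CompareOp.EQUAL, cmp);
    System.out.println("filter built: " + filter);
  }
}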
[jira] [Commented] (HBASE-5619) Create PB protocols for HRegionInterface
[ https://issues.apache.org/jira/browse/HBASE-5619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241880#comment-13241880 ]

Jimmy Xiang commented on HBASE-5619:

For proto files, other projects depend on protoc too, for example Hadoop/HDFS. Since we are moving towards PB, a protoc dependency should be fine. I can try to set up a temporary protoc dynamically.

Create PB protocols for HRegionInterface
Key: HBASE-5619
URL: https://issues.apache.org/jira/browse/HBASE-5619
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5619.patch, hbase-5619_v3.patch, hbase-5619_v4.patch

Subtask of HBASE-5443: separate HRegionInterface into an admin protocol and a client protocol, and create the PB protocol buffer files.
[jira] [Commented] (HBASE-5667) RegexStringComparator supports java.util.regex.Pattern flags
[ https://issues.apache.org/jira/browse/HBASE-5667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241914#comment-13241914 ]

Jimmy Xiang commented on HBASE-5667:

Pattern is fine if we can get it.

RegexStringComparator supports java.util.regex.Pattern flags
Key: HBASE-5667
URL: https://issues.apache.org/jira/browse/HBASE-5667
Project: HBase
Issue Type: Improvement
Components: filters
Reporter: David Arthur
Priority: Minor
Attachments: HBASE-5667.diff

* Add a constructor that takes in a Pattern
* Add the Pattern's flags to the Writable fields, and actually use them when recomposing the Filter
[jira] [Commented] (HBASE-5619) Create PB protocols for HRegionInterface
[ https://issues.apache.org/jira/browse/HBASE-5619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241954#comment-13241954 ]

Jimmy Xiang commented on HBASE-5619:

That's what we do for Thrift now, not Avro.

Create PB protocols for HRegionInterface
Key: HBASE-5619
URL: https://issues.apache.org/jira/browse/HBASE-5619
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: hbase-5619.patch, hbase-5619_v3.patch, hbase-5619_v4.patch

Subtask of HBASE-5443: separate HRegionInterface into an admin protocol and a client protocol, and create the PB protocol buffer files.
[jira] [Commented] (HBASE-5606) SplitLogManager async delete node hangs log splitting when ZK connection is lost
[ https://issues.apache.org/jira/browse/HBASE-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238811#comment-13238811 ]

Jimmy Xiang commented on HBASE-5606:

@Prakash, could there be other places where a failed delete can cause this issue? Would it be a cleaner fix to change the async delete to a sync delete? With a sync delete, we can avoid all these racing problems, and each retry gets a fresh start.

SplitLogManager async delete node hangs log splitting when ZK connection is lost
Key: HBASE-5606
URL: https://issues.apache.org/jira/browse/HBASE-5606
Project: HBase
Issue Type: Bug
Components: wal
Affects Versions: 0.92.0
Reporter: Gopinathan A
Priority: Critical
Fix For: 0.92.2
Attachments: 0001-HBASE-5606-SplitLogManger-async-delete-node-hangs-lo.patch, 5606.txt

1. One RS died; the ServerShutdownHandler found this and started distributed log splitting.
2. All tasks failed because the ZK connection was lost, so all the tasks were deleted asynchronously.
3. The ServerShutdownHandler retried the log splitting.
4. The asynchronous deletion from step 2 finally happened, removing the new task.
5. This left the SplitLogManager in a hanging state, so the .META. region was not assigned for a long time.

{noformat}
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(55413,79):2012-03-14 19:28:47,932 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(89303,79):2012-03-14 19:34:32,387 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
{noformat}

{noformat}
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(80417,99):2012-03-14 19:34:31,196 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(89456,99):2012-03-14 19:34:32,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
{noformat}
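A minimal sketch of the synchronous-delete idea (the helper name and retry policy are illustrative, not from any patch here): block until the task znode is really gone, so a retried split cannot race with a deletion still in flight.

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class SyncDelete {
  // Returns only once the znode is known to be deleted (or already absent),
  // so the caller can safely resubmit split tasks afterwards.
  static void deleteTaskNode(ZooKeeper zk, String path, int maxRetries)
      throws KeeperException, InterruptedException {
    for (int i = 0; i < maxRetries; i++) {
      try {
        zk.delete(path, -1);                       // -1 matches any version
        return;                                    // deleted for sure
      } catch (KeeperException.NoNodeException e) {
        return;                                    // already gone: same outcome
      } catch (KeeperException.ConnectionLossException e) {
        Thread.sleep(1000);                        // transient: retry
      }
    }
    // Past this point the caller should abort rather than resubmit tasks.
    throw new IllegalStateException("could not delete " + path);
  }
}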
[jira] [Commented] (HBASE-5606) SplitLogManager async delete node hangs log splitting when ZK connection is lost
[ https://issues.apache.org/jira/browse/HBASE-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237647#comment-13237647 ]

Jimmy Xiang commented on HBASE-5606:

This is a similar issue to HBASE-5081, right? Would my original fix proposed for HBASE-5081 help: don't retry distributed log splitting before the tasks are actually deleted? We can abort the master after several retries to delete the tasks.

SplitLogManager async delete node hangs log splitting when ZK connection is lost
Key: HBASE-5606
URL: https://issues.apache.org/jira/browse/HBASE-5606
Project: HBase
Issue Type: Bug
Components: wal
Affects Versions: 0.92.0
Reporter: Gopinathan A
Priority: Critical
Fix For: 0.92.2
Attachments: 5606.txt

1. One RS died; the ServerShutdownHandler found this and started distributed log splitting.
2. All tasks failed because the ZK connection was lost, so all the tasks were deleted asynchronously.
3. The ServerShutdownHandler retried the log splitting.
4. The asynchronous deletion from step 2 finally happened, removing the new task.
5. This left the SplitLogManager in a hanging state, so the .META. region was not assigned for a long time.

{noformat}
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(55413,79):2012-03-14 19:28:47,932 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(89303,79):2012-03-14 19:34:32,387 DEBUG org.apache.hadoop.hbase.master.SplitLogManager: put up splitlog task at znode /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
{noformat}

{noformat}
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(80417,99):2012-03-14 19:34:31,196 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
hbase-root-master-HOST-192-168-47-204.log.2012-03-14(89456,99):2012-03-14 19:34:32,497 DEBUG org.apache.hadoop.hbase.master.SplitLogManager$DeleteAsyncCallback: deleted /hbase/splitlog/hdfs%3A%2F%2F192.168.47.205%3A9000%2Fhbase%2F.logs%2Flinux-114.site%2C60020%2C1331720381665-splitting%2Flinux-114.site%252C60020%252C1331720381665.1331752316170
{noformat}
[jira] [Commented] (HBASE-5443) Add PB-based calls to HRegionInterface
[ https://issues.apache.org/jira/browse/HBASE-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235780#comment-13235780 ]

Jimmy Xiang commented on HBASE-5443:

I have made some code changes, and some tests failed. It is very hard to look into them all at once, so I'd like to break the work into small pieces and tackle them one by one.

Add PB-based calls to HRegionInterface
Key: HBASE-5443
URL: https://issues.apache.org/jira/browse/HBASE-5443
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Todd Lipcon
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: region_java-proto-mapping.pdf
[jira] [Commented] (HBASE-5443) Add PB-based calls to HRegionInterface
[ https://issues.apache.org/jira/browse/HBASE-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221155#comment-13221155 ]

Jimmy Xiang commented on HBASE-5443:

I updated the review with a new diff, which incorporates the feedback from all reviewers. Thanks a lot for the reviews.

Add PB-based calls to HRegionInterface
Key: HBASE-5443
URL: https://issues.apache.org/jira/browse/HBASE-5443
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Todd Lipcon
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: region_java-proto-mapping.pdf
[jira] [Commented] (HBASE-5451) Switch RPC call envelope/headers to PBs
[ https://issues.apache.org/jira/browse/HBASE-5451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221166#comment-13221166 ]

Jimmy Xiang commented on HBASE-5451:

I hope we can. I know the RPC won't be backward compatible. How about the client code? We definitely won't break any existing client applications, right?

Switch RPC call envelope/headers to PBs
Key: HBASE-5451
URL: https://issues.apache.org/jira/browse/HBASE-5451
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Affects Versions: 0.94.0
Reporter: Todd Lipcon
Assignee: Devaraj Das
Fix For: 0.96.0
Attachments: rpc-proto.2.txt, rpc-proto.3.txt, rpc-proto.patch.1_2
[jira] [Commented] (HBASE-5451) Switch RPC call envelope/headers to PBs
[ https://issues.apache.org/jira/browse/HBASE-5451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220249#comment-13220249 ]

Jimmy Xiang commented on HBASE-5451:

I did a quick review last night. Looks OK to me. For the pom change, we have the same change, so it should be fine. I put my generated files under org.apache.hadoop.hbase.protobuf. Should I put them under org.apache.hadoop.hbase.ipc.protobuf too?

Switch RPC call envelope/headers to PBs
Key: HBASE-5451
URL: https://issues.apache.org/jira/browse/HBASE-5451
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Affects Versions: 0.94.0
Reporter: Todd Lipcon
Assignee: Devaraj Das
Fix For: 0.96.0
Attachments: rpc-proto.2.txt, rpc-proto.3.txt, rpc-proto.patch.1_2
[jira] [Commented] (HBASE-3909) Add dynamic config
[ https://issues.apache.org/jira/browse/HBASE-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13220404#comment-13220404 ]

Jimmy Xiang commented on HBASE-3909:

If they lose them, it could be very bad. It may be too late by the time someone sees something weird and realizes their configs are gone. I think it is safer to persist them somewhere.

Add dynamic config
Key: HBASE-3909
URL: https://issues.apache.org/jira/browse/HBASE-3909
Project: HBase
Issue Type: Bug
Reporter: stack
Fix For: 0.96.0

I'm sure this issue exists already, at least as part of the discussion around making online schema edits possible, but there is no harm in this having its own issue. Ted started a conversation on this topic up on dev, and Todd suggested we look at how Hadoop did it over in HADOOP-7001.
[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFailover and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217364#comment-13217364 ]

Jimmy Xiang commented on HBASE-5270:

@Stack, I agree. I think we should reuse the existing exception if we can.

Handle potential data loss due to concurrent processing of processFailover and ServerShutdownHandler
Key: HBASE-5270
URL: https://issues.apache.org/jira/browse/HBASE-5270
Project: HBase
Issue Type: Sub-task
Components: master
Reporter: Zhihong Yu
Assignee: chunhui shen
Fix For: 0.92.1, 0.94.0
Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 5270-testcasev2.patch, hbase-5270.patch, hbase-5270v2.patch, hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, sampletest.txt

This JIRA continues the effort from HBASE-5179. Starting with Stack's comments about patches for 0.92 and TRUNK:

Reviewing 0.92v17: isDeadServerInProgress is a new public method in ServerManager, but it does not seem to be used anywhere. Does isDeadRootServerInProgress need to be public? Ditto for the meta version. These method param names are not right: 'definitiveRootServer'; what is meant by definitive? Do they need this qualifier? Is there anything in place to stop us expiring a server twice if it's carrying root and meta? What is the difference between asking the assignment manager isCarryingRoot and this variable that is passed in? Should be doc'd at least. Ditto for meta.

I think I've asked for this a few times: onlineServers needs to be explained, either in javadoc or in a comment. This is the param passed into joinCluster. How does it arise? I think I know, but am unsure. God love the poor noob that comes awandering this code trying to make sense of it all. It looks like we get the list by trawling ZK for regionserver znodes that have not checked in. Don't we do this operation earlier in master setup? Are we doing it again here?

Though distributed log splitting is configured, we will do single-process splitting in the master under some conditions with this patch. It's not explained in the code why we would do this. Why do we think master log splitting 'high priority' when it could very well be slower? Should we only go this route if distributed splitting is not going on? Do we know whether concurrent distributed log splitting and master splitting work together?

Why would we have dead servers in progress here in master startup? Because a ServerShutdownHandler fired?

This patch is different from the patch for 0.90. It should go into trunk first with tests, then 0.92. Should it be in this issue? This issue is really hard to follow now. Maybe this issue is for 0.90.x, and a new issue for more work on this trunk patch? This patch needs to have the v18 differences applied.
[jira] [Commented] (HBASE-5443) Add PB-based calls to HRegionInterface
[ https://issues.apache.org/jira/browse/HBASE-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13217442#comment-13217442 ]

Jimmy Xiang commented on HBASE-5443:

We can still support multi(MultiAction). Should we still support it in the RPC layer, or can we put some logic on the client side, like aggregating the actions by region, action type (put/delete/get), and so on? See the sketch below.

Add PB-based calls to HRegionInterface
Key: HBASE-5443
URL: https://issues.apache.org/jira/browse/HBASE-5443
Project: HBase
Issue Type: Sub-task
Components: ipc, master, migration, regionserver
Reporter: Todd Lipcon
Assignee: Jimmy Xiang
Fix For: 0.96.0
Attachments: region_java-proto-mapping.pdf
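A sketch of the client-side aggregation idea (the grouping helper is hypothetical): bucket actions by owning region before shipping one request per server. Put, Delete, and Get all implement Row in this era of the client API.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Row;

public class ActionGrouping {
  // Group a mixed list of actions (Put/Delete/Get all implement Row)
  // by the region that owns each row.
  static Map<HRegionLocation, List<Row>> groupByRegion(
      HTable table, List<Row> actions) throws IOException {
    Map<HRegionLocation, List<Row>> byRegion =
        new HashMap<HRegionLocation, List<Row>>();
    for (Row action : actions) {
      HRegionLocation loc = table.getRegionLocation(action.getRow());
      List<Row> bucket = byRegion.get(loc);
      if (bucket == null) {
        bucket = new ArrayList<Row>();
        byRegion.put(loc, bucket);
      }
      bucket.add(action);
    }
    return byRegion; // one PB multi-request could then go to each region
  }
}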
[jira] [Commented] (HBASE-3909) Add dynamic config
[ https://issues.apache.org/jira/browse/HBASE-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216758#comment-13216758 ]

Jimmy Xiang commented on HBASE-3909:

@Stack, we don't have to poll the FS to find changes. We can just put the last-modified date of the file in ZK; once it changes, we can load the file again. When a new region server joins a cluster, it should always check whether any configuration has changed, based on the configuration file's last-modified date, which acts as a version number for the file.

Add dynamic config
Key: HBASE-3909
URL: https://issues.apache.org/jira/browse/HBASE-3909
Project: HBase
Issue Type: Bug
Reporter: stack
Fix For: 0.94.0

I'm sure this issue exists already, at least as part of the discussion around making online schema edits possible, but there is no harm in this having its own issue. Ted started a conversation on this topic up on dev, and Todd suggested we look at how Hadoop did it over in HADOOP-7001.
[jira] [Commented] (HBASE-3909) Add dynamic config
[ https://issues.apache.org/jira/browse/HBASE-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13216095#comment-13216095 ] Jimmy Xiang commented on HBASE-3909: Can we put dynamic configuration somewhere in HDFS, for example, in a file under hbase.rootdir? We can keep static configuration in hbase-site.xml, and dynamic configuration in a file under hbase.rootdir. We can also enhance the hbase shell or the master UI to view/change those dynamic configurations. Add dynamic config -- Key: HBASE-3909 URL: https://issues.apache.org/jira/browse/HBASE-3909 Project: HBase Issue Type: Bug Reporter: stack Fix For: 0.94.0 I'm sure this issue exists already, at least as part of the discussion around making online schema edits possible, but no harm in this having its own issue. Ted started a conversation on this topic up on dev and Todd suggested we look at how Hadoop did it over in HADOOP-7001 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
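A sketch of the layering proposed above; the file name dynamic-conf.xml is invented:
{code}
// Static config from hbase-site.xml, dynamic config layered on top from HDFS.
Configuration conf = HBaseConfiguration.create();
Path rootDir = new Path(conf.get("hbase.rootdir"));
Path dynamicConf = new Path(rootDir, "dynamic-conf.xml");  // hypothetical file
FileSystem fs = rootDir.getFileSystem(conf);
if (fs.exists(dynamicConf)) {
  conf.addResource(fs.open(dynamicConf));
}
{code}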
[jira] [Commented] (HBASE-3909) Add dynamic config
[ https://issues.apache.org/jira/browse/HBASE-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13216105#comment-13216105 ] Jimmy Xiang commented on HBASE-3909: For these dynamic configurations, we can cache them in memory and create a separate thread to re-load the cache periodically, so the reload is transparent to the configuration reader. Add dynamic config -- Key: HBASE-3909 URL: https://issues.apache.org/jira/browse/HBASE-3909 Project: HBase Issue Type: Bug Reporter: stack Fix For: 0.94.0 I'm sure this issue exists already, at least as part of the discussion around making online schema edits possible, but no harm in this having its own issue. Ted started a conversation on this topic up on dev and Todd suggested we look at how Hadoop did it over in HADOOP-7001 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
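A sketch of that cache-plus-reloader shape, with the names and the period invented here:
{code}
// Readers always hit the in-memory copy; a background thread refreshes it.
private volatile Configuration cached = loadDynamicConf();  // hypothetical loader

void startReloader() {
  ScheduledExecutorService reloader = Executors.newSingleThreadScheduledExecutor();
  reloader.scheduleWithFixedDelay(new Runnable() {
    public void run() {
      try {
        cached = loadDynamicConf();  // volatile write: readers see the new copy
      } catch (Exception e) {
        // keep serving the stale copy; log and retry next period
      }
    }
  }, 60, 60, TimeUnit.SECONDS);
}
{code}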
[jira] [Commented] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler
[ https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13216114#comment-13216114 ] Jimmy Xiang commented on HBASE-5270: Instead of introducing safe mode, can we add something to the RPC server so that it doesn't serve traffic before the actual server is ready, for example, fully initialized? Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler - Key: HBASE-5270 URL: https://issues.apache.org/jira/browse/HBASE-5270 Project: HBase Issue Type: Sub-task Components: master Reporter: Zhihong Yu Assignee: chunhui shen Fix For: 0.92.1, 0.94.0 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 5270-testcasev2.patch, hbase-5270.patch, hbase-5270v2.patch, hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, sampletest.txt This JIRA continues the effort from HBASE-5179. Starting with Stack's comments about patches for 0.92 and TRUNK: Reviewing 0.92v17: isDeadServerInProgress is a new public method in ServerManager but it does not seem to be used anywhere. Does isDeadRootServerInProgress need to be public? Ditto for the meta version. This method's param names are not right: 'definitiveRootServer'; what is meant by definitive? Do they need this qualifier? Is there anything in place to stop us expiring a server twice if it's carrying root and meta? What is the difference between asking the assignment manager isCarryingRoot and this variable that is passed in? Should be doc'd at least. Ditto for meta. I think I've asked for this a few times - onlineServers needs to be explained... either in javadoc or in a comment. This is the param passed into joinCluster. How does it arise? I think I know but am unsure. God love the poor noob that comes awandering this code trying to make sense of it all. It looks like we get the list by trawling zk for regionserver znodes that have not checked in. Don't we do this operation earlier in master setup? Are we doing it again here? Though distributed log splitting is configured, with this patch we will do single-process splitting in the master under some conditions. It's not explained in the code why we would do this. Why do we think master log splitting 'high priority' when it could very well be slower? Should we only go this route if distributed splitting is not going on? Do we know if concurrent distributed log splitting and master splitting work together? Why would we have dead servers in progress here in master startup? Because a servershutdownhandler fired? This patch is different from the patch for 0.90. It should go into trunk first with tests, then 0.92. Should it be in this issue? This issue is really hard to follow now. Maybe this issue is for 0.90.x and a new issue for more work on this trunk patch? This patch needs to have the v18 differences applied. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
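A sketch of the readiness gate proposed above (the dispatch wiring is invented; HBase's actual exception type for this case differs by version):
{code}
private final AtomicBoolean ready = new AtomicBoolean(false);

// Called for every incoming RPC before it is dispatched.
Object call(Object invocation) throws IOException {
  if (!ready.get()) {
    // Refuse to serve traffic until initialization has finished.
    throw new IOException("Server not yet ready to serve calls");
  }
  return dispatch(invocation);  // hypothetical dispatch hook
}

void markInitialized() {
  ready.set(true);  // flipped once the master/regionserver is fully up
}
{code}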
[jira] [Commented] (HBASE-3909) Add dynamic config
[ https://issues.apache.org/jira/browse/HBASE-3909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13216131#comment-13216131 ] Jimmy Xiang commented on HBASE-3909: Yes, I meant transparent to the configuration reader. My assumption is that the change doesn't have to take effect right away; some delay is fine. If we really want to use ZK, we can still use a central file as the persistence layer. Add dynamic config -- Key: HBASE-3909 URL: https://issues.apache.org/jira/browse/HBASE-3909 Project: HBase Issue Type: Bug Reporter: stack Fix For: 0.94.0 I'm sure this issue exists already, at least as part of the discussion around making online schema edits possible, but no harm in this having its own issue. Ted started a conversation on this topic up on dev and Todd suggested we look at how Hadoop did it over in HADOOP-7001 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5472) LoadIncrementalHFiles loops forever if the target table misses a CF
[ https://issues.apache.org/jira/browse/HBASE-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13216141#comment-13216141 ] Jimmy Xiang commented on HBASE-5472: In such a case, should the tool ignore the missing column family, or just error out? LoadIncrementalHFiles loops forever if the target table misses a CF --- Key: HBASE-5472 URL: https://issues.apache.org/jira/browse/HBASE-5472 Project: HBase Issue Type: Bug Components: mapreduce Reporter: Lars Hofhansl Priority: Minor I have some HFiles for two column families 'y','z', but I specified a target table that only has CF 'y'. I see the following repeated forever. ... 12/02/23 22:57:37 WARN mapreduce.LoadIncrementalHFiles: Attempt to bulk load region containing into table z with files [family:y path:hdfs://bunnypig:9000/bulk/z2/y/bd6f1c3cc8b443fc9e9e5fddcdaa3b09, family:z path:hdfs://bunnypig:9000/bulk/z2/z/38f12fdbb7de40e8bf0e6489ef34365d] failed. This is recoverable and they will be retried. 12/02/23 22:57:37 DEBUG client.MetaScanner: Scanning .META. starting at row=z,,00 for max=2147483647 rows using org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@7b7a4989 12/02/23 22:57:37 INFO mapreduce.LoadIncrementalHFiles: Split occured while grouping HFiles, retry attempt 1596 with 2 files remaining to group or split 12/02/23 22:57:37 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://bunnypig:9000/bulk/z2/y/bd6f1c3cc8b443fc9e9e5fddcdaa3b09 first=r last=r 12/02/23 22:57:37 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://bunnypig:9000/bulk/z2/z/38f12fdbb7de40e8bf0e6489ef34365d first=r last=r 12/02/23 22:57:37 DEBUG mapreduce.LoadIncrementalHFiles: Going to connect to server region=z,,1330066309814.d5fa76a38c9565f614755e34eacf8316., hostname=localhost, port=60020 for row ... -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4403) Adopt interface stability/audience classifications from Hadoop
[ https://issues.apache.org/jira/browse/HBASE-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13214146#comment-13214146 ] Jimmy Xiang commented on HBASE-4403: It should use hbase-4403.patch instead of hbase-4403-interface_v3.txt. :) Just retried. Adopt interface stability/audience classifications from Hadoop -- Key: HBASE-4403 URL: https://issues.apache.org/jira/browse/HBASE-4403 Project: HBase Issue Type: Task Affects Versions: 0.90.5, 0.92.0 Reporter: Todd Lipcon Assignee: Jimmy Xiang Fix For: 0.94.0 Attachments: hbase-4403-interface.txt, hbase-4403-interface_v2.txt, hbase-4403-interface_v3.txt, hbase-4403-nowhere-near-done.txt, hbase-4403.patch, hbase-4403.patch As HBase gets more widely used, we need to be more explicit about which APIs are stable and not expected to break between versions, which APIs are still evolving, etc. We also have many public classes that are really internal to the RS or Master and not meant to be used by users. Hadoop has adopted a classification scheme for audience (public, private, or limited-private) as well as stability (stable, evolving, unstable). I think we should copy these annotations to HBase and start to classify our public classes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4403) Adopt interface stability/audience classifications from Hadoop
[ https://issues.apache.org/jira/browse/HBASE-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210666#comment-13210666 ] Jimmy Xiang commented on HBASE-4403: Sounds great. Adopt interface stability/audience classifications from Hadoop -- Key: HBASE-4403 URL: https://issues.apache.org/jira/browse/HBASE-4403 Project: HBase Issue Type: Task Affects Versions: 0.90.5, 0.92.0 Reporter: Todd Lipcon Assignee: Jimmy Xiang Attachments: hbase-4403-interface.txt, hbase-4403-interface_v2.txt, hbase-4403-nowhere-near-done.txt As HBase gets more widely used, we need to be more explicit about which APIs are stable and not expected to break between versions, which APIs are still evolving, etc. We also have many public classes that are really internal to the RS or Master and not meant to be used by users. Hadoop has adopted a classification scheme for audience (public, private, or limited-private) as well as stability (stable, evolving, unstable). I think we should copy these annotations to HBase and start to classify our public classes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4403) Adopt interface stability/audience classifications from Hadoop
[ https://issues.apache.org/jira/browse/HBASE-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13209738#comment-13209738 ] Jimmy Xiang commented on HBASE-4403: @Stack, thanks a lot for the review. I will incorporate the changes into the next version. How about the coprocessor and REST related classes? As for the classification definitions, HADOOP-5073 has some info and background. Todd has a short summary which is very good: {quote} if it's Private, we can change it (and don't need a stability mark). If it's public but unstable, we can change it. If it's public/evolving, we're allowed to change it but should try not to. If it's public and stable we can't change it without a deprecation path or a GREAT reason. {quote} Adopt interface stability/audience classifications from Hadoop -- Key: HBASE-4403 URL: https://issues.apache.org/jira/browse/HBASE-4403 Project: HBase Issue Type: Task Affects Versions: 0.90.5, 0.92.0 Reporter: Todd Lipcon Assignee: Jimmy Xiang Attachments: hbase-4403-interface.txt, hbase-4403-nowhere-near-done.txt As HBase gets more widely used, we need to be more explicit about which APIs are stable and not expected to break between versions, which APIs are still evolving, etc. We also have many public classes that are really internal to the RS or Master and not meant to be used by users. Hadoop has adopted a classification scheme for audience (public, private, or limited-private) as well as stability (stable, evolving, unstable). I think we should copy these annotations to HBase and start to classify our public classes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
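For reference, the Hadoop annotations under discussion look like this in use (the class here is invented):
{code}
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

@InterfaceAudience.Public
@InterfaceStability.Evolving
public class SomeClientFacingClass {
  // Public but evolving: callers may use it; changes are allowed but discouraged.
}
{code}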
[jira] [Commented] (HBASE-4403) Adopt interface stability/audience classifications from Hadoop
[ https://issues.apache.org/jira/browse/HBASE-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13209772#comment-13209772 ] Jimmy Xiang commented on HBASE-4403: Cool, thanks. Please add the definitions to the book in a new jira. This one may take a while. Adopt interface stability/audience classifications from Hadoop -- Key: HBASE-4403 URL: https://issues.apache.org/jira/browse/HBASE-4403 Project: HBase Issue Type: Task Affects Versions: 0.90.5, 0.92.0 Reporter: Todd Lipcon Assignee: Jimmy Xiang Attachments: hbase-4403-interface.txt, hbase-4403-nowhere-near-done.txt As HBase gets more widely used, we need to be more explicit about which APIs are stable and not expected to break between versions, which APIs are still evolving, etc. We also have many public classes that are really internal to the RS or Master and not meant to be used by users. Hadoop has adopted a classification scheme for audience (public, private, or limited-private) as well as stability (stable, evolving, unstable). I think we should copy these annotations to HBase and start to classify our public classes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4403) Adopt interface stability/audience classifications from Hadoop
[ https://issues.apache.org/jira/browse/HBASE-4403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13209901#comment-13209901 ] Jimmy Xiang commented on HBASE-4403: Yes, I will do that in a separate jira. Adopt interface stability/audience classifications from Hadoop -- Key: HBASE-4403 URL: https://issues.apache.org/jira/browse/HBASE-4403 Project: HBase Issue Type: Task Affects Versions: 0.90.5, 0.92.0 Reporter: Todd Lipcon Assignee: Jimmy Xiang Attachments: hbase-4403-interface.txt, hbase-4403-nowhere-near-done.txt As HBase gets more widely used, we need to be more explicit about which APIs are stable and not expected to break between versions, which APIs are still evolving, etc. We also have many public classes that are really internal to the RS or Master and not meant to be used by users. Hadoop has adopted a classification scheme for audience (public, private, or limited-private) as well as stability (stable, evolving, unstable). I think we should copy these annotations to HBase and start to classify our public classes. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5394) Add ability to include Protobufs in HbaseObjectWritable
[ https://issues.apache.org/jira/browse/HBASE-5394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208584#comment-13208584 ] Jimmy Xiang commented on HBASE-5394: These tests passed on my local box. Add ability to include Protobufs in HbaseObjectWritable --- Key: HBASE-5394 URL: https://issues.apache.org/jira/browse/HBASE-5394 Project: HBase Issue Type: Improvement Affects Versions: 0.94.0 Reporter: Zhihong Yu Assignee: Jimmy Xiang Fix For: 0.94.0 Attachments: hbase-5394.txt This is a port of HADOOP-7379. It adds cases to HbaseObjectWritable to handle subclasses of Message, the superclass of codegenned protobufs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
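Roughly, the HADOOP-7379 approach being ported: recognize Message subclasses and write their serialized bytes (a sketch, not the actual patch; out is a DataOutput):
{code}
if (instanceObj instanceof Message) {
  byte[] bytes = ((Message) instanceObj).toByteArray();
  out.writeInt(bytes.length);
  out.write(bytes);
  // Deserialization needs the concrete class, e.g. by also writing the class
  // name and invoking its static parseFrom(byte[]) method reflectively.
}
{code}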
[jira] [Commented] (HBASE-5398) HBase shell disable_all/enable_all/drop_all prompt wrong tables for confirmation
[ https://issues.apache.org/jira/browse/HBASE-5398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207914#comment-13207914 ] Jimmy Xiang commented on HBASE-5398: Yes, it takes a regex pattern and disables all matched tables. Joey added this feature in HBASE-3506. HBase shell disable_all/enable_all/drop_all prompt wrong tables for confirmation --- Key: HBASE-5398 URL: https://issues.apache.org/jira/browse/HBASE-5398 Project: HBase Issue Type: Bug Components: scripts Affects Versions: 0.94.0, 0.92.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.94.0, 0.92.0 Attachments: hbase-5398.patch When using the hbase shell to disable_all/enable_all/drop_all tables, the tables prompted for confirmation are wrong. For example, disable_all 'test*' will ask for confirmation to disable tables like mytest1 and test123. Fortunately, these tables will not actually be disabled, since Java pattern matching doesn't work this way. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
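The mismatch is easy to demonstrate in plain Java: the shell listed candidates as if 'test*' were a file-style glob, while java.util.regex reads it as 'tes' followed by zero or more 't's:
{code}
import java.util.regex.Pattern;

public class GlobVsRegex {
  public static void main(String[] args) {
    Pattern p = Pattern.compile("test*");
    System.out.println(p.matcher("mytest1").find());     // true: contains "test"
    System.out.println(p.matcher("test123").matches());  // false: trailing "123" unmatched
    System.out.println(p.matcher("tes").matches());      // true: '*' makes the last 't' optional
  }
}
{code}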
[jira] [Commented] (HBASE-5312) Closed parent region present in Hlog.lastSeqWritten
[ https://issues.apache.org/jira/browse/HBASE-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205596#comment-13205596 ] Jimmy Xiang commented on HBASE-5312: I checked the lock mechanism and it looks fine. If it is not a bug in Java's ReentrantLock, I suspect the region is removed from the online regions list before it is properly closed, either during region splitting or region closing. Closed parent region present in Hlog.lastSeqWritten --- Key: HBASE-5312 URL: https://issues.apache.org/jira/browse/HBASE-5312 Project: HBase Issue Type: Bug Affects Versions: 0.90.5 Reporter: ramkrishna.s.vasudevan Fix For: 0.90.7 This is in reference to the mail sent to the dev mailing list, Closed parent region present in Hlog.lastSeqWritten. The scenario described is: We had a region that was split into two daughters. When the hlog roll tried to flush the region, there was an entry in HLog.lastSeqWritten that was not flushed or removed from lastSeqWritten during the parent close. Because this flush was not happening, subsequent flushes were getting blocked
{code}
05:06:44,422 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=122, maxlogs=32; forcing flush of 1 regions(s): 2acaf8e3acfd2e8a5825a1f6f0aca4a8
05:06:44,422 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 2acaf8e3acfd2e8a5825a1f6f0aca4a8r=null, requester=null
05:10:48,666 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=123, maxlogs=32; forcing flush of 1 regions(s): 2acaf8e3acfd2e8a5825a1f6f0aca4a8
05:10:48,666 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 2acaf8e3acfd2e8a5825a1f6f0aca4a8r=null, requester=null
05:14:46,075 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=124, maxlogs=32; forcing flush of 1 regions(s): 2acaf8e3acfd2e8a5825a1f6f0aca4a8
05:14:46,075 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 2acaf8e3acfd2e8a5825a1f6f0aca4a8r=null, requester=null
05:15:41,584 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=125, maxlogs=32; forcing flush of 1 regions(s): 2acaf8e3acfd2e8a5825a1f6f0aca4a8
05:15:41,584 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 2acaf8e3acfd2e8a5825a1f6f0aca4a8r=null,
{code}
Let's see what happened for the region 2acaf8e3acfd2e8a5825a1f6f0aca4a8
{code}
2012-01-06 00:30:55,214 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://192.168.1.103:9000/hbase/Htable_UFDR_031/2acaf8e3acfd2e8a5825a1f6f0aca4a8/.tmp/1755862026714756815 to hdfs://192.168.1.103:9000/hbase/Htable_UFDR_031/2acaf8e3acfd2e8a5825a1f6f0aca4a8/value/973789709483406123
2012-01-06 00:30:58,946 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated Htable_UFDR_016,049790700093168-0456520,1325809837958.0ebe5bd7fcbc09ee074d5600b9d4e062.
2012-01-06 00:30:59,614 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://192.168.1.103:9000/hbase/Htable_UFDR_031/2acaf8e3acfd2e8a5825a1f6f0aca4a8/value/973789709483406123, entries=7537, sequenceid=20312223, memsize=4.2m, filesize=2.9m
2012-01-06 00:30:59,787 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting, commencing flushing stores
2012-01-06 00:30:59,787 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~133.5m for region Htable_UFDR_031,00332,1325808823997.2acaf8e3acfd2e8a5825a1f6f0aca4a8.
in 21816ms, sequenceid=20312223, compaction requested=true
2012-01-06 00:30:59,787 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for Htable_UFDR_031,00332,1325808823997.2acaf8e3acfd2e8a5825a1f6f0aca4a8. because regionserver20020.cacheFlusher; priority=0, compaction queue size=5840
{code}
A user-triggered split has been issued to this region, which can be seen in the above logs. The flushing of this region has resulted in seq id 20312223. The region has been split and the parent region has been closed
{code}
00:31:12,607 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region Htable_UFDR_031,00332,1325808823997.2acaf8e3acfd2e8a5825a1f6f0aca4a8.
00:31:13,694 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing Htable_UFDR_031,00332,1325808823997.2acaf8e3acfd2e8a5825a1f6f0aca4a8.: disabling compactions flushes
00:31:13,694 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region Htable_UFDR_031,00332,1325808823997.2acaf8e3acfd2e8a5825a1f6f0aca4a8.
00:31:13,718 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed
[jira] [Commented] (HBASE-5376) Add more logging to triage HBASE-5312: Closed parent region present in Hlog.lastSeqWritten
[ https://issues.apache.org/jira/browse/HBASE-5376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205636#comment-13205636 ] Jimmy Xiang commented on HBASE-5376: I was thinking of using YCSB to load lots of data while setting the region size small, so that lots of region splits will be triggered. How is that? Add more logging to triage HBASE-5312: Closed parent region present in Hlog.lastSeqWritten -- Key: HBASE-5376 URL: https://issues.apache.org/jira/browse/HBASE-5376 Project: HBase Issue Type: Sub-task Reporter: Jimmy Xiang Priority: Minor Fix For: 0.90.7 It is hard to find out what exactly caused HBASE-5312. Some logging will be helpful to shed some light. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
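For that kind of run, the split threshold is the main knob; a sketch with arbitrary values:
{code}
// Force frequent splits by shrinking regions (32 MB here, arbitrarily).
Configuration conf = HBaseConfiguration.create();
conf.setLong("hbase.hregion.max.filesize", 32 * 1024 * 1024L);
// Smaller memstore flushes fill store files faster, so splits come sooner.
conf.setLong("hbase.hregion.memstore.flush.size", 8 * 1024 * 1024L);
{code}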
[jira] [Commented] (HBASE-5327) Print a message when an invalid hbase.rootdir is passed
[ https://issues.apache.org/jira/browse/HBASE-5327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205778#comment-13205778 ] Jimmy Xiang commented on HBASE-5327: I looked into it. For new Path(path), the path doesn't have to be a complete and valid path. It could be a relative path, so it can't be validated. new Path(parent, child) takes two paths to form a new one (a String is converted to a Path implicitly). If parent = hdfs://localhost:999 and child = /test, the new path will be hdfs://localhost:999/test, which is valid, and all are happy. However, if child = test, combining the two into a URI yields hdfs://localhost:999test, which is invalid. That's the reason for the URISyntaxException. The v2 patch doesn't look good, but I am ok with it. Print a message when an invalid hbase.rootdir is passed --- Key: HBASE-5327 URL: https://issues.apache.org/jira/browse/HBASE-5327 Project: HBase Issue Type: Bug Affects Versions: 0.90.5 Reporter: Jean-Daniel Cryans Assignee: Jimmy Xiang Fix For: 0.94.0, 0.90.7, 0.92.1 Attachments: hbase-5327.txt, hbase-5327_v2.txt As seen on the mailing list: http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/24124 If hbase.rootdir doesn't specify a folder on hdfs we crash while opening a path to .oldlogs:
{noformat}
2012-02-02 23:07:26,292 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs://sv4r11s38:9100.oldlogs
 at org.apache.hadoop.fs.Path.initialize(Path.java:148)
 at org.apache.hadoop.fs.Path.<init>(Path.java:71)
 at org.apache.hadoop.fs.Path.<init>(Path.java:50)
 at org.apache.hadoop.hbase.master.MasterFileSystem.<init>(MasterFileSystem.java:112)
 at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:448)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:326)
 at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: hdfs://sv4r11s38:9100.oldlogs
 at java.net.URI.checkPath(URI.java:1787)
 at java.net.URI.<init>(URI.java:735)
 at org.apache.hadoop.fs.Path.initialize(Path.java:145)
 ... 6 more
{noformat}
It could also crash anywhere else, this just happens to be the first place we use hbase.rootdir. We need to verify that it's an actual folder. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
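A minimal reproduction of the behavior described in the comment (host and port are placeholders):
{code}
import org.apache.hadoop.fs.Path;

public class PathRepro {
  public static void main(String[] args) {
    Path parent = new Path("hdfs://localhost:999");
    System.out.println(new Path(parent, "/test"));  // hdfs://localhost:999/test -- valid
    // Throws IllegalArgumentException wrapping URISyntaxException
    // ("Relative path in absolute URI"), as in the stack trace above:
    System.out.println(new Path(parent, "test"));
  }
}
{code}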
[jira] [Commented] (HBASE-5327) Print a message when an invalid hbase.rootdir is passed
[ https://issues.apache.org/jira/browse/HBASE-5327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13204714#comment-13204714 ] Jimmy Xiang commented on HBASE-5327: This patch fixes two issues: (1) it checks the root dir to make sure it is valid before generating the old log dir, so that it can give a meaningful error message; (2) it makes sure the root dir is a directory instead of a file. If it is a file, the master will hang and try to create the version file forever. @Jon, I added some actionable log messages. Print a message when an invalid hbase.rootdir is passed --- Key: HBASE-5327 URL: https://issues.apache.org/jira/browse/HBASE-5327 Project: HBase Issue Type: Bug Affects Versions: 0.90.5 Reporter: Jean-Daniel Cryans Assignee: Jimmy Xiang Fix For: 0.94.0, 0.90.7, 0.92.1 Attachments: hbase-5327.txt, hbase-5327_v2.txt As seen on the mailing list: http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/24124 If hbase.rootdir doesn't specify a folder on hdfs we crash while opening a path to .oldlogs:
{noformat}
2012-02-02 23:07:26,292 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs://sv4r11s38:9100.oldlogs
 at org.apache.hadoop.fs.Path.initialize(Path.java:148)
 at org.apache.hadoop.fs.Path.<init>(Path.java:71)
 at org.apache.hadoop.fs.Path.<init>(Path.java:50)
 at org.apache.hadoop.hbase.master.MasterFileSystem.<init>(MasterFileSystem.java:112)
 at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:448)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:326)
 at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: hdfs://sv4r11s38:9100.oldlogs
 at java.net.URI.checkPath(URI.java:1787)
 at java.net.URI.<init>(URI.java:735)
 at org.apache.hadoop.fs.Path.initialize(Path.java:145)
 ... 6 more
{noformat}
It could also crash anywhere else, this just happens to be the first place we use hbase.rootdir. We need to verify that it's an actual folder. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
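A sketch of the kind of check described in (2); the wiring and the message are invented, not taken from the patch:
{code}
// Fail fast with an actionable message instead of hanging forever.
FileSystem fs = rootDir.getFileSystem(conf);
if (fs.exists(rootDir) && !fs.getFileStatus(rootDir).isDir()) {
  throw new IOException("hbase.rootdir " + rootDir
      + " exists but is not a directory; please fix it in your configuration");
}
{code}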
[jira] [Commented] (HBASE-5327) Print a message when an invalid hbase.rootdir is passed
[ https://issues.apache.org/jira/browse/HBASE-5327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205064#comment-13205064 ] Jimmy Xiang commented on HBASE-5327: I prefer the first version actually. If the root dir is invalid, HDFS will throw an IAE. That's how we know a path is an invalid HDFS path. Print a message when an invalid hbase.rootdir is passed --- Key: HBASE-5327 URL: https://issues.apache.org/jira/browse/HBASE-5327 Project: HBase Issue Type: Bug Affects Versions: 0.90.5 Reporter: Jean-Daniel Cryans Assignee: Jimmy Xiang Fix For: 0.94.0, 0.90.7, 0.92.1 Attachments: hbase-5327.txt, hbase-5327_v2.txt As seen on the mailing list: http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/24124 If hbase.rootdir doesn't specify a folder on hdfs we crash while opening a path to .oldlogs:
{noformat}
2012-02-02 23:07:26,292 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: hdfs://sv4r11s38:9100.oldlogs
 at org.apache.hadoop.fs.Path.initialize(Path.java:148)
 at org.apache.hadoop.fs.Path.<init>(Path.java:71)
 at org.apache.hadoop.fs.Path.<init>(Path.java:50)
 at org.apache.hadoop.hbase.master.MasterFileSystem.<init>(MasterFileSystem.java:112)
 at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:448)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:326)
 at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: hdfs://sv4r11s38:9100.oldlogs
 at java.net.URI.checkPath(URI.java:1787)
 at java.net.URI.<init>(URI.java:735)
 at org.apache.hadoop.fs.Path.initialize(Path.java:145)
 ... 6 more
{noformat}
It could also crash anywhere else, this just happens to be the first place we use hbase.rootdir. We need to verify that it's an actual folder. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5312) Closed parent region present in Hlog.lastSeqWritten
[ https://issues.apache.org/jira/browse/HBASE-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13205214#comment-13205214 ] Jimmy Xiang commented on HBASE-5312: Has anyone seen this issue on the 0.92 release? Could we add some logging so that we will have some clue when it happens again? Closed parent region present in Hlog.lastSeqWritten --- Key: HBASE-5312 URL: https://issues.apache.org/jira/browse/HBASE-5312 Project: HBase Issue Type: Bug Affects Versions: 0.90.5 Reporter: ramkrishna.s.vasudevan Fix For: 0.90.7 This is in reference to the mail sent to the dev mailing list, Closed parent region present in Hlog.lastSeqWritten. The scenario described is: We had a region that was split into two daughters. When the hlog roll tried to flush the region, there was an entry in HLog.lastSeqWritten that was not flushed or removed from lastSeqWritten during the parent close. Because this flush was not happening, subsequent flushes were getting blocked
{code}
05:06:44,422 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=122, maxlogs=32; forcing flush of 1 regions(s): 2acaf8e3acfd2e8a5825a1f6f0aca4a8
05:06:44,422 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 2acaf8e3acfd2e8a5825a1f6f0aca4a8r=null, requester=null
05:10:48,666 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=123, maxlogs=32; forcing flush of 1 regions(s): 2acaf8e3acfd2e8a5825a1f6f0aca4a8
05:10:48,666 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 2acaf8e3acfd2e8a5825a1f6f0aca4a8r=null, requester=null
05:14:46,075 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=124, maxlogs=32; forcing flush of 1 regions(s): 2acaf8e3acfd2e8a5825a1f6f0aca4a8
05:14:46,075 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 2acaf8e3acfd2e8a5825a1f6f0aca4a8r=null, requester=null
05:15:41,584 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=125, maxlogs=32; forcing flush of 1 regions(s): 2acaf8e3acfd2e8a5825a1f6f0aca4a8
05:15:41,584 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of 2acaf8e3acfd2e8a5825a1f6f0aca4a8r=null,
{code}
Let's see what happened for the region 2acaf8e3acfd2e8a5825a1f6f0aca4a8
{code}
2012-01-06 00:30:55,214 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://192.168.1.103:9000/hbase/Htable_UFDR_031/2acaf8e3acfd2e8a5825a1f6f0aca4a8/.tmp/1755862026714756815 to hdfs://192.168.1.103:9000/hbase/Htable_UFDR_031/2acaf8e3acfd2e8a5825a1f6f0aca4a8/value/973789709483406123
2012-01-06 00:30:58,946 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated Htable_UFDR_016,049790700093168-0456520,1325809837958.0ebe5bd7fcbc09ee074d5600b9d4e062.
2012-01-06 00:30:59,614 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://192.168.1.103:9000/hbase/Htable_UFDR_031/2acaf8e3acfd2e8a5825a1f6f0aca4a8/value/973789709483406123, entries=7537, sequenceid=20312223, memsize=4.2m, filesize=2.9m
2012-01-06 00:30:59,787 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting, commencing flushing stores
2012-01-06 00:30:59,787 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~133.5m for region Htable_UFDR_031,00332,1325808823997.2acaf8e3acfd2e8a5825a1f6f0aca4a8.
in 21816ms, sequenceid=20312223, compaction requested=true
2012-01-06 00:30:59,787 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for Htable_UFDR_031,00332,1325808823997.2acaf8e3acfd2e8a5825a1f6f0aca4a8. because regionserver20020.cacheFlusher; priority=0, compaction queue size=5840
{code}
A user-triggered split has been issued to this region, which can be seen in the above logs. The flushing of this region has resulted in seq id 20312223. The region has been split and the parent region has been closed
{code}
00:31:12,607 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region Htable_UFDR_031,00332,1325808823997.2acaf8e3acfd2e8a5825a1f6f0aca4a8.
00:31:13,694 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing Htable_UFDR_031,00332,1325808823997.2acaf8e3acfd2e8a5825a1f6f0aca4a8.: disabling compactions flushes
00:31:13,694 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region Htable_UFDR_031,00332,1325808823997.2acaf8e3acfd2e8a5825a1f6f0aca4a8.
00:31:13,718 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed Htable_UFDR_031,00332,1325808823997.2acaf8e3acfd2e8a5825a1f6f0aca4a8.
00:31:39,552 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined
[jira] [Commented] (HBASE-5221) bin/hbase script doesn't look for Hadoop jars in the right place in trunk layout
[ https://issues.apache.org/jira/browse/HBASE-5221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203675#comment-13203675 ] Jimmy Xiang commented on HBASE-5221: In 0.23, Hadoop re-organized the folder structure. The jars now live under the individual modules like hdfs, mapreduce, util and so on (under share/hadoop). The common one is under share/hadoop/common. I am not very clear about the story behind it either. Todd should know this much better. bin/hbase script doesn't look for Hadoop jars in the right place in trunk layout Key: HBASE-5221 URL: https://issues.apache.org/jira/browse/HBASE-5221 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Todd Lipcon Assignee: Jimmy Xiang Attachments: hbase-5221.txt Running against an 0.24.0-SNAPSHOT hadoop:
ls: cannot access /home/todd/ha-demo/hadoop-0.24.0-SNAPSHOT/hadoop-common*.jar: No such file or directory
ls: cannot access /home/todd/ha-demo/hadoop-0.24.0-SNAPSHOT/hadoop-hdfs*.jar: No such file or directory
ls: cannot access /home/todd/ha-demo/hadoop-0.24.0-SNAPSHOT/hadoop-mapred*.jar: No such file or directory
The jars are rooted deeper in the hierarchy. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5221) bin/hbase script doesn't look for Hadoop jars in the right place in trunk layout
[ https://issues.apache.org/jira/browse/HBASE-5221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203807#comment-13203807 ] Jimmy Xiang commented on HBASE-5221: The problem is that when I run the hbase shell, it complains that those files are missing, and throws ClassNotFound for org.apache.hadoop.util.PlatformName. We need to fix it. The script is already looking under the Hadoop installation tree, just in the wrong place. I don't think this fix will break anything. We can use it until HBASE-5286 is resolved. bin/hbase script doesn't look for Hadoop jars in the right place in trunk layout Key: HBASE-5221 URL: https://issues.apache.org/jira/browse/HBASE-5221 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Todd Lipcon Assignee: Jimmy Xiang Fix For: 0.94.0 Attachments: hbase-5221.txt Running against an 0.24.0-SNAPSHOT hadoop:
ls: cannot access /home/todd/ha-demo/hadoop-0.24.0-SNAPSHOT/hadoop-common*.jar: No such file or directory
ls: cannot access /home/todd/ha-demo/hadoop-0.24.0-SNAPSHOT/hadoop-hdfs*.jar: No such file or directory
ls: cannot access /home/todd/ha-demo/hadoop-0.24.0-SNAPSHOT/hadoop-mapred*.jar: No such file or directory
The jars are rooted deeper in the hierarchy. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5221) bin/hbase script doesn't look for Hadoop jars in the right place in trunk layout
[ https://issues.apache.org/jira/browse/HBASE-5221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203810#comment-13203810 ] Jimmy Xiang commented on HBASE-5221: Ok, let me close it as a dup. I'll probably just use the fix myself until HBASE-5286 is resolved. bin/hbase script doesn't look for Hadoop jars in the right place in trunk layout Key: HBASE-5221 URL: https://issues.apache.org/jira/browse/HBASE-5221 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Todd Lipcon Assignee: Jimmy Xiang Fix For: 0.94.0 Attachments: hbase-5221.txt Running against an 0.24.0-SNAPSHOT hadoop:
ls: cannot access /home/todd/ha-demo/hadoop-0.24.0-SNAPSHOT/hadoop-common*.jar: No such file or directory
ls: cannot access /home/todd/ha-demo/hadoop-0.24.0-SNAPSHOT/hadoop-hdfs*.jar: No such file or directory
ls: cannot access /home/todd/ha-demo/hadoop-0.24.0-SNAPSHOT/hadoop-mapred*.jar: No such file or directory
The jars are rooted deeper in the hierarchy. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5353) HA/Distributed HMaster via RegionServers
[ https://issues.apache.org/jira/browse/HBASE-5353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203909#comment-13203909 ] Jimmy Xiang commented on HBASE-5353: Another option is not to have a master at all: every region server can do the work a master currently does, using ZK to coordinate. For example, once a region server dies, all the other region servers know about it and all try to run the dead-server cleanup, but only one will actually do it. The drawback here is too much ZK interaction. HA/Distributed HMaster via RegionServers Key: HBASE-5353 URL: https://issues.apache.org/jira/browse/HBASE-5353 Project: HBase Issue Type: Improvement Components: master, regionserver Affects Versions: 0.94.0 Reporter: Jesse Yates Priority: Minor Currently, the HMaster node must be considered a 'special' node (single point of failure), meaning that the node must be protected more than the other commodity machines. It should be possible to instead have the HMaster be much more available, either in a distributed sense (meaning a big rewrite) or with multiple instances and automatic failover. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
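The "all try, one wins" step maps directly onto ZooKeeper's create semantics; a sketch (the znode layout is invented):
{code}
// Every region server races to create the same ephemeral znode for the dead
// server; ZooKeeper guarantees exactly one create() succeeds.
boolean claimCleanup(ZooKeeper zk, String deadServerName) throws Exception {
  String path = "/hbase/cleanup/" + deadServerName;  // hypothetical layout
  try {
    zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    return true;   // we won: run the dead-server cleanup
  } catch (KeeperException.NodeExistsException e) {
    return false;  // another region server is already on it
  }
}
{code}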
[jira] [Commented] (HBASE-5317) Fix TestHFileOutputFormat to work against hadoop 0.23
[ https://issues.apache.org/jira/browse/HBASE-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200575#comment-13200575 ] Jimmy Xiang commented on HBASE-5317: @Ted, did you run it with Hadoop 0.23? Fix TestHFileOutputFormat to work against hadoop 0.23 - Key: HBASE-5317 URL: https://issues.apache.org/jira/browse/HBASE-5317 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.94.0, 0.92.0 Reporter: Gregory Chanan Assignee: Gregory Chanan Attachments: HBASE-5317-v0.patch Running mvn -Dhadoop.profile=23 test -P localTests -Dtest=org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat yields this on 0.92: Failed tests: testColumnFamilyCompression(org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat): HFile for column family info-A not found Tests in error: test_TIMERANGE(org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat): /home/gchanan/workspace/apache92/target/test-data/276cbd0c-c771-4f81-9ba8-c464c9dd7486/test_TIMERANGE_present/_temporary/0/_temporary/_attempt_200707121733_0001_m_00_0 (Is a directory) testMRIncrementalLoad(org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat): TestTable testMRIncrementalLoadWithSplit(org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat): TestTable It looks like on trunk, this also results in an error: testExcludeMinorCompaction(org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat): TestTable I have a patch that fixes testColumnFamilyCompression and test_TIMERANGE, but haven't fixed the other 3 yet. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5310) HConnectionManager server cache key enhancement
[ https://issues.apache.org/jira/browse/HBASE-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197984#comment-13197984 ] Jimmy Xiang commented on HBASE-5310: @Ted, thanks for the review and integration. HConnectionManager server cache key enhancement --- Key: HBASE-5310 URL: https://issues.apache.org/jira/browse/HBASE-5310 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor Fix For: 0.94.0 Attachments: hbase-5310.txt HConnectionManager uses the deprecated HServerAddress to create the server cache key, which needs to resolve the address every time. It would be better to use HRegionLocation.getHostnamePort() instead. In our cluster we have some DNS issues; resolving an address sometimes fails, which kills the application, since HServerAddress.getResolvedAddress throws a runtime IllegalArgumentException. This change will fix this issue as well. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
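The gist of the change as the description tells it: key the cache on the unresolved hostname:port string instead of a resolved address (field names invented):
{code}
// Before: servers.get(address.getResolvedAddress()) -- a DNS lookup per call,
// and an IllegalArgumentException if resolution fails.
// After: a plain string key, no resolution involved.
String cacheKey = location.getHostnamePort();  // e.g. "rs1.example.com:60020"
HRegionInterface server = servers.get(cacheKey);
{code}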
[jira] [Commented] (HBASE-5281) Should a failure in creating an unassigned node abort the master?
[ https://issues.apache.org/jira/browse/HBASE-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197440#comment-13197440 ] Jimmy Xiang commented on HBASE-5281: I think it is safer to retry a certain number of times before aborting. Should a failure in creating an unassigned node abort the master? - Key: HBASE-5281 URL: https://issues.apache.org/jira/browse/HBASE-5281 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.5 Reporter: Harsh J Assignee: Harsh J Fix For: 0.94.0, 0.92.1 Attachments: HBASE-5281.patch In {{AssignmentManager}}'s {{CreateUnassignedAsyncCallback}}, we have the following condition:
{code}
if (rc != 0) {
  // This is the result code. If non-zero, need to resubmit.
  LOG.warn("rc != 0 for " + path + " -- retryable connectionloss -- " +
    "FIX see http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A2");
  this.zkw.abort("Connectionloss writing unassigned at " + path + ", rc=" + rc, null);
  return;
}
{code}
While a similar structure inside {{ExistsUnassignedAsyncCallback}} (which the above is linked to) does not have such a forced abort. Do we really require the abort statement here, or can we make do without? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
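A sketch of the retry-then-abort shape suggested above (the retry count and the resubmit hook are invented):
{code}
if (rc != 0) {
  if (retries.incrementAndGet() <= MAX_RETRIES) {  // e.g. MAX_RETRIES = 3
    LOG.warn("rc=" + rc + " for " + path + ", resubmitting, attempt " + retries);
    resubmit(path);  // hypothetical: re-issue the async ZK call
    return;
  }
  this.zkw.abort("Giving up writing unassigned at " + path
      + " after " + MAX_RETRIES + " retries, rc=" + rc, null);
}
{code}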
[jira] [Commented] (HBASE-5281) Should a failure in creating an unassigned node abort the master?
[ https://issues.apache.org/jira/browse/HBASE-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197472#comment-13197472 ] Jimmy Xiang commented on HBASE-5281: The issue Harsh reported is from a customer using CDH3u2, which doesn't have the recoverable ZooKeeper feature. I think the recoverable ZooKeeper feature in 0.92.0 should have fixed this issue. Should a failure in creating an unassigned node abort the master? - Key: HBASE-5281 URL: https://issues.apache.org/jira/browse/HBASE-5281 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.5 Reporter: Harsh J Assignee: Harsh J Fix For: 0.94.0, 0.92.1 Attachments: HBASE-5281.patch In {{AssignmentManager}}'s {{CreateUnassignedAsyncCallback}}, we have the following condition:
{code}
if (rc != 0) {
  // This is the result code. If non-zero, need to resubmit.
  LOG.warn("rc != 0 for " + path + " -- retryable connectionloss -- " +
    "FIX see http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A2");
  this.zkw.abort("Connectionloss writing unassigned at " + path + ", rc=" + rc, null);
  return;
}
{code}
While a similar structure inside {{ExistsUnassignedAsyncCallback}} (which the above is linked to) does not have such a forced abort. Do we really require the abort statement here, or can we make do without? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5210) HFiles are missing from an incremental load
[ https://issues.apache.org/jira/browse/HBASE-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13191310#comment-13191310 ] Jimmy Xiang commented on HBASE-5210: Any fix in getRandomFilename will just reduce the chance of a file name collision. Since this is a rare case, I think it may be better to just fail the task if it fails to commit the files in moveTaskOutputs(), without overwriting the existing files. In HDFS 0.23, rename() takes an option not to overwrite. With Hadoop 0.20, we can just do our best to check for any conflicts before committing the files. HFiles are missing from an incremental load --- Key: HBASE-5210 URL: https://issues.apache.org/jira/browse/HBASE-5210 Project: HBase Issue Type: Bug Components: mapreduce Affects Versions: 0.90.2 Environment: HBase 0.90.2 with Hadoop-0.20.2 (with durable sync). RHEL 2.6.18-164.15.1.el5. 4 node cluster (1 master, 3 slaves) Reporter: Lawrence Simpson Attachments: HBASE-5210-crazy-new-getRandomFilename.patch We run an overnight map/reduce job that loads data from an external source and adds that data to an existing HBase table. The input files have been loaded into hdfs. The map/reduce job uses the HFileOutputFormat (and the TotalOrderPartitioner) to create HFiles which are subsequently added to the HBase table. On at least two separate occasions (that we know of), a range of output would be missing for a given day. The range of keys for the missing values corresponded to those of a particular region. This implied that a complete HFile somehow went missing from the job. Further investigation revealed the following: Two different reducers (running in separate JVMs and thus separate class loaders) in the same server can end up using the same file names for their HFiles. The scenario is as follows:
1. Both reducers start near the same time.
2. The first reducer reaches the point where it wants to write its first file.
3. It uses the StoreFile class which contains a static Random object which is initialized by default using a timestamp.
4. The file name is generated using the random number generator.
5. The file name is checked against other existing files.
6. The file is written into temporary files in a directory named after the reducer attempt.
7. The second reduce task reaches the same point, but its StoreFile class (which is now in the file system's cache) gets loaded within the time resolution of the OS and thus initializes its Random() object with the same seed as the first task.
8. The second task also checks for an existing file with the name generated by the random number generator and finds no conflict because each task is writing files in its own temporary folder.
9. The first task finishes and gets its temporary files committed to the real folder specified for output of the HFiles.
10. The second task then reaches its own conclusion and commits its files (moveTaskOutputs). The released Hadoop code just overwrites any files with the same name. No warning messages or anything. The first task's HFiles just go missing.
Note: The reducers here are NOT different attempts at the same reduce task. They are different reduce tasks so data is really lost. I am currently testing a fix in which I have added code to the Hadoop FileOutputCommitter.moveTaskOutputs method to check for a conflict with an existing file in the final output folder and to rename the HFile if needed. This may not be appropriate for all uses of FileOutputFormat.
So I have put this into a new class which is then used by a subclass of HFileOutputFormat. Subclassing of FileOutputCommitter itself was a bit more of a problem due to private declarations. I don't know if my approach is the best fix for the problem. If someone more knowledgeable than myself deems that it is, I will be happy to share what I have done and by that time I may have some information on the results. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
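A sketch of the best-effort conflict check described for 0.20-era Hadoop (the method shape is invented; note that exists()+rename() is still racy, hence only best-effort, while on 0.23 a non-overwriting rename() expresses the same intent atomically):
{code}
// Refuse to clobber a previously committed HFile; fail the task instead.
void commitFile(FileSystem fs, Path taskOutput, Path finalOutput) throws IOException {
  if (fs.exists(finalOutput)) {
    throw new IOException("Target " + finalOutput
        + " already exists; failing task rather than overwriting another task's HFile");
  }
  if (!fs.rename(taskOutput, finalOutput)) {
    throw new IOException("Failed to commit " + taskOutput + " to " + finalOutput);
  }
}
{code}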
[jira] [Commented] (HBASE-5210) HFiles are missing from an incremental load
[ https://issues.apache.org/jira/browse/HBASE-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13191351#comment-13191351 ] Jimmy Xiang commented on HBASE-5210: I like this one. It's really simple and clean. HFiles are missing from an incremental load --- Key: HBASE-5210 URL: https://issues.apache.org/jira/browse/HBASE-5210 Project: HBase Issue Type: Bug Components: mapreduce Affects Versions: 0.90.2 Environment: HBase 0.90.2 with Hadoop-0.20.2 (with durable sync). RHEL 2.6.18-164.15.1.el5. 4 node cluster (1 master, 3 slaves) Reporter: Lawrence Simpson Attachments: HBASE-5210-crazy-new-getRandomFilename.patch We run an overnight map/reduce job that loads data from an external source and adds that data to an existing HBase table. The input files have been loaded into hdfs. The map/reduce job uses the HFileOutputFormat (and the TotalOrderPartitioner) to create HFiles which are subsequently added to the HBase table. On at least two separate occasions (that we know of), a range of output would be missing for a given day. The range of keys for the missing values corresponded to those of a particular region. This implied that a complete HFile somehow went missing from the job. Further investigation revealed the following: * Two different reducers (running in separate JVMs and thus separate class loaders) * in the same server can end up using the same file names for their * HFiles. The scenario is as follows: *1. Both reducers start near the same time. *2. The first reducer reaches the point where it wants to write its first file. *3. It uses the StoreFile class which contains a static Random object *which is initialized by default using a timestamp. *4. The file name is generated using the random number generator. *5. The file name is checked against other existing files. *6. The file is written into temporary files in a directory named *after the reducer attempt. *7. The second reduce task reaches the same point, but its StoreClass *(which is now in the file system's cache) gets loaded within the *time resolution of the OS and thus initializes its Random() *object with the same seed as the first task. *8. The second task also checks for an existing file with the name *generated by the random number generator and finds no conflict *because each task is writing files in its own temporary folder. *9. The first task finishes and gets its temporary files committed *to the real folder specified for output of the HFiles. * 10.The second task then reaches its own conclusion and commits its *files (moveTaskOutputs). The released Hadoop code just overwrites *any files with the same name. No warning messages or anything. *The first task's HFiles just go missing. * * Note: The reducers here are NOT different attempts at the same *reduce task. They are different reduce tasks so data is *really lost. I am currently testing a fix in which I have added code to the Hadoop FileOutputCommitter.moveTaskOutputs method to check for a conflict with an existing file in the final output folder and to rename the HFile if needed. This may not be appropriate for all uses of FileOutputFormat. So I have put this into a new class which is then used by a subclass of HFileOutputFormat. Subclassing of FileOutputCommitter itself was a bit more of a problem due to private declarations. I don't know if my approach is the best fix for the problem. 
If someone more knowledgeable than I am deems that it is, I will be happy to share what I have done, and by then I may have some information on the results. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
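The seed collision at the heart of this report is easy to reproduce in isolation. Below is a minimal, hypothetical sketch (not the HBase code): it stands in for two reducer JVMs whose StoreFile-style static Random objects get seeded from timestamps that fall within the same clock tick, so both produce the identical sequence of "random" file names.
{code}
import java.util.Random;

public class SeedCollisionDemo {
  // Stand-in for generating a random HFile name from a static Random.
  static String randomFilename(Random rng) {
    return Long.toHexString(rng.nextLong());
  }

  public static void main(String[] args) {
    // Two JVMs loading the class within the OS timer resolution end up
    // seeding their static Random with the same timestamp.
    long seed = System.currentTimeMillis();
    Random reducerA = new Random(seed);
    Random reducerB = new Random(seed);

    // Identical seeds mean identical "unique" file names, so the second
    // task's commit silently overwrites the first task's HFile.
    System.out.println(randomFilename(reducerA));
    System.out.println(randomFilename(reducerB)); // same name printed again
  }
}
{code}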
[jira] [Commented] (HBASE-5196) Failure in region split after PONR could cause region hole
[ https://issues.apache.org/jira/browse/HBASE-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187816#comment-13187816 ] Jimmy Xiang commented on HBASE-5196: Yes, the test suite on 0.90 with the patch passed. Failure in region split after PONR could cause region hole -- Key: HBASE-5196 URL: https://issues.apache.org/jira/browse/HBASE-5196 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5196-v2.txt, hbase-5196_0.90.txt If a region split fails after the PONR, it relies on the master's ServerShutdown handler to fix it. However, if the master doesn't get a chance to fix it, there will be a hole in the region chain. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5196) Failure in region split after PONR could cause region hole
[ https://issues.apache.org/jira/browse/HBASE-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187318#comment-13187318 ] Jimmy Xiang commented on HBASE-5196: I attached a patch for the 0.90 branch: hbase-5196_0.90.txt Could anyone please check it in? Failure in region split after PONR could cause region hole -- Key: HBASE-5196 URL: https://issues.apache.org/jira/browse/HBASE-5196 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5196-v2.txt, hbase-5196_0.90.txt If a region split fails after the PONR, it relies on the master's ServerShutdown handler to fix it. However, if the master doesn't get a chance to fix it, there will be a hole in the region chain. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5196) Failure in region split after PONR could cause region hole
[ https://issues.apache.org/jira/browse/HBASE-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187332#comment-13187332 ] Jimmy Xiang commented on HBASE-5196: @Ted, I ran the test suite and verified the fix on CDH3u3. Let me run the test suite on 0.90 now. Failure in region split after PONR could cause region hole -- Key: HBASE-5196 URL: https://issues.apache.org/jira/browse/HBASE-5196 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5196-v2.txt, hbase-5196_0.90.txt If a region split fails after the PONR, it relies on the master's ServerShutdown handler to fix it. However, if the master doesn't get a chance to fix it, there will be a hole in the region chain. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5136) Redundant MonitoredTask instances in case of distributed log splitting retry
[ https://issues.apache.org/jira/browse/HBASE-5136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185688#comment-13185688 ] Jimmy Xiang commented on HBASE-5136: Instead of reusing the same status object, can we abort the original one?
{code}
waitForSplittingCompletion(batch, status);
if (batch.done != batch.installed) {
  batch.isDead = true;
  tot_mgr_log_split_batch_err.incrementAndGet();
  LOG.warn("error while splitting logs in " + logDirs
    + " installed = " + batch.installed + " but only " + batch.done + " done");
  // <== update the status message and abort it here
  throw new IOException("error or interrupt while splitting logs in "
    + logDirs + " Task = " + batch);
}
{code}
Redundant MonitoredTask instances in case of distributed log splitting retry Key: HBASE-5136 URL: https://issues.apache.org/jira/browse/HBASE-5136 Project: HBase Issue Type: Task Reporter: Zhihong Yu Assignee: Zhihong Yu Attachments: 5136.txt In case of a log splitting retry, the following code would be executed multiple times:
{code}
public long splitLogDistributed(final List<Path> logDirs) throws IOException {
  MonitoredTask status = TaskMonitor.get().createStatus(
    "Doing distributed log split in " + logDirs);
{code}
leading to multiple MonitoredTask instances. Users may get confused by multiple distributed log splitting entries for the same region server on the master UI. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
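As a rough sketch of the suggestion above (reusing the names from the snippet; whether abort() is the right terminal state for the MonitoredTask here is an assumption):
{code}
if (batch.done != batch.installed) {
  batch.isDead = true;
  tot_mgr_log_split_batch_err.incrementAndGet();
  String msg = "error while splitting logs in " + logDirs
    + " installed = " + batch.installed + " but only " + batch.done + " done";
  LOG.warn(msg);
  // Terminate this MonitoredTask instead of leaving it dangling, so the
  // retry's new createStatus() call doesn't pile up duplicate entries
  // on the master UI.
  status.abort(msg);
  throw new IOException("error or interrupt while splitting logs in "
    + logDirs + " Task = " + batch);
}
{code}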
[jira] [Commented] (HBASE-5174) Coalesce aborted tasks in the TaskMonitor
[ https://issues.apache.org/jira/browse/HBASE-5174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185694#comment-13185694 ] Jimmy Xiang commented on HBASE-5174: Failed or aborted tasks should not be displayed after the retry has succeeded. Otherwise, won't it cause confusion? Coalesce aborted tasks in the TaskMonitor - Key: HBASE-5174 URL: https://issues.apache.org/jira/browse/HBASE-5174 Project: HBase Issue Type: Improvement Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Fix For: 0.94.0, 0.92.1 Some tasks can get repeatedly canceled, like flushing when splitting is going on; in the logs it looks like this:
{noformat}
2012-01-10 19:28:29,164 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap pressure
2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, writesEnabled=false
2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=1.6g
2012-01-10 19:28:29,164 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap pressure
2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, writesEnabled=false
2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=1.6g
2012-01-10 19:28:29,164 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap pressure
2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, writesEnabled=false
{noformat}
But in the TaskMonitor UI you'll get MAX_TASKS (1000) displayed on top of the regions. Basically 1000x:
{noformat}
Tue Jan 10 19:28:29 UTC 2012 Flushing test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. ABORTED (since 31sec ago) Not flushing since writes not enabled (since 31sec ago)
{noformat}
It's ugly and I'm sure some users will freak out seeing this, plus you have to scroll down all the way to see your regions. Coalescing consecutive aborted tasks seems like a good solution. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5174) Coalesce aborted tasks in the TaskMonitor
[ https://issues.apache.org/jira/browse/HBASE-5174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185701#comment-13185701 ] Jimmy Xiang commented on HBASE-5174: I meant we cannot just show the failed or aborted tasks longer. We should also show the succeeded or retrying one as well, if it failed before and the failed task is still showing. Coalesce aborted tasks in the TaskMonitor - Key: HBASE-5174 URL: https://issues.apache.org/jira/browse/HBASE-5174 Project: HBase Issue Type: Improvement Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Fix For: 0.94.0, 0.92.1 Some tasks can get repeatedly canceled, like flushing when splitting is going on; in the logs it looks like this:
{noformat}
2012-01-10 19:28:29,164 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap pressure
2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, writesEnabled=false
2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=1.6g
2012-01-10 19:28:29,164 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap pressure
2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, writesEnabled=false
2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up because memory above low water=1.6g
2012-01-10 19:28:29,164 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. due to global heap pressure
2012-01-10 19:28:29,164 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c., flushing=false, writesEnabled=false
{noformat}
But in the TaskMonitor UI you'll get MAX_TASKS (1000) displayed on top of the regions. Basically 1000x:
{noformat}
Tue Jan 10 19:28:29 UTC 2012 Flushing test1,,1326223218996.3eea0d89af7b851c3a9b4246389a4f2c. ABORTED (since 31sec ago) Not flushing since writes not enabled (since 31sec ago)
{noformat}
It's ugly and I'm sure some users will freak out seeing this, plus you have to scroll down all the way to see your regions. Coalescing consecutive aborted tasks seems like a good solution. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
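For illustration, the coalescing could look something like the hypothetical sketch below (TaskMonitor's real data structures differ; this only shows collapsing consecutive aborted entries with the same description into one row with a repeat count):
{code}
import java.util.ArrayList;
import java.util.List;

public class TaskCoalescer {
  // Minimal stand-in for a monitored-task row on the master UI.
  record TaskRow(String description, String state, int count) {}

  static List<TaskRow> coalesce(List<TaskRow> rows) {
    List<TaskRow> out = new ArrayList<>();
    for (TaskRow row : rows) {
      TaskRow last = out.isEmpty() ? null : out.get(out.size() - 1);
      if (last != null && "ABORTED".equals(last.state())
          && "ABORTED".equals(row.state())
          && last.description().equals(row.description())) {
        // 1000 identical aborts render as one "x1000" row instead of
        // pushing the region list off the bottom of the page.
        out.set(out.size() - 1,
            new TaskRow(last.description(), last.state(), last.count() + 1));
      } else {
        out.add(row);
      }
    }
    return out;
  }
}
{code}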
[jira] [Commented] (HBASE-5196) Failure in region split after PONR could cause region hole
[ https://issues.apache.org/jira/browse/HBASE-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185760#comment-13185760 ] Jimmy Xiang commented on HBASE-5196: I have a simple fix. When the master starts up, fix up all the missing daughters, as the ServerShutdown handler does. Failure in region split after PONR could cause region hole -- Key: HBASE-5196 URL: https://issues.apache.org/jira/browse/HBASE-5196 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang If a region split fails after the PONR, it relies on the master's ServerShutdown handler to fix it. However, if the master doesn't get a chance to fix it, there will be a hole in the region chain. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5196) Failure in region split after PONR could cause region hole
[ https://issues.apache.org/jira/browse/HBASE-5196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13185856#comment-13185856 ] Jimmy Xiang commented on HBASE-5196: Yes, it is good. Thanks, Ted. These failed tests passed on my box. Failure in region split after PONR could cause region hole -- Key: HBASE-5196 URL: https://issues.apache.org/jira/browse/HBASE-5196 Project: HBase Issue Type: Bug Components: master, regionserver Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5196-v2.txt If a region split fails after the PONR, it relies on the master's ServerShutdown handler to fix it. However, if the master doesn't get a chance to fix it, there will be a hole in the region chain. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5150) Fail in a thread may not fail a test, clean up log splitting test
[ https://issues.apache.org/jira/browse/HBASE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184221#comment-13184221 ] Jimmy Xiang commented on HBASE-5150: Those failed tests passed on my local box. Fail in a thread may not fail a test, clean up log splitting test - Key: HBASE-5150 URL: https://issues.apache.org/jira/browse/HBASE-5150 Project: HBase Issue Type: Test Affects Versions: 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor Attachments: hbase-5150.txt, hbase_5150_v3.patch This is to clean up some tests for HBASE-5081. The Assert.fail method in a separate thread will terminate the thread, but may not fail the test. We can use a Callable so that we can get the error when getting the result. Some documentation to explain the test would be helpful too. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
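The Callable point is worth spelling out, since it applies to any multi-threaded test. A self-contained sketch (not the HBASE-5150 test itself): an AssertionError thrown inside a bare Thread dies with that thread and the test passes anyway, while a Callable submitted to an executor surfaces the same failure through Future.get().
{code}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ThreadFailureDemo {
  public static void main(String[] args) throws Exception {
    // The AssertionError kills only this thread; the test thread
    // never sees it, so the test does not fail.
    Thread t = new Thread(() -> { throw new AssertionError("lost failure"); });
    t.start();
    t.join();

    // With a Callable, the failure is captured by the Future and
    // rethrown, wrapped, when the result is read.
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Future<Void> result = pool.submit((Callable<Void>) () -> {
      throw new AssertionError("visible failure");
    });
    try {
      result.get();
    } catch (ExecutionException e) {
      System.out.println("caught: " + e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
{code}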
[jira] [Commented] (HBASE-5150) Fail in a thread may not fail a test, clean up log splitting test
[ https://issues.apache.org/jira/browse/HBASE-5150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13184234#comment-13184234 ] Jimmy Xiang commented on HBASE-5150: @Prakash and Ted, are you OK with this patch? I changed the 3-second wait time to 2 seconds. Fail in a thread may not fail a test, clean up log splitting test - Key: HBASE-5150 URL: https://issues.apache.org/jira/browse/HBASE-5150 Project: HBase Issue Type: Test Affects Versions: 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Priority: Minor Attachments: hbase-5150.txt, hbase_5150_v3.patch This is to clean up some tests for HBASE-5081. The Assert.fail method in a separate thread will terminate the thread, but may not fail the test. We can use a Callable so that we can get the error when getting the result. Some documentation to explain the test would be helpful too. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13181407#comment-13181407 ] Jimmy Xiang commented on HBASE-5081: It turns out all my region servers (RSs) died. I restarted them all, and things are looking better now. One folder is completed; two more to go. Distributed log splitting deleteNode races against splitLog retry -- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Prakash Khemani Fix For: 0.92.0 Attachments: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 5081-deleteNode-with-while-loop.txt, HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, distributed-log-splitting-screenshot.png, distributed_log_splitting_screen_shot2.png, distributed_log_splitting_screenshot3.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 RC testing, we found that distributed log splitting hangs there forever. Please see the attached screen shot. I looked into it, and here is what I think happened:
1. One RS died; the ServerShutdownHandler found it out and started the distributed log splitting;
2. All three tasks failed, so the three tasks were deleted, asynchronously;
3. The ServerShutdownHandler retried the log splitting;
4. During the retry, it created these three tasks again, and put them in a hashmap (tasks);
5. The asynchronous deletion in step 2 finally happened for one task; in the callback, it removed one task from the hashmap;
6. One of the newly submitted tasks' zookeeper watcher found that the task was unassigned, and it was not in the hashmap, so it created a new orphan task.
7. All three tasks failed, but the task created in step 6 was an orphan, so the batch.err counter was one short; the log splitting hangs there and keeps waiting for the last task to finish, which is never going to happen.
So I think the problem is step 2. The fix is to make the deletion sync, instead of async, so that the retry will have a clean start. An async deleteNode will mess up the split log retry. In an extreme situation, if the async deleteNode doesn't happen soon enough, some node created during the retry could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
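The sync-versus-async distinction being debated maps directly onto ZooKeeper's two delete() overloads. A minimal sketch (the path and version handling are illustrative, not SplitLogManager's actual znode layout):
{code}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class DeleteNodeSketch {
  // Async delete: returns immediately. The znode may still exist when a
  // retry re-creates a task with the same name, and the late callback
  // can then remove the *new* task's entry from the hashmap.
  static void asyncDelete(ZooKeeper zk, String path) {
    zk.delete(path, -1, (rc, p, ctx) -> { /* fires arbitrarily later */ }, null);
  }

  // Sync delete: does not return until the znode is gone, so the retry
  // starts from a clean slate.
  static void syncDelete(ZooKeeper zk, String path)
      throws KeeperException, InterruptedException {
    zk.delete(path, -1);
  }
}
{code}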
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13181451#comment-13181451 ] Jimmy Xiang commented on HBASE-5081: Now all the logs are split. I am happy with the patch. Distributed log splitting deleteNode races against splitLog retry -- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Prakash Khemani Fix For: 0.92.0 Attachments: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 5081-deleteNode-with-while-loop.txt, HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, distributed-log-splitting-screenshot.png, distributed_log_splitting_screen_shot2.png, distributed_log_splitting_screenshot3.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 RC testing, we found that distributed log splitting hangs there forever. Please see the attached screen shot. I looked into it, and here is what I think happened:
1. One RS died; the ServerShutdownHandler found it out and started the distributed log splitting;
2. All three tasks failed, so the three tasks were deleted, asynchronously;
3. The ServerShutdownHandler retried the log splitting;
4. During the retry, it created these three tasks again, and put them in a hashmap (tasks);
5. The asynchronous deletion in step 2 finally happened for one task; in the callback, it removed one task from the hashmap;
6. One of the newly submitted tasks' zookeeper watcher found that the task was unassigned, and it was not in the hashmap, so it created a new orphan task.
7. All three tasks failed, but the task created in step 6 was an orphan, so the batch.err counter was one short; the log splitting hangs there and keeps waiting for the last task to finish, which is never going to happen.
So I think the problem is step 2. The fix is to make the deletion sync, instead of async, so that the retry will have a clean start. An async deleteNode will mess up the split log retry. In an extreme situation, if the async deleteNode doesn't happen soon enough, some node created during the retry could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13181457#comment-13181457 ] Jimmy Xiang commented on HBASE-5081: @Stack, yes, it will screw up the cluster (7 nodes). Distributed log splitting deleteNode races against splitLog retry -- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Prakash Khemani Fix For: 0.92.0 Attachments: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 5081-deleteNode-with-while-loop.txt, HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, distributed-log-splitting-screenshot.png, distributed_log_splitting_screen_shot2.png, distributed_log_splitting_screenshot3.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 RC testing, we found that distributed log splitting hangs there forever. Please see the attached screen shot. I looked into it, and here is what I think happened:
1. One RS died; the ServerShutdownHandler found it out and started the distributed log splitting;
2. All three tasks failed, so the three tasks were deleted, asynchronously;
3. The ServerShutdownHandler retried the log splitting;
4. During the retry, it created these three tasks again, and put them in a hashmap (tasks);
5. The asynchronous deletion in step 2 finally happened for one task; in the callback, it removed one task from the hashmap;
6. One of the newly submitted tasks' zookeeper watcher found that the task was unassigned, and it was not in the hashmap, so it created a new orphan task.
7. All three tasks failed, but the task created in step 6 was an orphan, so the batch.err counter was one short; the log splitting hangs there and keeps waiting for the last task to finish, which is never going to happen.
So I think the problem is step 2. The fix is to make the deletion sync, instead of async, so that the retry will have a clean start. An async deleteNode will mess up the split log retry. In an extreme situation, if the async deleteNode doesn't happen soon enough, some node created during the retry could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13181087#comment-13181087 ] Jimmy Xiang commented on HBASE-5081: It hangs again. In the region server log, I saw a DFS issue. Let me restart the cluster. Hopefully, it will move on. Distributed log splitting deleteNode races against splitLog retry -- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Prakash Khemani Fix For: 0.92.0 Attachments: 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 0001-HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, 5081-deleteNode-with-while-loop.txt, HBASE-5081-jira-Distributed-log-splitting-deleteNode.patch, distributed-log-splitting-screenshot.png, distributed_log_splitting_screen_shot2.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 RC testing, we found that distributed log splitting hangs there forever. Please see the attached screen shot. I looked into it, and here is what I think happened:
1. One RS died; the ServerShutdownHandler found it out and started the distributed log splitting;
2. All three tasks failed, so the three tasks were deleted, asynchronously;
3. The ServerShutdownHandler retried the log splitting;
4. During the retry, it created these three tasks again, and put them in a hashmap (tasks);
5. The asynchronous deletion in step 2 finally happened for one task; in the callback, it removed one task from the hashmap;
6. One of the newly submitted tasks' zookeeper watcher found that the task was unassigned, and it was not in the hashmap, so it created a new orphan task.
7. All three tasks failed, but the task created in step 6 was an orphan, so the batch.err counter was one short; the log splitting hangs there and keeps waiting for the last task to finish, which is never going to happen.
So I think the problem is step 2. The fix is to make the deletion sync, instead of async, so that the retry will have a clean start. An async deleteNode will mess up the split log retry. In an extreme situation, if the async deleteNode doesn't happen soon enough, some node created during the retry could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region server the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177847#comment-13177847 ] Jimmy Xiang commented on HBASE-5099: TestReplication is flaky, but it works on my Ubuntu box. Let me take a look. ZK event thread waiting for root region assignment may block server shutdown handler for the region server the root region was on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch A RS died. The ServerShutdownHandler kicked in and started the log splitting. The SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued. At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new zookeeper connection was created. However, this master became the new master again. It tried to assign root and meta. Because the dead RS had the old root region, the master needs to wait for the log splitting to complete. This waiting holds the zookeeper event thread. So the async create-split-task is never retried, since there is only one event thread, which is waiting for the root region to be assigned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
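The deadlock pattern described, one event thread blocking on work that only that same thread can complete, can be demonstrated in a few lines. A hedged, generic sketch (nothing HBase-specific):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SingleThreadDeadlockDemo {
  public static void main(String[] args) throws Exception {
    ExecutorService eventThread = Executors.newSingleThreadExecutor();
    eventThread.submit(() -> {
      // Queue more work to the same single-threaded executor...
      Future<?> inner = eventThread.submit(
          () -> System.out.println("never runs"));
      // ...then block waiting for it: the only thread that could run
      // the inner task is the one waiting here. Deadlock.
      try {
        inner.get();
      } catch (Exception e) {
        Thread.currentThread().interrupt();
      }
    });
    // The program hangs; in the HBASE-5099 scenario the analogous wait
    // is the master blocking on root assignment from the ZK event thread.
  }
}
{code}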
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177860#comment-13177860 ] Jimmy Xiang commented on HBASE-5099: I tried to debug this test case, but it doesn't stop at the changes I made. ZK event thread waiting for root region assignment may block server shutdown handler for the region server the root region was on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch A RS died. The ServerShutdownHandler kicked in and started the log splitting. The SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued. At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new zookeeper connection was created. However, this master became the new master again. It tried to assign root and meta. Because the dead RS had the old root region, the master needs to wait for the log splitting to complete. This waiting holds the zookeeper event thread. So the async create-split-task is never retried, since there is only one event thread, which is waiting for the root region to be assigned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region assignment may block server shutdown handler for the region sever the root region was on
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177886#comment-13177886 ] Jimmy Xiang commented on HBASE-5099: TestReplication#queueFailover has a bug; that's why it is flaky: https://issues.apache.org/jira/browse/HBASE-5112 ZK event thread waiting for root region assignment may block server shutdown handler for the region server the root region was on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch A RS died. The ServerShutdownHandler kicked in and started the log splitting. The SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued. At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new zookeeper connection was created. However, this master became the new master again. It tried to assign root and meta. Because the dead RS had the old root region, the master needs to wait for the log splitting to complete. This waiting holds the zookeeper event thread. So the async create-split-task is never retried, since there is only one event thread, which is waiting for the root region to be assigned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5112) TestReplication#queueFailover flaky due to code error
[ https://issues.apache.org/jira/browse/HBASE-5112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177887#comment-13177887 ] Jimmy Xiang commented on HBASE-5112: @Ted, could you please give this patch a try on your MacBook? I could not reproduce the failure on my box. I looked into the code carefully, and this fix should make this test case no longer flaky. TestReplication#queueFailover flaky due to code error - Key: HBASE-5112 URL: https://issues.apache.org/jira/browse/HBASE-5112 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: hbase-5112.patch In TestReplication#queueFailover, the second scan is not reset for each new scan. A subsequent scan may not be able to scan the whole table, so it cannot get all the data, and the test fails. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
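The fix amounts to constructing a fresh Scan for every retry instead of reusing one whose internal state (for example, the start row) has advanced. A hedged sketch of the pattern with made-up table logic (not the actual TestReplication code):
{code}
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class RetryScanSketch {
  static int countRowsWithRetry(HTable table, int retries) throws Exception {
    int rows = 0;
    for (int attempt = 0; attempt < retries; attempt++) {
      rows = 0;
      // A new Scan per attempt: reusing the previous Scan object would
      // carry over state and silently skip part of the table on later
      // attempts, which is the flakiness described above.
      Scan scan = new Scan();
      ResultScanner scanner = table.getScanner(scan);
      try {
        for (Result r : scanner) {
          rows++;
        }
      } finally {
        scanner.close();
      }
      if (rows > 0) {
        break; // illustrative stop condition
      }
    }
    return rows;
  }
}
{code}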
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region server the root
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177417#comment-13177417 ] Jimmy Xiang commented on HBASE-5099: @Ted, thanks! ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region server the root region is on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099.patch A RS died. The ServerShutdownHandler kicked in and started the log splitting. The SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued. At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new zookeeper connection was created. However, this master became the new master again. It tried to assign root and meta. Because the dead RS had the old root region, the master needs to wait for the log splitting to complete. This waiting holds the zookeeper event thread. So the async create-split-task is never retried, since there is only one event thread, which is waiting for the root region to be assigned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region server the root
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177446#comment-13177446 ] Jimmy Xiang commented on HBASE-5099: There is no harm in calling shutdownNow() even if awaitTermination did not time out. So this should be fine:
{code}
if (executor.awaitTermination(timeout, TimeUnit.MILLISECONDS) && result.isDone()) {
  Boolean recovered = result.get();
  if (recovered != null) {
    return recovered.booleanValue();
  }
}
executor.shutdownNow();
{code}
ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region server the root region is on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0, 0.94.0 Attachments: 5099.92, ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099-v2.patch, hbase-5099-v3.patch, hbase-5099-v4.patch, hbase-5099-v5.patch, hbase-5099-v6.patch, hbase-5099.patch A RS died. The ServerShutdownHandler kicked in and started the log splitting. The SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued. At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new zookeeper connection was created. However, this master became the new master again. It tried to assign root and meta. Because the dead RS had the old root region, the master needs to wait for the log splitting to complete. This waiting holds the zookeeper event thread. So the async create-split-task is never retried, since there is only one event thread, which is waiting for the root region to be assigned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
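Putting the pieces together, the proposal is to run the recovery in a single-thread executor and bound the wait. A self-contained sketch of that shape (the method names and the false-on-timeout policy are assumptions, not HMaster's actual code):
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class BoundedRecoverySketch {
  // Runs the potentially blocking recovery off the ZK event thread and
  // gives up after `timeout` ms instead of hanging forever.
  static boolean recoverWithTimeout(long timeout) throws Exception {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Future<Boolean> result = executor.submit(
          BoundedRecoverySketch::doRecovery);
      executor.shutdown(); // no new tasks; the submitted one still runs
      if (executor.awaitTermination(timeout, TimeUnit.MILLISECONDS)
          && result.isDone()) {
        Boolean recovered = result.get();
        if (recovered != null) {
          return recovered.booleanValue();
        }
      }
      return false; // timed out or no answer: fail the recovery and abort
    } finally {
      executor.shutdownNow(); // harmless even if already terminated
    }
  }

  private static Boolean doRecovery() {
    // Stand-in for tryRecoveringExpiredZKSession().
    return Boolean.TRUE;
  }
}
{code}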
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region server the root
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176778#comment-13176778 ] Jimmy Xiang commented on HBASE-5099: Cool, let me submit a patch. ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region server the root region is on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png, hbase-5099.patch A RS died. The ServerShutdownHandler kicked in and started the log splitting. The SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued. At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new zookeeper connection was created. However, this master became the new master again. It tried to assign root and meta. Because the dead RS had the old root region, the master needs to wait for the log splitting to complete. This waiting holds the zookeeper event thread. So the async create-split-task is never retried, since there is only one event thread, which is waiting for the root region to be assigned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region server the root
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176430#comment-13176430 ] Jimmy Xiang commented on HBASE-5099: This is good. If introducing a timeout, I prefer to do it for tryRecoveringExpiredZKSession(). The reason is that, besides waitForAssignment, there are several other places which have waiting logic as well, such as bulkAssign(), waitForRoot(), this.activeMasterManager.blockUntilBecomingActiveMaster(startupStatus), etc. ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region server the root region is on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png A RS died. The ServerShutdownHandler kicked in and started the log splitting. The SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued. At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new zookeeper connection was created. However, this master became the new master again. It tried to assign root and meta. Because the dead RS had the old root region, the master needs to wait for the log splitting to complete. This waiting holds the zookeeper event thread. So the async create-split-task is never retried, since there is only one event thread, which is waiting for the root region to be assigned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5099) ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region server the root
[ https://issues.apache.org/jira/browse/HBASE-5099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176447#comment-13176447 ] Jimmy Xiang commented on HBASE-5099: tryRecoveringExpiredZKSession() is only called by abortNow(), which is called by abort(), which is called by the eventThread. I was thinking of putting this whole method in another thread with an executor service and timing it out after a certain time, for example 5 minutes, then failing the recovery and letting it abort. This way, we don't have to add timeouts to all the methods. The regular master startup, which calls assignRootAndMeta() too, is not impacted. However, if we know that most likely just waitForAssignment() takes a long time, we can add a timeout to this method only. But I am not so sure. ZK event thread waiting for root region while server shutdown handler waiting for event thread to finish distributed log splitting to recover the region server the root region is on Key: HBASE-5099 URL: https://issues.apache.org/jira/browse/HBASE-5099 Project: HBase Issue Type: Bug Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Attachments: ZK-event-thread-waiting-for-root.png, distributed-log-splitting-hangs.png A RS died. The ServerShutdownHandler kicked in and started the log splitting. The SplitLogManager installed the tasks asynchronously, then started to wait for them to complete. The task znodes were not actually created; the requests were just queued. At this time, the zookeeper connection expired. HMaster tried to recover the expired ZK session. During the recovery, a new zookeeper connection was created. However, this master became the new master again. It tried to assign root and meta. Because the dead RS had the old root region, the master needs to wait for the log splitting to complete. This waiting holds the zookeeper event thread. So the async create-split-task is never retried, since there is only one event thread, which is waiting for the root region to be assigned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175155#comment-13175155 ] Jimmy Xiang commented on HBASE-5081: @Stack, it is not an orphan task. It happens in the ServerShutdownHandler. It retries the log splitting if the previous one failed for any reason, at line 178: this.services.getExecutorService().submit(this); It keeps retrying. Should we have a limit here? Distributed log splitting deleteNode races against splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0 Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 RC testing, we found that distributed log splitting hangs there forever. Please see the attached screen shot. I looked into it, and here is what I think happened:
1. One RS died; the ServerShutdownHandler found it out and started the distributed log splitting;
2. All three tasks failed, so the three tasks were deleted, asynchronously;
3. The ServerShutdownHandler retried the log splitting;
4. During the retry, it created these three tasks again, and put them in a hashmap (tasks);
5. The asynchronous deletion in step 2 finally happened for one task; in the callback, it removed one task from the hashmap;
6. One of the newly submitted tasks' zookeeper watcher found that the task was unassigned, and it was not in the hashmap, so it created a new orphan task.
7. All three tasks failed, but the task created in step 6 was an orphan, so the batch.err counter was one short; the log splitting hangs there and keeps waiting for the last task to finish, which is never going to happen.
So I think the problem is step 2. The fix is to make the deletion sync, instead of async, so that the retry will have a clean start. An async deleteNode will mess up the split log retry. In an extreme situation, if the async deleteNode doesn't happen soon enough, some node created during the retry could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175158#comment-13175158 ] Jimmy Xiang commented on HBASE-5081: @Prakash, this one didn't happen when the master started up. It happened when one region server died. Distributed log splitting deleteNode races against splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Fix For: 0.92.0 Attachments: distributed-log-splitting-screenshot.png, hbase-5081-patch-v6.txt, hbase-5081-patch-v7.txt, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt Recently, during 0.92 RC testing, we found that distributed log splitting hangs there forever. Please see the attached screen shot. I looked into it, and here is what I think happened:
1. One RS died; the ServerShutdownHandler found it out and started the distributed log splitting;
2. All three tasks failed, so the three tasks were deleted, asynchronously;
3. The ServerShutdownHandler retried the log splitting;
4. During the retry, it created these three tasks again, and put them in a hashmap (tasks);
5. The asynchronous deletion in step 2 finally happened for one task; in the callback, it removed one task from the hashmap;
6. One of the newly submitted tasks' zookeeper watcher found that the task was unassigned, and it was not in the hashmap, so it created a new orphan task.
7. All three tasks failed, but the task created in step 6 was an orphan, so the batch.err counter was one short; the log splitting hangs there and keeps waiting for the last task to finish, which is never going to happen.
So I think the problem is step 2. The fix is to make the deletion sync, instead of async, so that the retry will have a clean start. An async deleteNode will mess up the split log retry. In an extreme situation, if the async deleteNode doesn't happen soon enough, some node created during the retry could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174180#comment-13174180 ] Jimmy Xiang commented on HBASE-5081: I am working on a patch now. I think a synchronous deleteNode is clean. It will give the retry a fresh start. But it may take a while if there are too many files. Yes, for the long term, we can think about how to do what Stack says. Distributed log splitting deleteNode races against splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png Recently, during 0.92 RC testing, we found that distributed log splitting hangs there forever. Please see the attached screen shot. I looked into it, and here is what I think happened:
1. One RS died; the ServerShutdownHandler found it out and started the distributed log splitting;
2. All three tasks failed, so the three tasks were deleted, asynchronously;
3. The ServerShutdownHandler retried the log splitting;
4. During the retry, it created these three tasks again, and put them in a hashmap (tasks);
5. The asynchronous deletion in step 2 finally happened for one task; in the callback, it removed one task from the hashmap;
6. One of the newly submitted tasks' zookeeper watcher found that the task was unassigned, and it was not in the hashmap, so it created a new orphan task.
7. All three tasks failed, but the task created in step 6 was an orphan, so the batch.err counter was one short; the log splitting hangs there and keeps waiting for the last task to finish, which is never going to happen.
So I think the problem is step 2. The fix is to make the deletion sync, instead of async, so that the retry will have a clean start. An async deleteNode will mess up the split log retry. In an extreme situation, if the async deleteNode doesn't happen soon enough, some node created during the retry could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174187#comment-13174187 ] Jimmy Xiang commented on HBASE-5081: Can we deleteNode only if it has completed successfully? If it is not completed, let the node stay there. In this case, when the retry happens, it should see the old node there, but that is OK. The new task in the hashmap won't be deleted either. Distributed log splitting deleteNode races against splitLog retry --- Key: HBASE-5081 URL: https://issues.apache.org/jira/browse/HBASE-5081 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.92.0, 0.94.0 Reporter: Jimmy Xiang Assignee: Jimmy Xiang Attachments: distributed-log-splitting-screenshot.png Recently, during 0.92 RC testing, we found that distributed log splitting hangs there forever. Please see the attached screen shot. I looked into it, and here is what I think happened:
1. One RS died; the ServerShutdownHandler found it out and started the distributed log splitting;
2. All three tasks failed, so the three tasks were deleted, asynchronously;
3. The ServerShutdownHandler retried the log splitting;
4. During the retry, it created these three tasks again, and put them in a hashmap (tasks);
5. The asynchronous deletion in step 2 finally happened for one task; in the callback, it removed one task from the hashmap;
6. One of the newly submitted tasks' zookeeper watcher found that the task was unassigned, and it was not in the hashmap, so it created a new orphan task.
7. All three tasks failed, but the task created in step 6 was an orphan, so the batch.err counter was one short; the log splitting hangs there and keeps waiting for the last task to finish, which is never going to happen.
So I think the problem is step 2. The fix is to make the deletion sync, instead of async, so that the retry will have a clean start. An async deleteNode will mess up the split log retry. In an extreme situation, if the async deleteNode doesn't happen soon enough, some node created during the retry could be deleted. deleteNode should be sync. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174251#comment-13174251 ] Jimmy Xiang commented on HBASE-5081: The patch is actually for both 0.92 and 0.94.

Attachments: distributed-log-splitting-screenshot.png, hbase-5081_patch_for_92_v4.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt
[jira] [Commented] (HBASE-5081) Distributed log splitting deleteNode races against splitLog retry
[ https://issues.apache.org/jira/browse/HBASE-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174357#comment-13174357 ] Jimmy Xiang commented on HBASE-5081: I am thinking about a sync delete for the failure case. What do you think? I am addressing the test failure now.

Attachments: distributed-log-splitting-screenshot.png, hbase-5081_patch_for_92_v4.txt, hbase-5081_patch_v5.txt, patch_for_92.txt, patch_for_92_v2.txt, patch_for_92_v3.txt

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
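[Editor's note] One way to read "sync delete for the failure case" is a hybrid: block on the delete only where a retry will recreate the znode, and keep the cheaper fire-and-forget delete on success, where nothing recreates the path and a late callback is harmless. A hedged sketch under that assumption, with invented names as before:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.zookeeper.AsyncCallback;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical sketch only, not the real SplitLogManager API.
public class HybridDeleteSketch {
  private final ZooKeeper zk;
  private final ConcurrentMap<String, Object> tasks =
      new ConcurrentHashMap<String, Object>();

  HybridDeleteSketch(ZooKeeper zk) { this.zk = zk; }

  void deleteNode(String taskPath, boolean taskFailed)
      throws KeeperException, InterruptedException {
    if (taskFailed) {
      // A retry will resubmit this task: block until the delete is
      // confirmed so the recreated node cannot race a stale callback.
      try {
        zk.delete(taskPath, -1);
      } catch (KeeperException.NoNodeException ignored) {
      }
      tasks.remove(taskPath);
    } else {
      // Success: async is safe here; the callback only cleans up
      // in-memory state for a path that will never be recreated.
      zk.delete(taskPath, -1, new AsyncCallback.VoidCallback() {
        @Override
        public void processResult(int rc, String path, Object ctx) {
          tasks.remove(path);
        }
      }, null);
    }
  }
}
{code}

This keeps the common (success) path fast while paying the synchronous cost only on the failure path, where correctness demands it.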