[jira] [Commented] (HBASE-12742) ClusterStatusPublisher crashes with a IPv6 network interface.
[ https://issues.apache.org/jira/browse/HBASE-12742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259874#comment-14259874 ] Qiang Tian commented on HBASE-12742: ah... an IPv4 address.. thanks [~stack] for telling me about the progress!
ClusterStatusPublisher crashes with a IPv6 network interface. - Key: HBASE-12742 URL: https://issues.apache.org/jira/browse/HBASE-12742 Project: HBase Issue Type: Bug Reporter: Jurriaan Mous Assignee: Jurriaan Mous Fix For: 1.0.0, 2.0.0, 1.1.0 Attachments: HBASE-12742-v1.patch, HBASE-12742.patch
On my dev machine the first network interface is an IPv6 tunnel. HBase works internally with IPv4 addresses, and Addressing selects that first interface (the tunnel) to use. This causes the ClusterStatusPublisher's DatagramChannel group join to crash. Stack trace:
{code}
java.io.IOException: Shutting down
  at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:231)
  at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:93)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:976)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:936)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:810)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:792)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:779)
  at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:140)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
  at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
  at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
  at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
  at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
  at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
  at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
  at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
  at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
  at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster
  at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:143)
  at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
  at org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:153)
  at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:213)
  ... 26 more
Caused by: java.io.IOException: Network interface not configured for IPv4
  at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:860)
  at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:885)
  at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:409)
  at org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:286)
  at org.apache.hadoop.hbase.master.ClusterStatusPublisher.<init>(ClusterStatusPublisher.java:129)
  at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:379)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:139)
  ... 29 more
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
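For context on the failure mode, a minimal sketch, assuming one simply wants the first IPv4-capable interface rather than the machine's first interface (illustration only; not necessarily what HBASE-12742-v1.patch does):
{code}
// Hedged illustration: walk the NICs and prefer one that actually carries an
// IPv4 address before attempting the IPv4 multicast group join.
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Enumeration;

static NetworkInterface findIpv4Interface() throws SocketException {
  Enumeration<NetworkInterface> nis = NetworkInterface.getNetworkInterfaces();
  while (nis.hasMoreElements()) {
    NetworkInterface ni = nis.nextElement();
    Enumeration<InetAddress> addrs = ni.getInetAddresses();
    while (addrs.hasMoreElements()) {
      if (addrs.nextElement() instanceof Inet4Address) {
        return ni; // first interface configured for IPv4
      }
    }
  }
  return null; // no IPv4-capable NIC: the group join above would still fail
}
{code}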
[jira] [Commented] (HBASE-12657) The Region is not being split and far exceeds the desired maximum size.
[ https://issues.apache.org/jira/browse/HBASE-12657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243652#comment-14243652 ] Qiang Tian commented on HBASE-12657: Hi [~lhofhansl], sorry, I do not quite understand that part. We expect filesCompacting to be empty in getCurrentEligibleFiles, right? Regarding HStore#filesCompacting: it looks like we always remove the files of a CompactionRequest via finishCompactionRequest, so I am not sure why we need to maintain it.
The Region is not being split and far exceeds the desired maximum size. --- Key: HBASE-12657 URL: https://issues.apache.org/jira/browse/HBASE-12657 Project: HBase Issue Type: Bug Components: Compaction Affects Versions: 0.94.25 Reporter: Vladimir Rodionov Assignee: Vladimir Rodionov Fix For: 0.94.26 Attachments: HBASE-12657-0.94.patch, HBASE-12657-0.94.patch.2
We are seeing this behavior when creating indexes in one of our environments. When an index is being created, most of the requests go into a single region. The amount of time to create an index seems to take longer than usual, and it can take days for the regions to compact and split after the index is created. Here is a du of the HBase index table:
{code}
-bash-4.1$ sudo -su hdfs hadoop fs -du /hbase/43681
705          /hbase/43681/.tableinfo.01
0            /hbase/43681/.tmp
27981697293  /hbase/43681/0492e22092e21d35fca8e779b21ec797
539687093    /hbase/43681/832298c4e975fc47210feb6bac3d2f71
560660531    /hbase/43681/be9bdb3bdf9365afe5fe90db4247d82c
7081938297   /hbase/43681/cd440e524f96fbe0719b2fe969848560
6297860287   /hbase/43681/dc893a2d8daa08c689dc69e6bb2c5b50
7189607722   /hbase/43681/ffbceaea5e2f142dbe6cd4cbeacc00e8
...
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
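To ground the discussion, a rough sketch of the eligibility filtering in question; the names follow the comment above, but the body is an assumption, not a quote of the 0.94 source:
{code}
// Assumed sketch: candidates for a new compaction are the store's files minus
// whatever is already queued in filesCompacting. If entries in filesCompacting
// are never removed, large files can be starved out of selection and the
// region never compacts down to a split-worthy shape.
List<StoreFile> getCurrentEligibleFiles(List<StoreFile> candidateFiles,
    final List<StoreFile> filesCompacting) {
  candidateFiles.removeAll(filesCompacting);
  return candidateFiles;
}
{code}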
[jira] [Commented] (HBASE-12657) The Region is not being split and far exceeds the desired maximum size.
[ https://issues.apache.org/jira/browse/HBASE-12657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240783#comment-14240783 ] Qiang Tian commented on HBASE-12657: RatioBasedCompactionPolicy.getCurrentEligibleFiles does look suspicious (still not sure when it is triggered, though).
The Region is not being split and far exceeds the desired maximum size. --- Key: HBASE-12657 URL: https://issues.apache.org/jira/browse/HBASE-12657 Project: HBase Issue Type: Bug Components: Compaction Affects Versions: 0.94.25 Reporter: Vladimir Rodionov Assignee: Vladimir Rodionov Fix For: 0.94.26 Attachments: HBASE-12657-0.94.patch
We are seeing this behavior when creating indexes in one of our environments. When an index is being created, most of the requests go into a single region. The amount of time to create an index seems to take longer than usual, and it can take days for the regions to compact and split after the index is created. Here is a du of the HBase index table:
{code}
-bash-4.1$ sudo -su hdfs hadoop fs -du /hbase/43681
705          /hbase/43681/.tableinfo.01
0            /hbase/43681/.tmp
27981697293  /hbase/43681/0492e22092e21d35fca8e779b21ec797
539687093    /hbase/43681/832298c4e975fc47210feb6bac3d2f71
560660531    /hbase/43681/be9bdb3bdf9365afe5fe90db4247d82c
7081938297   /hbase/43681/cd440e524f96fbe0719b2fe969848560
6297860287   /hbase/43681/dc893a2d8daa08c689dc69e6bb2c5b50
7189607722   /hbase/43681/ffbceaea5e2f142dbe6cd4cbeacc00e8
...
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11902: --- Status: Open (was: Patch Available)
RegionServer was blocked while aborting --- Key: HBASE-11902 URL: https://issues.apache.org/jira/browse/HBASE-11902 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 0.98.4 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7 Reporter: Victor Xu Assignee: Qiang Tian Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, hbase11902-master.patch, hbase11902-master_v2.patch, jstack_hadoop461.cm6.log
Generally, regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs, and found out that it was caused by datanode failures. The regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen very frequently, but when it does, it always leads to a huge amount of request failures. The only way out is KILL -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11902: --- Attachment: hbase11902-master_v3.patch Local run passed. Based on HBASE-8208 (which is why wal.sync was introduced into the memstore flush), we also need to fail the flush when we hit an HDFS failure.
RegionServer was blocked while aborting --- Key: HBASE-11902 URL: https://issues.apache.org/jira/browse/HBASE-11902 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 0.98.4 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7 Reporter: Victor Xu Assignee: Qiang Tian Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, hbase11902-master.patch, hbase11902-master_v2.patch, hbase11902-master_v3.patch, jstack_hadoop461.cm6.log
Generally, regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs, and found out that it was caused by datanode failures. The regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen very frequently, but when it does, it always leads to a huge amount of request failures. The only way out is KILL -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11902: --- Status: Patch Available (was: Open)
RegionServer was blocked while aborting --- Key: HBASE-11902 URL: https://issues.apache.org/jira/browse/HBASE-11902 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 0.98.4 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7 Reporter: Victor Xu Assignee: Qiang Tian Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, hbase11902-master.patch, hbase11902-master_v2.patch, hbase11902-master_v3.patch, jstack_hadoop461.cm6.log
Generally, regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs, and found out that it was caused by datanode failures. The regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen very frequently, but when it does, it always leads to a huge amount of request failures. The only way out is KILL -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12558) TestHCM.testClusterStatus Unexpected exception, expectedorg.apache.hadoop.hbase.regionserver.RegionServerStoppedException but wasjunit.framework.AssertionFailedErro
[ https://issues.apache.org/jira/browse/HBASE-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228022#comment-14228022 ] Qiang Tian commented on HBASE-12558: ah.. so it has been there for quite a bit of time, [~stack]? Interesting... why can't I hit it? Is there any pattern in the hits you got, e.g. a particular machine or OS?
TestHCM.testClusterStatus Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> - Key: HBASE-12558 URL: https://issues.apache.org/jira/browse/HBASE-12558 Project: HBase Issue Type: Bug Components: test Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 0.99.2 Attachments: 12558-master.patch, 12558.ignore.txt
Happens for me reliably on mac os x. I looked at fixing it. The listener is not noticing the publish for whatever reason. That's where I stopped.
{code}
java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError>
  at junit.framework.Assert.fail(Assert.java:57)
  at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
  at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
  at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225962#comment-14225962 ] Qiang Tian commented on HBASE-11902: TestLogRolling creates an error scenario similar to this case. The test case failure is caused by the code below:
{code}
// verify the written rows are there
assertTrue(loggedRows.contains("row1002"));
assertTrue(loggedRows.contains("row1003"));
assertTrue(loggedRows.contains("row1004"));
assertTrue(loggedRows.contains("row1005"));

// flush all regions
List<HRegion> regions = new ArrayList<HRegion>(server.getOnlineRegionsLocalContext());
for (HRegion r: regions) {
  r.flushcache(); // <=== the re-thrown exception will end the test case
}
{code}
Adding a try/catch around the flushcache call makes it pass.
RegionServer was blocked while aborting --- Key: HBASE-11902 URL: https://issues.apache.org/jira/browse/HBASE-11902 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 0.98.4 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7 Reporter: Victor Xu Assignee: Qiang Tian Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, hbase11902-master.patch, jstack_hadoop461.cm6.log
Generally, regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs, and found out that it was caused by datanode failures. The regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen very frequently, but when it does, it always leads to a huge amount of request failures. The only way out is KILL -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
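A hedged sketch of that workaround (assumed shape; the actual test change may differ):
{code}
// Assumed sketch: the flush is expected to fail here because the WAL is
// already broken, so catch the IOException instead of letting it end the test.
for (HRegion r : regions) {
  try {
    r.flushcache();
  } catch (IOException expected) {
    LOG.info("flushcache failed as expected after the WAL error", expected);
  }
}
{code}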
[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11902: --- Status: Open (was: Patch Available)
RegionServer was blocked while aborting --- Key: HBASE-11902 URL: https://issues.apache.org/jira/browse/HBASE-11902 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 0.98.4 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7 Reporter: Victor Xu Assignee: Qiang Tian Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, hbase11902-master.patch, jstack_hadoop461.cm6.log
Generally, regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs, and found out that it was caused by datanode failures. The regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen very frequently, but when it does, it always leads to a huge amount of request failures. The only way out is KILL -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11902: --- Attachment: hbase11902-master_v2.patch There are many differences in the WAL between master and 0.98, such as the thread model and error handling. In 0.98 the RS is aborted in case of an HDFS failure, but the master branch ignores it and the write/sync thread continues. Made the patch simpler.
RegionServer was blocked while aborting --- Key: HBASE-11902 URL: https://issues.apache.org/jira/browse/HBASE-11902 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 0.98.4 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7 Reporter: Victor Xu Assignee: Qiang Tian Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, hbase11902-master.patch, hbase11902-master_v2.patch, jstack_hadoop461.cm6.log
Generally, regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs, and found out that it was caused by datanode failures. The regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen very frequently, but when it does, it always leads to a huge amount of request failures. The only way out is KILL -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11902: --- Status: Patch Available (was: Open)
RegionServer was blocked while aborting --- Key: HBASE-11902 URL: https://issues.apache.org/jira/browse/HBASE-11902 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 0.98.4 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7 Reporter: Victor Xu Assignee: Qiang Tian Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, hbase11902-master.patch, hbase11902-master_v2.patch, jstack_hadoop461.cm6.log
Generally, regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs, and found out that it was caused by datanode failures. The regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen very frequently, but when it does, it always leads to a huge amount of request failures. The only way out is KILL -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227322#comment-14227322 ] Qiang Tian commented on HBASE-11902: OK, the latest failure is because, in the test case, only the WAL write fails. If we hide the exception (just decrement the counter) and continue, the data flush will succeed, so the completeCacheFlush call decrements the counter again! To preserve the counter semantics, simple is best: return right away (the original patch).
RegionServer was blocked while aborting --- Key: HBASE-11902 URL: https://issues.apache.org/jira/browse/HBASE-11902 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 0.98.4 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7 Reporter: Victor Xu Assignee: Qiang Tian Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, hbase11902-master.patch, hbase11902-master_v2.patch, jstack_hadoop461.cm6.log
Generally, regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs, and found out that it was caused by datanode failures. The regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen very frequently, but when it does, it always leads to a huge amount of request failures. The only way out is KILL -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
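A schematic of the double-decrement hazard described above (hypothetical shape; the method names follow the discussion, not the exact source):
{code}
// Hypothetical sketch: each startCacheFlush must be balanced by exactly one
// endOp-style call, either abortCacheFlush (error path) or completeCacheFlush
// (success path) -- never both.
wal.startCacheFlush(encodedRegionName);     // beginOp: counter += 2
try {
  wal.sync();                               // the WAL write fails here
} catch (IOException e) {
  wal.abortCacheFlush(encodedRegionName);   // endOp once...
  throw e;                                  // ...and return right away;
}                                           // falling through would reach
wal.completeCacheFlush(encodedRegionName);  // a second endOp (the bug)
{code}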
[jira] [Commented] (HBASE-12558) TestHCM.testClusterStatus Unexpected exception, expectedorg.apache.hadoop.hbase.regionserver.RegionServerStoppedException but wasjunit.framework.AssertionFailedErro
[ https://issues.apache.org/jira/browse/HBASE-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227328#comment-14227328 ] Qiang Tian commented on HBASE-12558: Thanks [~stack]. So the publisher does send, but the listener does not get it. Perhaps netty logging could help show what is going on. Unfortunately I cannot reproduce the problem in my env... if everything was fine before HBASE-12359, maybe we need to revert the patch. Sorry for that.
TestHCM.testClusterStatus Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> - Key: HBASE-12558 URL: https://issues.apache.org/jira/browse/HBASE-12558 Project: HBase Issue Type: Bug Components: test Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 0.99.2 Attachments: 12558-master.patch, 12558.ignore.txt
Happens for me reliably on mac os x. I looked at fixing it. The listener is not noticing the publish for whatever reason. That's where I stopped.
{code}
java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError>
  at junit.framework.Assert.fail(Assert.java:57)
  at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
  at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
  at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
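One hedged way to get that netty-level visibility (illustrative only; not part of any attached patch) is to hang a LoggingHandler on the datagram channel's pipeline:
{code}
// Illustrative sketch: log every datagram the multicast channel sends or
// receives. LoggingHandler and LogLevel are standard netty 4 classes;
// "channel" stands in for the NioDatagramChannel the publisher/listener builds.
import io.netty.handler.logging.LogLevel;
import io.netty.handler.logging.LoggingHandler;

channel.pipeline().addFirst(new LoggingHandler(LogLevel.DEBUG));
{code}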
[jira] [Commented] (HBASE-12558) TestHCM.testClusterStatus Unexpected exception, expectedorg.apache.hadoop.hbase.regionserver.RegionServerStoppedException but wasjunit.framework.AssertionFailedErro
[ https://issues.apache.org/jira/browse/HBASE-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224301#comment-14224301 ] Qiang Tian commented on HBASE-12558: Still looking.. (and at the UT failure in 11902). Thanks.
TestHCM.testClusterStatus Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> - Key: HBASE-12558 URL: https://issues.apache.org/jira/browse/HBASE-12558 Project: HBase Issue Type: Bug Components: test Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 0.99.2
Happens for me reliably on mac os x. I looked at fixing it. The listener is not noticing the publish for whatever reason. That's where I stopped.
{code}
java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError>
  at junit.framework.Assert.fail(Assert.java:57)
  at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
  at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
  at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12558) TestHCM.testClusterStatus Unexpected exception, expectedorg.apache.hadoop.hbase.regionserver.RegionServerStoppedException but wasjunit.framework.AssertionFailedError
[ https://issues.apache.org/jira/browse/HBASE-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-12558: --- Attachment: 12558-master.patch As you mentioned, in the good case the listener receives the notification and writes "There is a new dead server" in the log. Two possible cases for a bad run: 1) the publisher does not collect data (looks impossible, but just in case, log a debug message for it); 2) the publisher fails to send to the client (handled by netty; add the same change to the listener code). Let's see what will happen.
TestHCM.testClusterStatus Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> - Key: HBASE-12558 URL: https://issues.apache.org/jira/browse/HBASE-12558 Project: HBase Issue Type: Bug Components: test Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 0.99.2 Attachments: 12558-master.patch
Happens for me reliably on mac os x. I looked at fixing it. The listener is not noticing the publish for whatever reason. That's where I stopped.
{code}
java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError>
  at junit.framework.Assert.fail(Assert.java:57)
  at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
  at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
  at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223976#comment-14223976 ] Qiang Tian commented on HBASE-11902: From the stack trace, DrainBarrier.stopAndDrainOps is waiting, so DrainBarrier#endOp did not notify it. Looking at class DrainBarrier, it is expected that beginOp and endOp are called in pairs. The initial value of {{valueAndFlags}} is 2; it is incremented by 2 in beginOp and decremented by 2 in endOp. In stopAndDrainOps, if getValue(oldValAndFlags) == 1 (meaning oldValAndFlags == 2), all ops completed in pairs; otherwise it needs to wait for the last endOp to notify it:
{code}
if (getValue(oldValAndFlags) == 1) return; // There were no operations outstanding.
synchronized (this) { this.wait(); }
{code}
So the problem could be that beginOp/endOp are not called in pairs; the hole looks to be here, in HRegion#internalFlushcache:
{code}
// sync unflushed WAL changes when deferred log sync is enabled
// see HBASE-8208 for details
if (wal != null && !shouldSyncLog()) {
  wal.sync();
}
{code}
At that point wal.startCacheFlush -> closeBarrier.beginOp has been called, but completeCacheFlush -> closeBarrier.endOp() is not protected by a try block, so if the WAL/HDFS layer throws an exception, endOp will not be called. Related info in the log:
{quote}
2014-09-03 13:38:03,789 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error while AsyncWriter write, request close of hlog
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,789 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: Error while AsyncSyncer sync, request close of hlog
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
  at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,799 ERROR org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for region page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c. // <== MemStoreFlusher#flushRegion
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
{quote}
The exception thrown to the caller should be here, in FSHLog#syncer:
{code}
if (txid <= this.failedTxid.get()) {
  assert asyncIOE != null : "current txid is among(under) failed txids, but asyncIOE is null!";
  throw asyncIOE;
}
{code}
The master branch can catch the HDFS exception, but it just ignores it, which looks incorrect:
{code}
if (wal != null) {
  try {
    wal.sync(); // ensure that flush marker is sync'ed
  } catch (IOException ioe) {
    LOG.warn("Unexpected exception while wal.sync(), ignoring. Exception: "
        + StringUtils.stringifyException(ioe));
  }
}
{code}
Personally I think the exception should not be ignored, since it is a severe HDFS error.
RegionServer was blocked while aborting --- Key: HBASE-11902 URL: https://issues.apache.org/jira/browse/HBASE-11902 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 0.98.4 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7 Reporter: Victor Xu Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, jstack_hadoop461.cm6.log
Generally, regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs, and found out that it was caused by datanode failures. The regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen very frequently, but when it does, it always leads to a huge amount of request failures. The only way out is KILL -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
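To make the pairing invariant from the analysis above concrete, a hedged sketch of how a DrainBarrier caller avoids the hang (simplified; not the actual HBase call sites):
{code}
// Sketch: every successful beginOp must be balanced by exactly one endOp,
// or stopAndDrainOps() waits forever on the counter.
if (barrier.beginOp()) {   // counter += 2 (the startCacheFlush path)
  try {
    doFlushWork();         // may throw on severe HDFS failure
  } finally {
    barrier.endOp();       // counter -= 2; the last endOp wakes up
  }                        // stopAndDrainOps(), so close() can finish
}
{code}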
[jira] [Assigned] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian reassigned HBASE-11902: -- Assignee: Qiang Tian
RegionServer was blocked while aborting --- Key: HBASE-11902 URL: https://issues.apache.org/jira/browse/HBASE-11902 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 0.98.4 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7 Reporter: Victor Xu Assignee: Qiang Tian Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, jstack_hadoop461.cm6.log
Generally, regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs, and found out that it was caused by datanode failures. The regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen very frequently, but when it does, it always leads to a huge amount of request failures. The only way out is KILL -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223989#comment-14223989 ] Qiang Tian commented on HBASE-11902: proposed fix for 0.98:
{code}
--- hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
+++ hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
@@ -1760,7 +1760,13 @@ public class HRegion implements HeapSize { // , Writable{
     // sync unflushed WAL changes when deferred log sync is enabled
     // see HBASE-8208 for details
     if (wal != null && !shouldSyncLog()) {
-      wal.sync();
+      try {
+        wal.sync();
+      } catch (IOException e) {
+        wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
+        LOG.warn("Unexpected exception while wal.sync(), re-throw");
+        throw e;
+      }
     }
{code}
The master branch code writes an ABORT_FLUSH marker to the log before we call wal.abortCacheFlush; is that also needed if wal.sync aborts? I am also wondering whether we could build an error-injection test for this kind of failure, which mostly happens in a real env but would not happen in a UT.
RegionServer was blocked while aborting --- Key: HBASE-11902 URL: https://issues.apache.org/jira/browse/HBASE-11902 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 0.98.4 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7 Reporter: Victor Xu Assignee: Qiang Tian Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, jstack_hadoop461.cm6.log
Generally, regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs, and found out that it was caused by datanode failures. The regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen very frequently, but when it does, it always leads to a huge amount of request failures. The only way out is KILL -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
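On the error-injection question, one hedged possibility (a hypothetical test shape, not an existing HBase utility) is to spy on the WAL so sync() fails the way a datanode failure would:
{code}
// Hypothetical error-injection sketch using Mockito: force wal.sync() to
// throw, then assert that the flush aborts cleanly instead of leaving the
// closeBarrier counter unbalanced. "realWal" is an assumed handle to the
// region's actual HLog instance.
HLog wal = Mockito.spy(realWal);
Mockito.doThrow(new IOException("injected: all datanodes are bad"))
    .when(wal).sync();
// Drive a memstore flush against a region wired to this wal, then verify that
// the IOException propagates and that wal.close() does not block afterwards.
{code}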
[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11902: --- Attachment: hbase11902-master.patch bq. the master branch code writes an ABORT_FLUSH marker to the log before we call wal.abortCacheFlush; is that also needed if wal.sync aborts?
The WAL write would fail in this case too, since it is a severe HDFS failure.
RegionServer was blocked while aborting --- Key: HBASE-11902 URL: https://issues.apache.org/jira/browse/HBASE-11902 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 0.98.4 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7 Reporter: Victor Xu Assignee: Qiang Tian Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, hbase11902-master.patch, jstack_hadoop461.cm6.log
Generally, regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs, and found out that it was caused by datanode failures. The regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen very frequently, but when it does, it always leads to a huge amount of request failures. The only way out is KILL -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting
[ https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11902: --- Status: Patch Available (was: Open)
RegionServer was blocked while aborting --- Key: HBASE-11902 URL: https://issues.apache.org/jira/browse/HBASE-11902 Project: HBase Issue Type: Bug Components: regionserver, wal Affects Versions: 0.98.4 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7 Reporter: Victor Xu Assignee: Qiang Tian Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, hbase11902-master.patch, jstack_hadoop461.cm6.log
Generally, regionserver automatically aborts when isHealth() returns false, but it sometimes gets blocked while aborting. I saved the jstack and logs, and found out that it was caused by datanode failures. The regionserver60020 thread was blocked while closing the WAL. This issue doesn't happen very frequently, but when it does, it always leads to a huge amount of request failures. The only way out is KILL -9. I think it's a bug, but I haven't found a decent solution. Does anyone have the same problem?
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12558) TestHCM.testClusterStatus Unexpected exception, expectedorg.apache.hadoop.hbase.regionserver.RegionServerStoppedException but wasjunit.framework.AssertionFailedErro
[ https://issues.apache.org/jira/browse/HBASE-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222589#comment-14222589 ] Qiang Tian commented on HBASE-12558: Hi Stack, it did not fail after HBASE-12359, but fails now?
TestHCM.testClusterStatus Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> - Key: HBASE-12558 URL: https://issues.apache.org/jira/browse/HBASE-12558 Project: HBase Issue Type: Bug Components: test Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 0.99.2
Happens for me reliably on mac os x. I looked at fixing it. The listener is not noticing the publish for whatever reason. That's where I stopped.
{code}
java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError>
  at junit.framework.Assert.fail(Assert.java:57)
  at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
  at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
  at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12558) TestHCM.testClusterStatus Unexpected exception, expectedorg.apache.hadoop.hbase.regionserver.RegionServerStoppedException but wasjunit.framework.AssertionFailedErro
[ https://issues.apache.org/jira/browse/HBASE-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222705#comment-14222705 ] Qiang Tian commented on HBASE-12558: Interesting... :-) A timing issue? Perhaps 10s is not long enough? I can take a look if it happens again (log, stack trace, etc.).
TestHCM.testClusterStatus Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> - Key: HBASE-12558 URL: https://issues.apache.org/jira/browse/HBASE-12558 Project: HBase Issue Type: Bug Components: test Reporter: stack Assignee: stack Priority: Critical Fix For: 2.0.0, 0.99.2
Happens for me reliably on mac os x. I looked at fixing it. The listener is not noticing the publish for whatever reason. That's where I stopped.
{code}
java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError>
  at junit.framework.Assert.fail(Assert.java:57)
  at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
  at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
  at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12451) IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster
[ https://issues.apache.org/jira/browse/HBASE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212007#comment-14212007 ] Qiang Tian commented on HBASE-12451: Shaohui boss, this does not look like a small change? :-) Even if the average region count is used, the code below will use the max file size in most cases - e.g. when tableRegionsCount = 4, the value is 16384M, which is bigger than DEFAULT_MAX_FILE_SIZE:
{code}
return tableRegionsCount == 0 || tableRegionsCount > 100
    ? getDesiredMaxFileSize()
    : Math.min(getDesiredMaxFileSize(),
        this.initialSize * tableRegionsCount * tableRegionsCount * tableRegionsCount);
{code}
Personally I like KISS - a simple configuration parameter can resolve this case. If we get many complaints about the existing region split policy, that is another story... my 2 cents.
IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster - Key: HBASE-12451 URL: https://issues.apache.org/jira/browse/HBASE-12451 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0 Attachments: HBASE-12451-v1.diff
Currently IncreasingToUpperBoundRegionSplitPolicy is the default region split policy. In this policy, the split size is the number of regions on this server that belong to the same table, cubed, times 2x the region flush size. But when unloading the regions of a regionserver in a cluster using region_mover.rb, the number of regions on this server belonging to the same table decreases, and the split size decreases too, which may cause the remaining regions on the regionserver to split. Region splits also happen when loading regions onto a regionserver in a cluster. An improvement may be to set a minimum split size in IncreasingToUpperBoundRegionSplitPolicy. Suggestions are welcomed. Thanks~
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12451) IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster
[ https://issues.apache.org/jira/browse/HBASE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212065#comment-14212065 ] Qiang Tian commented on HBASE-12451: bq. Add a minimum split size will break the first goal, right?
It is configurable; we can set the default minimum size to 0, in which case the calculation ignores the minimum size (pseudocode):
{code}
long splitSize = tableRegionsCount == 0 || tableRegionsCount > 100
    ? getDesiredMaxFileSize()
    : Math.min(getDesiredMaxFileSize(),
        this.initialSize * tableRegionsCount * tableRegionsCount * tableRegionsCount);
long minimumSize = conf.getLong("hbase.hregion.split.minimum.size", 0);
return minimumSize > 0 ? Math.max(splitSize, minimumSize) : splitSize;
{code}
bq. And actually, there is a hbase.increasing.policy.initial.size configuration in IncreasingToUpperBoundRegionSplitPolicy which specifies the initial (minimum) split size.
Besides initial.size, tableRegionsCount is also a variable.
bq. And if you do not like the curve of split size, I think we should introduce a new split policy instead?
I do not mean I do not like it; it looks to me that the case described in the JIRA is not a common one, so a simple fix can resolve it quickly... if there are other issues, I am open to ideas :-)
IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster - Key: HBASE-12451 URL: https://issues.apache.org/jira/browse/HBASE-12451 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0 Attachments: HBASE-12451-v1.diff
Currently IncreasingToUpperBoundRegionSplitPolicy is the default region split policy. In this policy, the split size is the number of regions on this server that belong to the same table, cubed, times 2x the region flush size. But when unloading the regions of a regionserver in a cluster using region_mover.rb, the number of regions on this server belonging to the same table decreases, and the split size decreases too, which may cause the remaining regions on the regionserver to split. Region splits also happen when loading regions onto a regionserver in a cluster. An improvement may be to set a minimum split size in IncreasingToUpperBoundRegionSplitPolicy. Suggestions are welcomed. Thanks~
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12451) IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster
[ https://issues.apache.org/jira/browse/HBASE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14212154#comment-14212154 ] Qiang Tian commented on HBASE-12451: Had an offline discussion with Duo and got more details about the scenario. Now I understand the requirement. Thanks.
IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster - Key: HBASE-12451 URL: https://issues.apache.org/jira/browse/HBASE-12451 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0 Attachments: HBASE-12451-v1.diff
Currently IncreasingToUpperBoundRegionSplitPolicy is the default region split policy. In this policy, the split size is the number of regions on this server that belong to the same table, cubed, times 2x the region flush size. But when unloading the regions of a regionserver in a cluster using region_mover.rb, the number of regions on this server belonging to the same table decreases, and the split size decreases too, which may cause the remaining regions on the regionserver to split. Region splits also happen when loading regions onto a regionserver in a cluster. An improvement may be to set a minimum split size in IncreasingToUpperBoundRegionSplitPolicy. Suggestions are welcomed. Thanks~
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel
[ https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213383#comment-14213383 ] Qiang Tian commented on HBASE-12359: Hi Stack, that is strange; the Jenkins run and local run were both fine, and my local run against the latest master branch on a Linux VM is also fine. From the OpenJDK code (presumably similar to the Oracle JDK code), we hit the null below:
{code}
/**
 * Returns any IPv4 address of the given network interface, or
 * null if the interface does not have any IPv4 addresses.
 */
static Inet4Address anyInet4Address(final NetworkInterface interf) {
  return AccessController.doPrivileged(new PrivilegedAction<Inet4Address>() {
    public Inet4Address run() {
      Enumeration<InetAddress> addrs = interf.getInetAddresses();
      while (addrs.hasMoreElements()) {
        InetAddress addr = addrs.nextElement();
        if (addr instanceof Inet4Address) {
          return (Inet4Address)addr;
        }
      }
      return null; // <== here
{code}
and
{code}
/**
 * Joins channel's socket to the given group/interface and
 * optional source address.
 */
private MembershipKey innerJoin(InetAddress group,
                                NetworkInterface interf,
                                InetAddress source)
...
  Inet4Address target = Net.anyInet4Address(interf);
  if (target == null)
    throw new IOException("Network interface not configured for IPv4");
{code}
The interf comes from the MulticastPublisher code below:
{code}
public void connect(Configuration conf) throws IOException {
  NetworkInterface ni = NetworkInterface.getByInetAddress(Addressing.getIpAddress());
{code}
but I did not change that... Where did you run it? Which OS? Thanks.
MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel - Key: HBASE-12359 URL: https://issues.apache.org/jira/browse/HBASE-12359 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0 Reporter: Qiang Tian Assignee: Qiang Tian Priority: Minor Attachments: hbase-12359-master.patch see http://osdir.com/ml/general/2014-10/msg56689.html
{code}
2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting cluster
java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster
  at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
  at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
  at org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:153)
  at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:215)
  at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:94)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
  at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:601)
  at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
  at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
  at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
  at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
  at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
  at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
  at
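The issue title already points at the remedy; a hedged sketch (assumed detail, not quoted from hbase-12359-master.patch) of creating the channel with an explicit protocol family so the JDK check above never trips:
{code}
// Sketch under assumptions: netty 4's NioDatagramChannel can be constructed
// with an explicit InternetProtocolFamily instead of letting the JDK infer
// one from the first (here IPv6) interface.
import io.netty.channel.socket.InternetProtocolFamily;
import io.netty.channel.socket.nio.NioDatagramChannel;

NioDatagramChannel channel = new NioDatagramChannel(InternetProtocolFamily.IPv4);
// an IPv4-family channel can join an IPv4 multicast group even when the
// machine's first NIC is an IPv6 tunnel
{code}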
[jira] [Updated] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel
[ https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-12359: --- Status: Open (was: Patch Available)
MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel - Key: HBASE-12359 URL: https://issues.apache.org/jira/browse/HBASE-12359 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0 Reporter: Qiang Tian Assignee: Qiang Tian Priority: Minor Attachments: hbase-12359-master.patch see http://osdir.com/ml/general/2014-10/msg56689.html
{code}
2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting cluster
java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster
  at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
  at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
  at org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:153)
  at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:215)
  at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:94)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
  at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:601)
  at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
  at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
  at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
  at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
  at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
  at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
  at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
  at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
  at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group
  at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
  at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
  at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
  at org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
  at org.apache.hadoop.hbase.master.ClusterStatusPublisher.<init>(ClusterStatusPublisher.java:121)
  at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:307)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
  at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142)
  ... 26 more
{code}
the exception comes from the JDK code below:
{code}
if (group instanceof Inet4Address) {
  if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
    throw new
[jira] [Updated] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel
[ https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-12359: --- Status: Patch Available (was: Open) try Jenkins again. MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel - Key: HBASE-12359 URL: https://issues.apache.org/jira/browse/HBASE-12359 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0 Reporter: Qiang Tian Assignee: Qiang Tian Priority: Minor Attachments: hbase-12359-master.patch see http://osdir.com/ml/general/2014-10/msg56689.html {code} 2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting cluster java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145) at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215) at org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752) at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779) at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865) at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394) at org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273) at org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121) at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142) ... 26 more {code} the exception comes from the JDK code below: {code} 765 if (group instanceof Inet4Address) { 766 if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group()) 767 throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group"); {code}
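The check quoted above passes once the channel's protocol family matches the group address. Below is a minimal sketch of that idea at the netty level (an illustration only, not the attached hbase-12359-master.patch). It assumes netty 4's NioDatagramChannel(InternetProtocolFamily) constructor, and 226.1.1.3/16100 are stand-ins for the hbase.status.multicast.address.ip/.port defaults:
{code}
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.InetSocketAddress;

import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.InternetProtocolFamily;
import io.netty.channel.socket.nio.NioDatagramChannel;

public class FamilyAwareJoinSketch {
  public static void main(String[] args) throws Exception {
    InetAddress group = InetAddress.getByName("226.1.1.3"); // stand-in for hbase.status.multicast.address.ip
    int port = 16100;                                       // stand-in for hbase.status.multicast.address.port

    // Derive the family from the group address instead of taking the JVM default,
    // so an IPv6-first host can still join an IPv4 multicast group.
    InternetProtocolFamily family = group instanceof Inet6Address
        ? InternetProtocolFamily.IPv6 : InternetProtocolFamily.IPv4;

    NioDatagramChannel channel = new NioDatagramChannel(family);
    channel.config().setReuseAddress(true);

    NioEventLoopGroup loop = new NioEventLoopGroup(1);
    loop.register(channel).sync();   // a real subscriber would add pipeline handlers first
    channel.bind(new InetSocketAddress(port)).sync();
    channel.joinGroup(group).sync(); // no longer trips the innerJoin family check
  }
}
{code}
Constructing the channel directly, rather than via Bootstrap.channel(NioDatagramChannel.class), is what makes the explicit-family constructor reachable: the class-based factory can only call the no-arg constructor, which falls back to the platform default family.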
[jira] [Commented] (HBASE-12451) IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster
[ https://issues.apache.org/jira/browse/HBASE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205828#comment-14205828 ] Qiang Tian commented on HBASE-12451: A minimum split size is good and simple enough for me.. users could have a tradeoff between automatic tuning and customization based on knowledge of their workload... (many times we do not want to expose too many configuration parameters, but it looks really useful in some cases :-)) Basing it on total region count looks hard to control, e.g. if a user pre-splits many regions, as in http://search-hadoop.com/m/DHED4aS08G1 with 240 regions, the size will be quite big unless hbase.increasing.policy.initial.size is also configured.. IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster Key: HBASE-12451 URL: https://issues.apache.org/jira/browse/HBASE-12451 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0 Currently IncreasingToUpperBoundRegionSplitPolicy is the default region split policy. In this policy, split size is the number of regions that are on this server that all are of the same table, cubed, times 2x the region flush size. But when unloading regions of a regionserver in a cluster using region_mover.rb, the number of regions that are on this server that all are of the same table will decrease, and the split size will decrease too, which may cause the remaining regions on that regionserver to split. Region splits also happen when loading regions onto a regionserver in a cluster. An improvement may be to set a minimum split size in IncreasingToUpperBoundRegionSplitPolicy. Suggestions are welcomed. Thanks~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
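To make the tradeoff concrete, here is a small sketch of the cube rule described above together with the proposed floor. Method and parameter names are placeholders, not the actual IncreasingToUpperBoundRegionSplitPolicy fields, and the 1 GB floor is only an assumed value:
{code}
// Sketch of the rule described above plus the proposed minimum split size.
public class MinSplitSizeSketch {
  static long splitSize(int sameTableRegionsOnServer, long flushSize,
                        long maxFileSize, long minSplitSize) {
    // regions^3 * 2 * flushSize, capped by the configured max file size
    long size = (long) Math.pow(sameTableRegionsOnServer, 3) * 2L * flushSize;
    size = Math.min(size, maxFileSize);  // existing upper bound
    return Math.max(size, minSplitSize); // proposed lower bound
  }

  public static void main(String[] args) {
    long flush = 128L << 20, max = 10L << 30, min = 1L << 30; // assumed values
    // During a rolling restart the same-table region count can drop to 1;
    // without the floor the threshold collapses to 2 * flushSize = 256 MB.
    System.out.println(splitSize(1, flush, max, 0));   // 268435456 (256 MB)
    System.out.println(splitSize(1, flush, max, min)); // 1073741824 (1 GB floor)
  }
}
{code}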
[jira] [Commented] (HBASE-12451) IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster
[ https://issues.apache.org/jira/browse/HBASE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206040#comment-14206040 ] Qiang Tian commented on HBASE-12451: Thanks Duo, forgot that. With default values, it looks like the region split size reaches the upper limit after 3 regions. According to http://hbase.apache.org/book/ops.capacity.html, region count and region size are the most important factors, but there is no clear answer for region count? bq. If we already have 240 regions of a table, and there is only one region of this table on a regionserver, should the region have a small split size? The regions should be evenly spread across the RSs (8 RSs in that case) IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster Key: HBASE-12451 URL: https://issues.apache.org/jira/browse/HBASE-12451 Project: HBase Issue Type: Bug Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0 Currently IncreasingToUpperBoundRegionSplitPolicy is the default region split policy. In this policy, split size is the number of regions that are on this server that all are of the same table, cubed, times 2x the region flush size. But when unloading regions of a regionserver in a cluster using region_mover.rb, the number of regions that are on this server that all are of the same table will decrease, and the split size will decrease too, which may cause the remaining regions on that regionserver to split. Region splits also happen when loading regions onto a regionserver in a cluster. An improvement may be to set a minimum split size in IncreasingToUpperBoundRegionSplitPolicy. Suggestions are welcomed. Thanks~ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
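A back-of-envelope check of the "upper limit after 3 regions" remark, assuming the stock 128 MB memstore flush size and 10 GB max file size (both configurable, so these are assumptions, not fixed values):
{code}
// Worked numbers for the cube rule with assumed defaults.
public class SplitSizeTable {
  public static void main(String[] args) {
    long flushSize = 128L << 20;  // assumed hbase.hregion.memstore.flush.size
    long maxFileSize = 10L << 30; // assumed hbase.hregion.max.filesize
    for (int regions = 1; regions <= 4; regions++) {
      long size = Math.min((long) Math.pow(regions, 3) * 2L * flushSize, maxFileSize);
      System.out.printf("%d region(s): split at %,d bytes%n", regions, size);
    }
  }
}
{code}
With those values the threshold climbs 256 MB, 2 GB, 6.75 GB, and is capped at 10 GB from the fourth region on, which matches the "after 3 regions" observation.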
[jira] [Commented] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel
[ https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14199630#comment-14199630 ] Qiang Tian commented on HBASE-12359: Hi [~stack], run on Linux is fine. thanks. MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel - Key: HBASE-12359 URL: https://issues.apache.org/jira/browse/HBASE-12359 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0 Reporter: Qiang Tian Assignee: Qiang Tian Priority: Minor Attachments: hbase-12359-master.patch see http://osdir.com/ml/general/2014-10/msg56689.html {code} 014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting cluster java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145) at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215) at org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752) at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779) at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865) at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394) at org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273) at 
org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121) at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142) ... 26 more {code} the exception comes from the JDK code below: {code} 765 if (group instanceof Inet4Address) { 766 if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group()) 767 throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group"); {code}
[jira] [Commented] (HBASE-12336) RegionServer failed to shutdown for NodeFailoverWorker thread
[ https://issues.apache.org/jira/browse/HBASE-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14199654#comment-14199654 ] Qiang Tian commented on HBASE-12336: Hi [~stack], as I understand it, ZOOKEEPER-2012 could apply to this issue as well. The root of the problem is that zk uses 2 queues for request handling. When a packet is on neither of the 2 queues, the exception handling in the send thread (in this case, possibly because the cluster was restarted?) will just ignore the packet, so the main thread will never get a response and hangs there. But we need more data for proof.. (so far the occurrence is rare..) thanks. RegionServer failed to shutdown for NodeFailoverWorker thread - Key: HBASE-12336 URL: https://issues.apache.org/jira/browse/HBASE-12336 Project: HBase Issue Type: Bug Affects Versions: 0.94.11 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0, 0.94.26, 0.98.9, 0.99.2 Attachments: HBASE-12336-trunk-v1.diff, stack After enabling hbase.zookeeper.useMulti in an hbase cluster, we found that a regionserver failed to shut down. All other threads have exited except a NodeFailoverWorker thread. {code} ReplicationExecutor-0 prio=10 tid=0x7f0d40195ad0 nid=0x73a in Object.wait() [0x7f0dc8fe6000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309) - locked 0x0005a16df080 (a org.apache.zookeeper.ClientCnxn$Packet) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:930) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:912) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:531) at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1518) at org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:804) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:612) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {code} The shutdown method of the executor is certainly called in ReplicationSourceManager#join. I am looking for the root cause; suggestions are welcomed. Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332)
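A simplified model of the two-queue hole described above (this is an illustration, not ZooKeeper's actual ClientCnxn code): the caller parks on a per-request packet monitor, and the send thread's connection-loss cleanup can only wake packets it finds on its outgoing or pending queue.
{code}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class PacketWaitModel {
  static final class Packet {
    boolean finished;
    synchronized void waitUntilFinished() throws InterruptedException {
      while (!finished) {
        wait(); // ClientCnxn.submitRequest parks like this in the stack above
      }
    }
    synchronized void finish() {
      finished = true;
      notifyAll();
    }
  }

  static final Queue<Packet> outgoing = new ConcurrentLinkedQueue<Packet>();
  static final Queue<Packet> pending = new ConcurrentLinkedQueue<Packet>();

  // On connection loss the send thread fails every packet it can see; a
  // packet on neither queue at that instant (the timing hole) is never woken.
  static void connectionLost() {
    for (Packet p; (p = outgoing.poll()) != null; ) p.finish();
    for (Packet p; (p = pending.poll()) != null; ) p.finish();
  }

  public static void main(String[] args) throws InterruptedException {
    Packet queued = new Packet();
    outgoing.add(queued);
    Packet inFlight = new Packet(); // sits on neither queue: the timing hole
    connectionLost();
    queued.waitUntilFinished();     // returns: the cleanup failed it
    System.out.println("queued packet woken; the in-flight one would wait forever");
  }
}
{code}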
[jira] [Commented] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel
[ https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194370#comment-14194370 ] Qiang Tian commented on HBASE-12359: Thanks [~ashish singhi], just noticed there are only 2 watchers! :-) ping [~stack] Considering the description of the document. bq. The creation of the channel should specify the ProtocolFamily that corresponds to the address type of the multicast groups that the channel will join. I'd think putting the new factory code into netty is better... MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel - Key: HBASE-12359 URL: https://issues.apache.org/jira/browse/HBASE-12359 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0 Reporter: Qiang Tian Assignee: Qiang Tian Priority: Minor Attachments: hbase-12359-master.patch see http://osdir.com/ml/general/2014-10/msg56689.html {code} 014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting cluster java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145) at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215) at org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752) at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group at 
sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779) at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865) at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394) at org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273) at org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121) at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at
[jira] [Updated] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel
[ https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-12359: --- Summary: MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel (was: TestHCM.java failed on windows ) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel - Key: HBASE-12359 URL: https://issues.apache.org/jira/browse/HBASE-12359 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0 Reporter: Qiang Tian Assignee: Qiang Tian Priority: Minor see http://osdir.com/ml/general/2014-10/msg56689.html {code} 014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting cluster java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145) at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215) at org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752) at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779) at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865) at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394) at org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273) at 
org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121) at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142) ... 26 more {code} the exception comes from the JDK code below: {code} 765 if (group instanceof Inet4Address) { 766 if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group()) 767 throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group"); {code}
[jira] [Updated] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel
[ https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-12359: --- Attachment: hbase-12359-master.patch upload master patch MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel - Key: HBASE-12359 URL: https://issues.apache.org/jira/browse/HBASE-12359 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0 Reporter: Qiang Tian Assignee: Qiang Tian Priority: Minor Attachments: hbase-12359-master.patch see http://osdir.com/ml/general/2014-10/msg56689.html {code} 014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting cluster java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145) at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215) at org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752) at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779) at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865) at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394) at org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273) at 
org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121) at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142) ... 26 more {code} the exception comes from the JDK code below: {code} 765 if (group instanceof Inet4Address) { 766 if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group()) 767 throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group"); {code}
[jira] [Updated] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel
[ https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-12359: --- Status: Patch Available (was: Open) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel - Key: HBASE-12359 URL: https://issues.apache.org/jira/browse/HBASE-12359 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0 Reporter: Qiang Tian Assignee: Qiang Tian Priority: Minor Attachments: hbase-12359-master.patch see http://osdir.com/ml/general/2014-10/msg56689.html {code} 014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting cluster java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145) at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215) at org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752) at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779) at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865) at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394) at org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273) at 
org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121) at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142) ... 26 more {code} the exception comes from the JDK code below: {code} 765 if (group instanceof Inet4Address) { 766 if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group()) 767 throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group"); {code}
[jira] [Commented] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel
[ https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194325#comment-14194325 ] Qiang Tian commented on HBASE-12359: Hi [~ashish singhi], could you try the patch on your windows env? MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel - Key: HBASE-12359 URL: https://issues.apache.org/jira/browse/HBASE-12359 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0 Reporter: Qiang Tian Assignee: Qiang Tian Priority: Minor Attachments: hbase-12359-master.patch see http://osdir.com/ml/general/2014-10/msg56689.html {code} 014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting cluster java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145) at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215) at org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752) at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779) at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865) at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394) at 
org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273) at org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121) at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142) ... 26 more {code} the exception comes from the JDK code below: {code} 765 if (group instanceof Inet4Address) { 766 if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group()) 767 throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group"); {code}
[jira] [Commented] (HBASE-12336) RegionServer failed to shutdown for NodeFailoverWorker thread
[ https://issues.apache.org/jira/browse/HBASE-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194334#comment-14194334 ] Qiang Tian commented on HBASE-12336: bq. Maybe we need to look at server.isStopped inside ReplicationSourceManager more often than we do? [~stack], I think it is a zookeeper issue - the stacktrace shows the zk client thread is stuck: {code} Object.wait() [0x7f0dc8fe6000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309) {code} so it cannot shut down. As shown in ZOOKEEPER-2012, there is a timing hole between the zk client 'main' thread and the 'main-SendThread' thread. RegionServer failed to shutdown for NodeFailoverWorker thread - Key: HBASE-12336 URL: https://issues.apache.org/jira/browse/HBASE-12336 Project: HBase Issue Type: Bug Affects Versions: 0.94.11 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Fix For: 2.0.0, 0.94.26, 0.98.9, 0.99.2 Attachments: HBASE-12336-trunk-v1.diff, stack After enabling hbase.zookeeper.useMulti in an hbase cluster, we found that a regionserver failed to shut down. All other threads have exited except a NodeFailoverWorker thread. {code} ReplicationExecutor-0 prio=10 tid=0x7f0d40195ad0 nid=0x73a in Object.wait() [0x7f0dc8fe6000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309) - locked 0x0005a16df080 (a org.apache.zookeeper.ClientCnxn$Packet) at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:930) at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:912) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:531) at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1518) at org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:804) at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:612) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {code} The shutdown method of the executor is certainly called in ReplicationSourceManager#join. I am looking for the root cause; suggestions are welcomed. Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11368: --- Status: Patch Available (was: Open) Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Assignee: Qiang Tian Attachments: hbase-11368-0.98.5.patch, hbase11368-master.patch, key_stacktrace_hbase10882.TXT, performance_improvement_verification_98.5.patch Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
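For readers following along, a stand-in sketch of the lock interaction the description talks about, using a plain ReentrantReadWriteLock in place of the region's lock; the 60-second timed tryLock is just one assumed way to "fail more gracefully", not HBase's actual behavior:
{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class BulkLoadLockSketch {
  private final ReentrantReadWriteLock regionLock = new ReentrantReadWriteLock();

  void compact(Runnable compaction) {
    regionLock.readLock().lock(); // compaction is a long-running reader
    try {
      compaction.run();
    } finally {
      regionLock.readLock().unlock();
    }
  }

  boolean bulkLoadAllFamilies(Runnable load) throws InterruptedException {
    // A multi-CF bulk load wants the write lock, so it must wait for every
    // in-flight compaction reader to drain; a long compaction is exactly
    // the reported failure window.
    if (!regionLock.writeLock().tryLock(60, TimeUnit.SECONDS)) {
      return false; // report a retriable failure instead of hanging
    }
    try {
      load.run();
      return true;
    } finally {
      regionLock.writeLock().unlock();
    }
  }
}
{code}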
[jira] [Updated] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11368: --- Attachment: hbase11368-master.patch patch for master branch Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Assignee: Qiang Tian Attachments: hbase-11368-0.98.5.patch, hbase11368-master.patch, key_stacktrace_hbase10882.TXT, performance_improvement_verification_98.5.patch Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12357) TestHCM#testClusterStatus is continuosly failing in jenkins
[ https://issues.apache.org/jira/browse/HBASE-12357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186563#comment-14186563 ] Qiang Tian commented on HBASE-12357: it looks like it's here :-) {quote} Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779) at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865) at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394) at org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273) at org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121) at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142) {quote} {code} 756 private MembershipKey innerJoin(InetAddress group, 757 NetworkInterface interf, 758 InetAddress source) 759 throws IOException 760 { 761 if (!group.isMulticastAddress()) 762 throw new IllegalArgumentException("Group not a multicast address"); 763 764 // check multicast address is compatible with this socket 765 if (group instanceof Inet4Address) { 766 if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group()) 767 throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group"); 768 } else if (group instanceof Inet6Address) { 769 if (family != StandardProtocolFamily.INET6) 770 throw new IllegalArgumentException("Only IPv6 sockets can join IPv6 multicast group"); 771 } else { 772 throw new IllegalArgumentException("Address type not supported"); 773 } {code} Looks like we need to specify {{StandardProtocolFamily.INET6}} for an IPv6 addr..? TestHCM#testClusterStatus is continuosly failing in jenkins --- Key: HBASE-12357 URL: https://issues.apache.org/jira/browse/HBASE-12357 Project: HBase Issue Type: Bug Components: test Affects Versions: 1.0.0 Reporter: Ashish Singhi Assignee: Dima Spivak {noformat}Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 187.475 sec FAILURE! - in org.apache.hadoop.hbase.client.TestHCM testClusterStatus(org.apache.hadoop.hbase.client.TestHCM) Time elapsed: 41.477 sec ERROR! java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> at junit.framework.Assert.fail(Assert.java:57) at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193) at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537) at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
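The same point at the plain java.nio level: a hypothetical sketch that opens the channel with the family matching the group address, so the innerJoin check quoted above passes. The group, port, and interface name here are illustrative values, not HBase configuration:
{code}
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.StandardProtocolFamily;
import java.net.StandardSocketOptions;
import java.nio.channels.DatagramChannel;

public class JoinIpv4GroupSketch {
  public static void main(String[] args) throws Exception {
    InetAddress group = InetAddress.getByName("226.1.1.3"); // sample IPv4 group
    NetworkInterface nic = NetworkInterface.getByName("eth0"); // assumed interface name

    // Open with INET for an IPv4 group (INET6 for an IPv6 group) instead of
    // relying on the platform default, which may be INET6 on a dual-stack host.
    DatagramChannel ch = DatagramChannel.open(StandardProtocolFamily.INET)
        .setOption(StandardSocketOptions.SO_REUSEADDR, true)
        .bind(new InetSocketAddress(16100)); // sample port
    ch.setOption(StandardSocketOptions.IP_MULTICAST_IF, nic);
    ch.join(group, nic); // an INET6 channel would throw IllegalArgumentException here
  }
}
{code}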
[jira] [Updated] (HBASE-12357) TestHCM#testClusterStatus is continuosly failing in jenkins
[ https://issues.apache.org/jira/browse/HBASE-12357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-12357: --- Description: {} {noformat}Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 187.475 sec FAILURE! - in org.apache.hadoop.hbase.client.TestHCM testClusterStatus(org.apache.hadoop.hbase.client.TestHCM) Time elapsed: 41.477 sec ERROR! java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> at junit.framework.Assert.fail(Assert.java:57) at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193) at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537) at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273) {noformat} was: {noformat}Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 187.475 sec FAILURE! - in org.apache.hadoop.hbase.client.TestHCM testClusterStatus(org.apache.hadoop.hbase.client.TestHCM) Time elapsed: 41.477 sec ERROR! java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> at junit.framework.Assert.fail(Assert.java:57) at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193) at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537) at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273) {noformat} TestHCM#testClusterStatus is continuosly failing in jenkins --- Key: HBASE-12357 URL: https://issues.apache.org/jira/browse/HBASE-12357 Project: HBase Issue Type: Bug Components: test Affects Versions: 1.0.0 Reporter: Ashish Singhi Assignee: Dima Spivak {} {noformat}Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 187.475 sec FAILURE! - in org.apache.hadoop.hbase.client.TestHCM testClusterStatus(org.apache.hadoop.hbase.client.TestHCM) Time elapsed: 41.477 sec ERROR! java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> at junit.framework.Assert.fail(Assert.java:57) at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193) at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537) at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12357) TestHCM#testClusterStatus is continuosly failing in jenkins
[ https://issues.apache.org/jira/browse/HBASE-12357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-12357: --- Description: {noformat} 014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting cluster java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145) at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215) at org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752) at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779) at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865) at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394) at org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273) at org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121) at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at 
java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142) ... 26 more Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 187.475 sec FAILURE! - in org.apache.hadoop.hbase.client.TestHCM testClusterStatus(org.apache.hadoop.hbase.client.TestHCM) Time elapsed: 41.477 sec ERROR! java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> at junit.framework.Assert.fail(Assert.java:57) at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193) at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537) at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273) {noformat} was: {} {noformat}Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 187.475 sec FAILURE! - in org.apache.hadoop.hbase.client.TestHCM testClusterStatus(org.apache.hadoop.hbase.client.TestHCM) Time elapsed: 41.477 sec ERROR!
[jira] [Commented] (HBASE-12357) TestHCM#testClusterStatus is continuosly failing in jenkins
[ https://issues.apache.org/jira/browse/HBASE-12357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186580#comment-14186580 ] Qiang Tian commented on HBASE-12357: oops, the stacktrace comes from Ashish's email. The code is the JDK code behind: bq. java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group see the error here: http://docs.oracle.com/javase/7/docs/api/java/nio/channels/MulticastChannel.html we are hitting this: {code} 765 if (group instanceof Inet4Address) { 766 if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group()) 767 throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group"); {code} the parameter group is IPv4, and family is IPv6 (StandardProtocolFamily.INET6), but I do not see the family being specified as IPv6 in ClusterStatusPublisher.java. strange.. TestHCM#testClusterStatus is continuosly failing in jenkins --- Key: HBASE-12357 URL: https://issues.apache.org/jira/browse/HBASE-12357 Project: HBase Issue Type: Bug Components: test Affects Versions: 1.0.0 Reporter: Ashish Singhi Assignee: Dima Spivak {noformat} Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 187.475 sec FAILURE! - in org.apache.hadoop.hbase.client.TestHCM testClusterStatus(org.apache.hadoop.hbase.client.TestHCM) Time elapsed: 41.477 sec ERROR! java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> at junit.framework.Assert.fail(Assert.java:57) at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193) at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537) at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12357) TestHCM#testClusterStatus is continuosly failing in jenkins
[ https://issues.apache.org/jira/browse/HBASE-12357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186584#comment-14186584 ] Qiang Tian commented on HBASE-12357: [~ashish singhi], can you confirm whether it is the same problem? TestHCM#testClusterStatus is continuosly failing in jenkins --- Key: HBASE-12357 URL: https://issues.apache.org/jira/browse/HBASE-12357 Project: HBase Issue Type: Bug Components: test Affects Versions: 1.0.0 Reporter: Ashish Singhi Assignee: Dima Spivak {noformat} Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 187.475 sec FAILURE! - in org.apache.hadoop.hbase.client.TestHCM testClusterStatus(org.apache.hadoop.hbase.client.TestHCM) Time elapsed: 41.477 sec ERROR! java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> at junit.framework.Assert.fail(Assert.java:57) at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193) at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537) at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12357) TestHCM#testClusterStatus is continuosly failing in jenkins
[ https://issues.apache.org/jira/browse/HBASE-12357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186602#comment-14186602 ] Qiang Tian commented on HBASE-12357: ah.. sorry for that, [~dimaspivak]. That looks like a bug too; will open a new JIRA. TestHCM#testClusterStatus is continuosly failing in jenkins --- Key: HBASE-12357 URL: https://issues.apache.org/jira/browse/HBASE-12357 Project: HBase Issue Type: Bug Components: test Affects Versions: 1.0.0 Reporter: Ashish Singhi Assignee: Dima Spivak {noformat} Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 187.475 sec FAILURE! - in org.apache.hadoop.hbase.client.TestHCM testClusterStatus(org.apache.hadoop.hbase.client.TestHCM) Time elapsed: 41.477 sec ERROR! java.lang.Exception: Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError> at junit.framework.Assert.fail(Assert.java:57) at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193) at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537) at org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-12359) TestHCM.java failed on windows
Qiang Tian created HBASE-12359: -- Summary: TestHCM.java failed on windows Key: HBASE-12359 URL: https://issues.apache.org/jira/browse/HBASE-12359 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0 Reporter: Qiang Tian Priority: Minor see http://osdir.com/ml/general/2014-10/msg56689.html {code} 014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting cluster java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145) at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215) at org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752) at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779) at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865) at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394) at org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273) at org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121) at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142) ... 26 more {code} the exception comes from the JDK code below:
{code}
765 if (group instanceof Inet4Address) {
766     if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
767         throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
{code}
according to the documentation (http://docs.oracle.com/javase/7/docs/api/java/nio/channels/MulticastChannel.html):
{quote}
The multicast implementation is intended to map directly to the native multicasting facility. Consequently, the following items should be considered when developing an application that receives IP multicast datagrams: The creation of the channel should specify the ProtocolFamily that corresponds to the address type of the multicast groups that the channel will join.
{quote}
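As an illustration of what that javadoc asks for, here is a minimal sketch, adapted from the MulticastChannel javadoc example, that opens the channel with an explicit IPv4 protocol family before joining an IPv4 group. The interface name "eth0", the port, and the group address are assumptions for the example, not values taken from HBase:
{code}
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.StandardProtocolFamily;
import java.net.StandardSocketOptions;
import java.nio.channels.DatagramChannel;
import java.nio.channels.MembershipKey;

public class JoinV4Group {
  public static void main(String[] args) throws Exception {
    // assumption: "eth0" is an interface with an IPv4 address configured
    NetworkInterface ni = NetworkInterface.getByName("eth0");

    // opening the channel with an explicit family makes the socket type
    // match the address type of the group it will join
    DatagramChannel dc = DatagramChannel.open(StandardProtocolFamily.INET)
        .setOption(StandardSocketOptions.SO_REUSEADDR, true)
        .bind(new InetSocketAddress(16100))
        .setOption(StandardSocketOptions.IP_MULTICAST_IF, ni);

    InetAddress group = InetAddress.getByName("226.1.1.1");
    MembershipKey key = dc.join(group, ni); // IPv4 socket joins IPv4 group: no IllegalArgumentException
    System.out.println("joined: " + key);
  }
}
{code}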
[jira] [Assigned] (HBASE-12359) TestHCM.java failed on windows
[ https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian reassigned HBASE-12359: -- Assignee: Qiang Tian TestHCM.java failed on windows --- Key: HBASE-12359 URL: https://issues.apache.org/jira/browse/HBASE-12359 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0 Reporter: Qiang Tian Assignee: Qiang Tian Priority: Minor see http://osdir.com/ml/general/2014-10/msg56689.html {code} 014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting cluster java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145) at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215) at org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752) at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779) at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865) at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394) at org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273) at org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121) at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307) at 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142) ... 26 more {code} the exception comes from the JDK code below:
{code}
765 if (group instanceof Inet4Address) {
766     if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
767         throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
{code}
according to the documentation (http://docs.oracle.com/javase/7/docs/api/java/nio/channels/MulticastChannel.html):
{quote}
The multicast implementation is intended to map directly to the native multicasting facility. Consequently, the following items should be considered when developing an application that receives IP multicast datagrams: The creation of the channel should specify the ProtocolFamily that corresponds to the address type of the multicast groups that the channel will join.
{quote}
[jira] [Commented] (HBASE-12359) TestHCM.java failed on windows
[ https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186632#comment-14186632 ] Qiang Tian commented on HBASE-12359: we should support both IPv4 and IPv6 family, right? TestHCM.java failed on windows --- Key: HBASE-12359 URL: https://issues.apache.org/jira/browse/HBASE-12359 Project: HBase Issue Type: Bug Components: master Affects Versions: 2.0.0 Reporter: Qiang Tian Assignee: Qiang Tian Priority: Minor see http://osdir.com/ml/general/2014-10/msg56689.html {code} 014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting cluster java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145) at org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215) at org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215) at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765) at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752) at org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:601) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197) Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast group at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779) at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865) at io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394) at org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273) at org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121) at 
org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:525) at org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142) ... 26 more {code} the exception comes from the JDK code below:
{code}
765 if (group instanceof Inet4Address) {
766     if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
767         throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
{code}
according to the documentation (http://docs.oracle.com/javase/7/docs/api/java/nio/channels/MulticastChannel.html)
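On supporting both families: a minimal sketch of the idea, assuming the multicast address comes from configuration (the helper and its argument are hypothetical, not HBase code), is to derive the protocol family from the configured group address instead of letting the JDK infer it from the first network interface:
{code}
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.ProtocolFamily;
import java.net.StandardProtocolFamily;
import java.nio.channels.DatagramChannel;

final class MulticastChannels {
  // pick the family that matches the group address rather than the
  // default derived from the first network interface
  static DatagramChannel openFor(String mcastAddress) throws Exception {
    InetAddress group = InetAddress.getByName(mcastAddress);
    ProtocolFamily family = (group instanceof Inet6Address)
        ? StandardProtocolFamily.INET6
        : StandardProtocolFamily.INET;
    return DatagramChannel.open(family);
  }
}
{code}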
[jira] [Commented] (HBASE-12336) RegionServer failed to shutdown for NodeFailoverWorker thread
[ https://issues.apache.org/jira/browse/HBASE-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14184939#comment-14184939 ] Qiang Tian commented on HBASE-12336: the zookeeper stacktrace looks similar to ZOOKEEPER-2012 - ClientCnxn.submitRequest never returns. http://pastebin.com/xU4MSq9k. is there any zookeeper error message in the RS log? RegionServer failed to shutdown for NodeFailoverWorker thread - Key: HBASE-12336 URL: https://issues.apache.org/jira/browse/HBASE-12336 Project: HBase Issue Type: Bug Affects Versions: 0.94.11 Reporter: Liu Shaohui Assignee: Liu Shaohui Priority: Minor Attachments: stack After enabling hbase.zookeeper.useMulti in the hbase cluster, we found that the regionserver failed to shut down. All other threads have exited except a NodeFailoverWorker thread.
{code}
ReplicationExecutor-0 prio=10 tid=0x7f0d40195ad0 nid=0x73a in Object.wait() [0x7f0dc8fe6000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
- locked 0x0005a16df080 (a org.apache.zookeeper.ClientCnxn$Packet)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:930)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:912)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:531)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1518)
at org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:804)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:612)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
{code}
We are sure that the shutdown method of the executor is called in ReplicationSourceManager#join. I am looking for the root cause; suggestions are welcome. Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332)
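As general background (this is a generic pattern, not the HBase fix), a shutdown like this is usually bounded by pairing shutdown() with a timed awaitTermination() and falling back to shutdownNow(), which interrupts workers blocked in a remote call such as the ZooKeeper multi above; a minimal sketch:
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

final class ShutdownHelper {
  // bound the wait so a worker stuck in a remote call cannot block
  // shutdown forever; the 30s budget here is an arbitrary example
  static void shutdownWithTimeout(ExecutorService executor) throws InterruptedException {
    executor.shutdown(); // stop accepting new tasks
    if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
      executor.shutdownNow(); // interrupt workers that are still blocked
    }
  }
}
{code}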
[jira] [Updated] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11368: --- Attachment: key_stacktrace_hbase10882.TXT Hi [~stack], sorry for the confusion. Let me explain from scratch:
1) The root cause of the problem - HRegion#lock. From the stacktrace in HBASE-10882 (also see the attached key_stacktrace_hbase10882.TXT), the event sequence is:
1.1) the compaction acquires the readlock of HRegion#lock,
1.2) the bulkload tries to acquire the writelock of HRegion#lock if there are multiple CFs. It has to wait for the compaction to release the readlock.
1.3) scanners try to acquire the readlock of HRegion#lock. They have to wait for the bulkload to release the writelock.
So both the bulkload and the scanners are blocked on HRegion#lock by the compaction.
2) What is HRegion#lock used for? Investigation of HRegion#lock shows it was originally designed to protect region close ONLY. If someone, such as a region split, wants to close the region, it needs to wait for others to release the readlock. Then HBASE-4552 used the lock to solve the multi-CF bulkload consistency issue; now we see it is too heavy.
3) Can we avoid HRegion#lock in bulkload? The answer is yes. Internally, HStore#DefaultStoreFileManager#storefiles keeps track of the on-disk HFiles for a CF. The bulkload takes the steps below:
3.1) move the HFiles directly into the region directory
3.2) add them into the {{storefiles}} list
3.3) notify StoreScanner that the HFile list has changed, which is done by resetting the StoreScanner#heap to null. This forces existing StoreScanner instances to reinitialize based on the new HFiles seen on disk in the next scan/read request.
Steps 3.2 and 3.3 are synchronized by HStore#lock, so we have CF-level scan-bulkload consistency. To achieve multi-CF scan-bulkload consistency without HRegion#lock, we still need another region-level lock --- a RegionScanner is composed of multiple StoreScanners, and a StoreScanner (a CF scanner) is composed of a MemStoreScanner and multiple StoreFileScanners. The RegionScannerImpl#storeHeap (and joinedHeap) is just the entry point for the multiple StoreScanners. To have multi-CF consistency we need synchronization here - a lock is needed, but it is used only between scan and bulkload; a sketch of the idea follows below.
Regarding the code change you referenced, performance_improvement_verification_98.5.patch simulates the event sequence described in #1, for testing purposes only. Currently I use 98.5 for the test since it is stable and makes it easy to evaluate the effect of the change. Thanks.
Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Assignee: Qiang Tian Attachments: hbase-11368-0.98.5.patch, key_stacktrace_hbase10882.TXT, performance_improvement_verification_98.5.patch Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
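A minimal sketch of the region-level scan-vs-bulkload lock described above; the class and method names are illustrative stand-ins, not the attached patch:
{code}
import java.util.concurrent.Callable;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class BulkLoadScanLock {
  // hypothetical region-level lock used only between bulkload and scan
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  void bulkLoadAllFamilies(Runnable moveFilesAndNotifyReaders) {
    lock.writeLock().lock(); // block scanner re-initialization until HFiles land in all CFs
    try {
      moveFilesAndNotifyReaders.run();
    } finally {
      lock.writeLock().unlock();
    }
  }

  <T> T scanEntryPoint(Callable<T> nextCall) throws Exception {
    lock.readLock().lock(); // many scanners may proceed concurrently
    try {
      return nextCall.call();
    } finally {
      lock.readLock().unlock();
    }
  }
}
{code}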
[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14184066#comment-14184066 ] Qiang Tian commented on HBASE-11368: the attachments: {{key_stacktrace_hbase10882.TXT}} : the problem stacktrace {{hbase-11368-0.98.5.patch}} : the fix {{performance_improvement_verification_98.5.patch}}: the testcase to verify performance improvement Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Assignee: Qiang Tian Attachments: hbase-11368-0.98.5.patch, key_stacktrace_hbase10882.TXT, performance_improvement_verification_98.5.patch Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11368: --- Attachment: performance_improvement_verification_98.5.patch A simple comparison test using the updated TestHRegionServerBulkLoad.java. The numbers are just for reference; the real perf improvement might depend on a combination of factors, such as compaction time, bulkload time, scan/read workload type, request concurrency, etc.
98.5: ---
2014-10-24 02:30:03,399 INFO [main] regionserver.TestHRegionServerBulkLoad(345): loaded 16
2014-10-24 02:30:03,399 INFO [main] regionserver.TestHRegionServerBulkLoad(346): compations 16
2014-10-24 02:30:03,399 INFO [main] regionserver.TestHRegionServerBulkLoad(348): Scanners: //average # with 50 scanners
2014-10-24 02:30:03,399 INFO [main] regionserver.TestHRegionServerBulkLoad(350): scanned 73
2014-10-24 02:30:03,400 INFO [main] regionserver.TestHRegionServerBulkLoad(351): verified 18000 rows
98.5+patch: //since bulkload has a smaller conflict with compaction, we get more bulkload/compaction requests in the fixed test cycle (5 minutes)
2014-10-24 02:41:19,071 INFO [main] regionserver.TestHRegionServerBulkLoad(344): Loaders:
2014-10-24 02:41:19,072 INFO [main] regionserver.TestHRegionServerBulkLoad(345): loaded 43
2014-10-24 02:41:19,072 INFO [main] regionserver.TestHRegionServerBulkLoad(346): compations 43
2014-10-24 02:41:19,073 INFO [main] regionserver.TestHRegionServerBulkLoad(348): Scanners: //since bulkload has a smaller conflict with scan, we get more scans in the fixed test cycle (5 minutes) //average # for 50 scanners
2014-10-24 02:41:19,073 INFO [main] regionserver.TestHRegionServerBulkLoad(350): scanned 92
2014-10-24 02:41:19,073 INFO [main] regionserver.TestHRegionServerBulkLoad(351): verified 25000 rows
Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Assignee: Qiang Tian Attachments: hbase-11368-0.98.5.patch, performance_improvement_verification_98.5.patch Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182647#comment-14182647 ] Qiang Tian commented on HBASE-11368: Hi [~stack], [~apurtell], any comments? thanks! Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Assignee: Qiang Tian Attachments: hbase-11368-0.98.5.patch, performance_improvement_verification_98.5.patch Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181168#comment-14181168 ] Qiang Tian commented on HBASE-11368: initial YCSB test:
Env: --- hadoop 2.2.0, YCSB 1.0.4 (Andrew's branch), 3 nodes, 1 master, 2 RS //cluster details omitted since this is just to evaluate the new lock
Steps: --- Followed Andrew's steps (see http://search-hadoop.com/m/DHED4hl7pC/). The seed table has 3 CFs, pre-split into 20 regions. Load 1 million rows to CF 'f1' using workloada, then run 3 iterations of workloadc and workloada respectively. The parameters in each run: bq. -p columnfamily=f1 -p operationcount=100 -s -threads 10
Results: ---
0.98.5: workload c:
[READ], AverageLatency(us), 496.225811
[READ], AverageLatency(us), 510.206831
[READ], AverageLatency(us), 501.256123
workload a:
[READ], AverageLatency(us), 676.4527555821747
[READ], AverageLatency(us), 622.5544771452717
[READ], AverageLatency(us), 628.1365657163067
0.98.5+patch: workload c:
[READ], AverageLatency(us), 536.334437
[READ], AverageLatency(us), 508.40
[READ], AverageLatency(us), 491.416182
workload a:
[READ], AverageLatency(us), 640.3625218319231
[READ], AverageLatency(us), 642.9719823488798
[READ], AverageLatency(us), 631.7491770928287
It looks like there is little performance penalty. I also ran PE in the cluster; since the test table has only 1 CF, the new lock is actually not used. Interestingly, with the patch the performance is even a bit better... Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Assignee: Qiang Tian Attachments: hbase-11368-0.98.5.patch Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12274) Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() may produce null pointer exception
[ https://issues.apache.org/jira/browse/HBASE-12274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174650#comment-14174650 ] Qiang Tian commented on HBASE-12274: Hi Ted, I also ran mvn test with 0.98.6. I did not hit the scanner error, but I did get some other strange failures; the UT does not look very clean. In the RS log, the lease failure does not look expected either.
{code}
org.apache.hadoop.hbase.regionserver.LeaseException: lease '8' does not exist
at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:221)
at org.apache.hadoop.hbase.regionserver.Leases.cancelLease(Leases.java:206)
at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3305)
{code}
it is from a different rpc handler, just before the NPE. we got a NotServingRegionException? do we have more logs? thanks. Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() may produce null pointer exception -- Key: HBASE-12274 URL: https://issues.apache.org/jira/browse/HBASE-12274 Project: HBase Issue Type: Bug Affects Versions: 0.98.6.1 Reporter: Ted Yu Assignee: Ted Yu Fix For: 2.0.0, 0.98.8, 0.99.2 Attachments: 12274-region-server.log, 12274-v2.txt, 12274-v2.txt, 12274-v3.txt I saw the following in region server log: {code} 2014-10-15 03:28:36,976 ERROR [B.DefaultRpcServer.handler=0,queue=0,port=60020] ipc.RpcServer: Unexpected throwable object java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5023) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4932) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4923) at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3245) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29994) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) at java.lang.Thread.run(Thread.java:745) {code} This is where the NPE happened: {code} // Let's see what we have in the storeHeap. KeyValue current = this.storeHeap.peek(); {code} The cause was race between nextInternal(called through nextRaw) and close methods. nextRaw() is not synchronized. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12274) Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() may produce null pointer exception
[ https://issues.apache.org/jira/browse/HBASE-12274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174674#comment-14174674 ] Qiang Tian commented on HBASE-12274: Hi Ted, perhaps I misunderstood. sorry for that. please go ahead. thanks. Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() may produce null pointer exception -- Key: HBASE-12274 URL: https://issues.apache.org/jira/browse/HBASE-12274 Project: HBase Issue Type: Bug Affects Versions: 0.98.6.1 Reporter: Ted Yu Assignee: Ted Yu Fix For: 2.0.0, 0.98.8, 0.99.2 Attachments: 12274-region-server.log, 12274-v2.txt, 12274-v2.txt, 12274-v3.txt I saw the following in region server log: {code} 2014-10-15 03:28:36,976 ERROR [B.DefaultRpcServer.handler=0,queue=0,port=60020] ipc.RpcServer: Unexpected throwable object java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5023) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4932) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4923) at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3245) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29994) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) at java.lang.Thread.run(Thread.java:745) {code} This is where the NPE happened: {code} // Let's see what we have in the storeHeap. KeyValue current = this.storeHeap.peek(); {code} The cause was race between nextInternal(called through nextRaw) and close methods. nextRaw() is not synchronized. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-12266) Slow Scan can cause dead loop in ClientScanner
Qiang Tian created HBASE-12266: -- Summary: Slow Scan can cause dead loop in ClientScanner Key: HBASE-12266 URL: https://issues.apache.org/jira/browse/HBASE-12266 Project: HBase Issue Type: Bug Components: Scanners Affects Versions: 0.96.0 Reporter: Qiang Tian Priority: Minor see http://search-hadoop.com/m/DHED45SVsC1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12266) Slow Scan can cause dead loop in ClientScanner
[ https://issues.apache.org/jira/browse/HBASE-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-12266: --- Attachment: HBASE-12266-master.patch Is there any particular purpose in setting it to true there? thanks. Slow Scan can cause dead loop in ClientScanner --- Key: HBASE-12266 URL: https://issues.apache.org/jira/browse/HBASE-12266 Project: HBase Issue Type: Bug Components: Scanners Affects Versions: 0.96.0 Reporter: Qiang Tian Priority: Minor Attachments: HBASE-12266-master.patch see http://search-hadoop.com/m/DHED45SVsC1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-12266) Slow Scan can cause dead loop in ClientScanner
[ https://issues.apache.org/jira/browse/HBASE-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-12266: --- Status: Patch Available (was: Open) Slow Scan can cause dead loop in ClientScanner --- Key: HBASE-12266 URL: https://issues.apache.org/jira/browse/HBASE-12266 Project: HBase Issue Type: Bug Components: Scanners Affects Versions: 0.96.0 Reporter: Qiang Tian Priority: Minor Attachments: HBASE-12266-master.patch see http://search-hadoop.com/m/DHED45SVsC1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-12266) Slow Scan can cause dead loop in ClientScanner
[ https://issues.apache.org/jira/browse/HBASE-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173259#comment-14173259 ] Qiang Tian commented on HBASE-12266: Thanks guys. Frankly, it looks to me that such retries (including HBASE-7070) just make the code more complicated to read, make it easy to introduce new bugs into a complex system, and are hard to cover with tests. As mentioned in that jira:
bq. 1.A next request is very large, so first time it is failed because of timeout
Since it is caused by a client-side timeout, why not just throw an exception so that the user (or app-layer code) knows about it and sets a bigger value? The timeout value is case by case; that is why we make it configurable, right?
Slow Scan can cause dead loop in ClientScanner --- Key: HBASE-12266 URL: https://issues.apache.org/jira/browse/HBASE-12266 Project: HBase Issue Type: Bug Components: Scanners Affects Versions: 0.96.0 Reporter: Qiang Tian Priority: Minor Attachments: 12266-v2.txt, HBASE-12266-master.patch see http://search-hadoop.com/m/DHED45SVsC1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
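Since the point above is that the timeout is configurable, here is a small sketch of bumping it on the client side; the exact key name can vary across HBase versions, so treat it as an assumption to verify for your release:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

final class ScannerTimeoutExample {
  static Configuration withLargerScannerTimeout() {
    Configuration conf = HBaseConfiguration.create();
    // raise the client-side scanner timeout for slow scans (value in ms);
    // verify the key name for your HBase version
    conf.setInt("hbase.client.scanner.timeout.period", 120000);
    return conf;
  }
}
{code}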
[jira] [Commented] (HBASE-12274) Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() may produce null pointer exception
[ https://issues.apache.org/jira/browse/HBASE-12274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173292#comment-14173292 ] Qiang Tian commented on HBASE-12274: Hi Ted, is it worth finding out who closes the scanner (is it possible it is closed due to a bug)? I ask this because I have the same question for StoreScanner; it looks like only the scanner itself can call close. Regarding synchronized: I did not test it myself; the cost of a synchronized method may be big, but from what I googled a synchronized block is fine, even better than a lock. Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() may produce null pointer exception -- Key: HBASE-12274 URL: https://issues.apache.org/jira/browse/HBASE-12274 Project: HBase Issue Type: Bug Affects Versions: 0.98.6.1 Reporter: Ted Yu Assignee: Ted Yu Fix For: 2.0.0, 0.98.8, 0.99.2 Attachments: 12274-v2.txt, 12274-v2.txt, 12274-v3.txt I saw the following in region server log: {code} 2014-10-15 03:28:36,976 ERROR [B.DefaultRpcServer.handler=0,queue=0,port=60020] ipc.RpcServer: Unexpected throwable object java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5023) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4932) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4923) at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3245) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29994) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) at java.lang.Thread.run(Thread.java:745) {code} This is where the NPE happened: {code} // Let's see what we have in the storeHeap. KeyValue current = this.storeHeap.peek(); {code} The cause was race between nextInternal(called through nextRaw) and close methods. nextRaw() is not synchronized. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
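To make the race concrete, a minimal sketch of guarding nextRaw() and close() with a common monitor so close() cannot null the heap mid-next; all names here are illustrative stand-ins, not the actual HBase classes or the committed patch:
{code}
import java.io.IOException;
import java.util.List;

final class GuardedScanner {
  private final Object lock = new Object(); // hypothetical guard object
  private Object storeHeap = new Object();  // stand-in for the real heap

  boolean nextRaw(List<Object> out) throws IOException {
    synchronized (lock) {
      if (storeHeap == null) {
        throw new IOException("scanner already closed"); // fail fast instead of NPE
      }
      // ... peek the heap and populate 'out' ...
      return false;
    }
  }

  void close() {
    synchronized (lock) {
      storeHeap = null; // cannot interleave with nextRaw() any more
    }
  }
}
{code}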
[jira] [Comment Edited] (HBASE-12274) Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() may produce null pointer exception
[ https://issues.apache.org/jira/browse/HBASE-12274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173292#comment-14173292 ] Qiang Tian edited comment on HBASE-12274 at 10/16/14 3:27 AM: -- Hi Ted, is it worth to find who close the scanner(is it possible to close it due to bug)? I asked this because I have the same question for StoreScanner, it looks only itself can call close. Regarding synchronized, I did not test it myself, synchronized method cost may be big, but googled synchronized block is fine, even better than lock. http://t.cn/R7zVKKB,http://t.cn/R7zVKK1 was (Author: tianq): Hi Ted, is it worth to find who close the scanner(is it possible to close it due to bug)? I asked this because I have the same question for StoreScanner, it looks only itself can call close. Regarding synchronized, I did not test it myself, synchronized method cost may be big, but googled synchronized block is fine, even better than lock. Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() may produce null pointer exception -- Key: HBASE-12274 URL: https://issues.apache.org/jira/browse/HBASE-12274 Project: HBase Issue Type: Bug Affects Versions: 0.98.6.1 Reporter: Ted Yu Assignee: Ted Yu Fix For: 2.0.0, 0.98.8, 0.99.2 Attachments: 12274-v2.txt, 12274-v2.txt, 12274-v3.txt I saw the following in region server log: {code} 2014-10-15 03:28:36,976 ERROR [B.DefaultRpcServer.handler=0,queue=0,port=60020] ipc.RpcServer: Unexpected throwable object java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5023) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4932) at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4923) at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3245) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29994) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94) at java.lang.Thread.run(Thread.java:745) {code} This is where the NPE happened: {code} // Let's see what we have in the storeHeap. KeyValue current = this.storeHeap.peek(); {code} The cause was race between nextInternal(called through nextRaw) and close methods. nextRaw() is not synchronized. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11368: --- Attachment: hbase-11368-0.98.5.patch I forgot that StoreScanner is per CF... the earlier analysis is wrong: {quote} After DefaultStoreFileManager#storefiles is updated in HStore#bulkLoadHFile, notifyChangedReadersObservers is called to reset the StoreScanner#heap, so checkReseek-resetScannerStack will be triggered in next scan/read to recreate store scanners based on new storefiles. so we could introduce a new region level rwlock multiCFLock, HRegion#bulkLoadHFiles acquires the writelock before multi-CF HStore.bulkLoadHFile call. and StoreScanner#resetScannerStack acquires the readlock. this way the scanners are recreated after all CFs' store files are populated. {quote} Instead, the new lock should be put at the RegionScanner layer; see the patch attached. The mvn test and TestHRegionServerBulkLoad (the large test for atomic bulkload) passed; I still need to run the large tests and a performance test (any suggestions for it? YCSB?). The lock can be further limited to a smaller scope by splitting HStore#bulkLoadHFile into 2 parts: 1) rename the bulkload files and put the new files into the store file list; 2) notifyChangedReadersObservers. Only #2 needs the lock. If HDFS file rename is fast, the split may not be needed. Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Assignee: Qiang Tian Attachments: hbase-11368-0.98.5.patch Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170649#comment-14170649 ] Qiang Tian commented on HBASE-11368: it looks to me that the patch shows its value only when there is a long compaction plus gets/scans. Not sure if [~victorunique] wants to try it in some test env? thanks. Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Assignee: Qiang Tian Attachments: hbase-11368-0.98.5.patch Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11890) HBase REST Client is hard coded to http protocol
[ https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168065#comment-14168065 ] Qiang Tian commented on HBASE-11890: Hi [~stack], there is also a doc update :-) HBase-11890-doc.patch HBase REST Client is hard coded to http protocol Key: HBASE-11890 URL: https://issues.apache.org/jira/browse/HBASE-11890 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.96.2 Reporter: Eric Yang Assignee: Qiang Tian Fix For: 2.0.0, 0.98.7, 0.99.1 Attachments: HBase-11890-doc.patch, HBase-11890-master-v1.patch, HBase-11890-master.patch HBase REST Client executePathOnly only supports http. It would be nice if there is a option to enable REST API client to connect through SSL. org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which protocol can be used, we can either set flag in Cluster class or introduce a parameter in Client class to toggle SSL. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166466#comment-14166466 ] Qiang Tian commented on HBASE-11368: update: the idea will cause a deadlock, since bulkload and scanner follow different orders when acquiring the bulkload lock and StoreScanner.lock (see the toy reproduction below). Will look at whether we can lower the granularity of the StoreScanner lock. Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Assignee: Qiang Tian Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
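For readers unfamiliar with this failure mode, a self-contained toy that reproduces the inconsistent-lock-ordering deadlock described above (the lock names only mirror the discussion; this is not HBase code):
{code}
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderDeadlock {
  static final ReentrantLock bulkLoadLock = new ReentrantLock();
  static final ReentrantLock storeScannerLock = new ReentrantLock();

  public static void main(String[] args) {
    // thread A, like the bulkload path: bulkLoadLock then storeScannerLock
    new Thread(() -> {
      bulkLoadLock.lock();
      try {
        pause();
        storeScannerLock.lock(); // waits for B forever
        storeScannerLock.unlock();
      } finally {
        bulkLoadLock.unlock();
      }
    }).start();
    // thread B, like the scanner path: storeScannerLock then bulkLoadLock
    new Thread(() -> {
      storeScannerLock.lock();
      try {
        pause();
        bulkLoadLock.lock(); // waits for A forever -> deadlock
        bulkLoadLock.unlock();
      } finally {
        storeScannerLock.unlock();
      }
    }).start();
  }

  static void pause() {
    try { Thread.sleep(100); } catch (InterruptedException ignored) { }
  }
}
{code}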
[jira] [Assigned] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian reassigned HBASE-11368: -- Assignee: Qiang Tian Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Assignee: Qiang Tian Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164957#comment-14164957 ] Qiang Tian commented on HBASE-11368: Thanks [~jinghe], is this the right way to run the bulkload test? {{mvn test -Dtest=TestHRegionServerBulkLoad}} the test is supposed to run for 5 minutes, but it exits after only about 1 minute. is that expected? Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Assignee: Qiang Tian Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163162#comment-14163162 ] Qiang Tian commented on HBASE-11368: ideas for lowering the lock granularity (based on the 0.98.5 code base):
1) read/scan - is it the primary goal for atomic multi-CF bulkload in HBASE-4552? After DefaultStoreFileManager#storefiles is updated in HStore#bulkLoadHFile, notifyChangedReadersObservers is called to reset the StoreScanner#heap, so checkReseek-resetScannerStack will be triggered on the next scan/read to recreate store scanners based on the new storefiles. So we could introduce a new region-level rwlock multiCFLock: HRegion#bulkLoadHFiles acquires the writelock before the multi-CF HStore.bulkLoadHFile call, and StoreScanner#resetScannerStack acquires the readlock. This way the scanners are recreated only after all CFs' store files are populated.
2) split region. The region will be closed in SplitTransaction#stepsBeforePONR, which falls into the HRegion#lock protection area. Bulk load still needs to acquire its readlock at the start.
3) memstore flush. We flush to a new file which is not related to the loaded files.
4) compaction. The compaction is performed store by store. If bulkload inserts new files into {{storefiles}} during the selectCompaction process, the file list to be compacted might be affected, e.g. the compaction for some CFs might not include the newly loaded files while others might. But this should not impact data integrity or read behavior? At the end of the compaction, {{storefiles}} access is still protected by HStore#lock if there is a bulk load change to the same CF.
comments? thanks Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long
[ https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163024#comment-14163024 ] Qiang Tian commented on HBASE-11368: As [~stack] mentioned in http://search-hadoop.com/m/DHED4NR0wT, the HRegion#lock is to protect region close. The comments in HRegion.java and the fact that only HRegion#doClose locks the writelock (if we do not consider HRegion#startBulkRegionOperation) also show that. So using HRegion#lock to protect multi-CF bulkload in HBASE-4552 looks too heavy-weight? From the stacktrace of HBASE-10882, all the reads/scans are blocked since the bulkload is waiting for lock.writelock; however, the compaction has already acquired lock.readlock and is reading data, a time-consuming operation. A related topic is discussed again in http://search-hadoop.com/m/DHED4I11p31. Perhaps we need another region-level lock. Multi-column family BulkLoad fails if compactions go on too long Key: HBASE-11368 URL: https://issues.apache.org/jira/browse/HBASE-11368 Project: HBase Issue Type: Bug Reporter: stack Compactions take a read lock. If a multi-column family region, before bulk loading, we want to take a write lock on the region. If the compaction takes too long, the bulk load fails. Various recipes include: + Making smaller regions (lame) + [~victorunique] suggests major compacting just before bulk loading over in HBASE-10882 as a work around. Does the compaction need a read lock for that long? Does the bulk load need a full write lock when multiple column families? Can we fail more gracefully at least? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11890) HBase REST Client is hard coded to http protocol
[ https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11890: --- Attachment: HBase-11890-doc.patch HBase REST Client is hard coded to http protocol Key: HBASE-11890 URL: https://issues.apache.org/jira/browse/HBASE-11890 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.96.2 Reporter: Eric Yang Assignee: Qiang Tian Attachments: HBase-11890-doc.patch, HBase-11890-master.patch HBase REST Client executePathOnly only supports http. It would be nice if there is a option to enable REST API client to connect through SSL. org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which protocol can be used, we can either set flag in Cluster class or introduce a parameter in Client class to toggle SSL. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11890) HBase REST Client is hard coded to http protocol
[ https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11890: --- Status: Patch Available (was: Open) HBase REST Client is hard coded to http protocol Key: HBASE-11890 URL: https://issues.apache.org/jira/browse/HBASE-11890 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.96.2 Reporter: Eric Yang Assignee: Qiang Tian Attachments: HBase-11890-doc.patch, HBase-11890-master.patch HBase REST Client executePathOnly only supports http. It would be nice if there is a option to enable REST API client to connect through SSL. org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which protocol can be used, we can either set flag in Cluster class or introduce a parameter in Client class to toggle SSL. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11890) HBase REST Client is hard coded to http protocol
[ https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11890: --- Attachment: HBase-11890-master.patch HBase REST Client is hard coded to http protocol Key: HBASE-11890 URL: https://issues.apache.org/jira/browse/HBASE-11890 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.96.2 Reporter: Eric Yang Assignee: Qiang Tian Attachments: HBase-11890-doc.patch, HBase-11890-master.patch HBase REST Client executePathOnly only supports http. It would be nice if there is a option to enable REST API client to connect through SSL. org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which protocol can be used, we can either set flag in Cluster class or introduce a parameter in Client class to toggle SSL. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11890) HBase REST Client is hard coded to http protocol
[ https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146180#comment-14146180 ] Qiang Tian commented on HBASE-11890: Hi [~eyang] [~apurtell], sorry for the late response. Attached the master patch (tested that it applies directly to 0.98) and a doc update. Tested manually (small change, mostly related to env configuration). HBase REST Client is hard coded to http protocol Key: HBASE-11890 URL: https://issues.apache.org/jira/browse/HBASE-11890 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.96.2 Reporter: Eric Yang Assignee: Qiang Tian Attachments: HBase-11890-doc.patch, HBase-11890-master.patch HBase REST Client executePathOnly only supports http. It would be nice if there is a option to enable REST API client to connect through SSL. org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which protocol can be used, we can either set flag in Cluster class or introduce a parameter in Client class to toggle SSL. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
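For context, a sketch of how a caller might toggle SSL once such a flag exists; the boolean constructor parameter, host, and port here are assumptions based on this issue's proposal, so check the committed Client API for the exact signature:
{code}
import org.apache.hadoop.hbase.rest.client.Client;
import org.apache.hadoop.hbase.rest.client.Cluster;
import org.apache.hadoop.hbase.rest.client.Response;

public class RestSslExample {
  public static void main(String[] args) throws Exception {
    Cluster cluster = new Cluster();
    cluster.add("rest.example.com", 8443); // hypothetical REST gateway behind https

    // assumption: a flag on the Client selects https instead of the
    // hard-coded http; verify against the committed constructor
    Client client = new Client(cluster, true);
    Response response = client.get("/version");
    System.out.println(response.getCode());
  }
}
{code}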
[jira] [Updated] (HBASE-11890) HBase REST Client is hard coded to http protocol
[ https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11890: --- Status: Open (was: Patch Available) HBase REST Client is hard coded to http protocol Key: HBASE-11890 URL: https://issues.apache.org/jira/browse/HBASE-11890 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.96.2 Reporter: Eric Yang Assignee: Qiang Tian Attachments: HBase-11890-doc.patch, HBase-11890-master.patch HBase REST Client executePathOnly only supports http. It would be nice if there is a option to enable REST API client to connect through SSL. org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which protocol can be used, we can either set flag in Cluster class or introduce a parameter in Client class to toggle SSL. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11890) HBase REST Client is hard coded to http protocol
[ https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11890: --- Status: Patch Available (was: Open) HBase REST Client is hard coded to http protocol Key: HBASE-11890 URL: https://issues.apache.org/jira/browse/HBASE-11890 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.96.2 Reporter: Eric Yang Assignee: Qiang Tian Attachments: HBase-11890-doc.patch, HBase-11890-master-v1.patch, HBase-11890-master.patch HBase REST Client executePathOnly only supports http. It would be nice if there is a option to enable REST API client to connect through SSL. org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which protocol can be used, we can either set flag in Cluster class or introduce a parameter in Client class to toggle SSL. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HBASE-11890) HBase REST Client is hard coded to http protocol
[ https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11890: --- Attachment: HBase-11890-master-v1.patch add the javadoc HBase REST Client is hard coded to http protocol Key: HBASE-11890 URL: https://issues.apache.org/jira/browse/HBASE-11890 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.96.2 Reporter: Eric Yang Assignee: Qiang Tian Attachments: HBase-11890-doc.patch, HBase-11890-master-v1.patch, HBase-11890-master.patch HBase REST Client executePathOnly only supports http. It would be nice if there is a option to enable REST API client to connect through SSL. org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which protocol can be used, we can either set flag in Cluster class or introduce a parameter in Client class to toggle SSL. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HBASE-11890) HBase REST Client is hard coded to http protocol
[ https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian reassigned HBASE-11890: -- Assignee: Qiang Tian HBase REST Client is hard coded to http protocol Key: HBASE-11890 URL: https://issues.apache.org/jira/browse/HBASE-11890 Project: HBase Issue Type: Bug Components: Client Affects Versions: 0.96.2 Reporter: Eric Yang Assignee: Qiang Tian HBase REST Client executePathOnly only supports http. It would be nice if there is a option to enable REST API client to connect through SSL. org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which protocol can be used, we can either set flag in Cluster class or introduce a parameter in Client class to toggle SSL. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HBASE-11812) duplicate code in SimpleRpcScheduler#SimpleRpcScheduler
[ https://issues.apache.org/jira/browse/HBASE-11812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111696#comment-14111696 ] Qiang Tian commented on HBASE-11812: hi [~stack], no, thanks!:-) duplicate code in SimpleRpcScheduler#SimpleRpcScheduler --- Key: HBASE-11812 URL: https://issues.apache.org/jira/browse/HBASE-11812 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.5 Reporter: Qiang Tian Priority: Minor Attachments: HBASE-11812-0.98.patch in 0.98.5 SimpleRpcScheduler#SimpleRpcScheduler:
{code}
int maxQueueLength = conf.getInt(CALL_QUEUE_MAX_LENGTH_CONF_KEY,
    conf.getInt("ipc.server.max.callqueue.length",
        handlerCount * RpcServer.DEFAULT_MAX_CALLQUEUE_LENGTH_PER_HANDLER));
float callqReadShare = conf.getFloat(CALL_QUEUE_READ_SHARE_CONF_KEY,
    conf.getFloat("ipc.server.callqueue.read.share", 0));
float callQueuesHandlersFactor = conf.getFloat(CALL_QUEUE_HANDLER_FACTOR_CONF_KEY,
    conf.getFloat("ipc.server.callqueue.handler.factor", 0));
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108382#comment-14108382 ] Qiang Tian commented on HBASE-11813: oops.. it already points to line 210 (got fever, brain is not so clear). Thanks Stack CellScanner#advance may infinitely recurse -- Key: HBASE-11813 URL: https://issues.apache.org/jira/browse/HBASE-11813 Project: HBase Issue Type: Bug Reporter: Andrew Purtell Assignee: stack Priority: Blocker Fix For: 0.99.0, 2.0.0, 0.98.6 Attachments: 11813.098.txt, 11813.098.txt, 11813.master.txt On user@hbase, johannes.schab...@visual-meta.com reported: {quote} we face a serious issue with our HBase production cluster for two days now. Every couple minutes, a random RegionServer gets stuck and does not process any requests. In addition this causes the other RegionServers to freeze within a minute which brings down the entire cluster. Stopping the affected RegionServer unblocks the cluster and everything comes back to normal. {quote} Subsequent troubleshooting reveals that RPC is getting stuck because we are losing RPC handlers. In the .out files we have this: {noformat} Exception in thread defaultRpcServer.handler=5,queue=2,port=60020 java.lang.StackOverflowError at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) [...] Exception in thread defaultRpcServer.handler=5,queue=2,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=18,queue=0,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=23,queue=2,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=24,queue=0,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=2,queue=2,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=11,queue=2,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=25,queue=1,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=20,queue=2,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=19,queue=1,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=15,queue=0,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=1,queue=1,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=7,queue=1,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=4,queue=1,port=60020 java.lang.StackOverflowError {noformat} That is the anonymous CellScanner instance we create from CellUtil#createCellScanner: {code} return new CellScanner() { private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator(); private CellScanner cellScanner = null; @Override public Cell current() { return this.cellScanner != null ? this.cellScanner.current() : null; } @Override public boolean advance() throws IOException { if (this.cellScanner == null) { if (!this.iterator.hasNext()) return false; this.cellScanner = this.iterator.next().cellScanner(); } if (this.cellScanner.advance()) return true; this.cellScanner = null; return advance(); } }; {code} That final return statement is the immediate problem.
We should also fix this so the RegionServer aborts if it loses a handler to an Error. -- This message was sent by Atlassian JIRA (v6.2#6252)
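The straightforward way to remove that recursion is to loop instead of making the tail call, which is the direction the eventual fix takes. The sketch below rewrites the anonymous class as a named one for readability; the class name is invented, and the logic mirrors the snippet quoted above:
{code}
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellScannable;
import org.apache.hadoop.hbase.CellScanner;

// Iterative rewrite: "return advance()" becomes a loop, so stack depth stays
// constant no matter how many empty or exhausted sub-scanners arrive.
final class FlatteningCellScanner implements CellScanner {
  private final Iterator<? extends CellScannable> iterator;
  private CellScanner cellScanner = null;

  FlatteningCellScanner(Iterable<? extends CellScannable> cellScannerables) {
    this.iterator = cellScannerables.iterator();
  }

  @Override
  public Cell current() {
    return this.cellScanner != null ? this.cellScanner.current() : null;
  }

  @Override
  public boolean advance() throws IOException {
    while (true) {
      if (this.cellScanner == null) {
        if (!this.iterator.hasNext()) return false;
        this.cellScanner = this.iterator.next().cellScanner();
      }
      if (this.cellScanner.advance()) return true;
      this.cellScanner = null; // exhausted; loop on to the next scannable
    }
  }
}
{code}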
[jira] [Created] (HBASE-11812) duplicate code in SimpleRpcScheduler#SimpleRpcScheduler
Qiang Tian created HBASE-11812: -- Summary: duplicate code in SimpleRpcScheduler#SimpleRpcScheduler Key: HBASE-11812 URL: https://issues.apache.org/jira/browse/HBASE-11812 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.5 Reporter: Qiang Tian Priority: Minor in 0.98.5 SimpleRpcScheduler#SimpleRpcScheduler: {code} int maxQueueLength = conf.getInt(CALL_QUEUE_MAX_LENGTH_CONF_KEY, conf.getInt("ipc.server.max.callqueue.length", handlerCount * RpcServer.DEFAULT_MAX_CALLQUEUE_LENGTH_PER_HANDLER)); float callqReadShare = conf.getFloat(CALL_QUEUE_READ_SHARE_CONF_KEY, conf.getFloat("ipc.server.callqueue.read.share", 0)); float callQueuesHandlersFactor = conf.getFloat(CALL_QUEUE_HANDLER_FACTOR_CONF_KEY, conf.getFloat("ipc.server.callqueue.handler.factor", 0)); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11812) duplicate code in SimpleRpcScheduler#SimpleRpcScheduler
[ https://issues.apache.org/jira/browse/HBASE-11812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11812: --- Attachment: HBASE-11812-0.98.patch master branch is fine. duplicate code in SimpleRpcScheduler#SimpleRpcScheduler --- Key: HBASE-11812 URL: https://issues.apache.org/jira/browse/HBASE-11812 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.5 Reporter: Qiang Tian Priority: Minor Attachments: HBASE-11812-0.98.patch in 0.98.5 SimpleRpcScheduler#SimpleRpcScheduler: {code} int maxQueueLength = conf.getInt(CALL_QUEUE_MAX_LENGTH_CONF_KEY, conf.getInt("ipc.server.max.callqueue.length", handlerCount * RpcServer.DEFAULT_MAX_CALLQUEUE_LENGTH_PER_HANDLER)); float callqReadShare = conf.getFloat(CALL_QUEUE_READ_SHARE_CONF_KEY, conf.getFloat("ipc.server.callqueue.read.share", 0)); float callQueuesHandlersFactor = conf.getFloat(CALL_QUEUE_HANDLER_FACTOR_CONF_KEY, conf.getFloat("ipc.server.callqueue.handler.factor", 0)); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse
[ https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108251#comment-14108251 ] Qiang Tian commented on HBASE-11813: I'd suspect this one: {code} /** * Flatten the map of cells out under the CellScanner * @param map Map of Cell Lists; for example, the map of families to Cells that is used * inside Put, etc., keeping Cells organized by family. * @return CellScanner interface over <code>cellIterable</code> */ public static CellScanner createCellScanner(final NavigableMap<byte[], List<Cell>> map) { return new CellScanner() { private final Iterator<Entry<byte[], List<Cell>>> entries = map.entrySet().iterator(); private Iterator<Cell> currentIterator = null; private Cell currentCell; @Override public Cell current() { return this.currentCell; } @Override public boolean advance() { if (this.currentIterator == null) { if (!this.entries.hasNext()) return false; this.currentIterator = this.entries.next().getValue().iterator(); } if (this.currentIterator.hasNext()) { this.currentCell = this.currentIterator.next(); return true; } this.currentCell = null; this.currentIterator = null; return advance(); } }; } {code} It looks like the one Andrew mentioned would not trigger the advance method on the server side, while this one is widely used in server-side code paths (coprocessor or endpoint related). CellScanner#advance may infinitely recurse -- Key: HBASE-11813 URL: https://issues.apache.org/jira/browse/HBASE-11813 Project: HBase Issue Type: Bug Reporter: Andrew Purtell Priority: Blocker Fix For: 0.99.0, 2.0.0, 0.98.6 On user@hbase, johannes.schab...@visual-meta.com reported: {quote} we face a serious issue with our HBase production cluster for two days now. Every couple minutes, a random RegionServer gets stuck and does not process any requests. In addition this causes the other RegionServers to freeze within a minute which brings down the entire cluster. Stopping the affected RegionServer unblocks the cluster and everything comes back to normal. {quote} Subsequent troubleshooting reveals that RPC is getting stuck because we are losing RPC handlers. In the .out files we have this: {noformat} Exception in thread defaultRpcServer.handler=5,queue=2,port=60020 java.lang.StackOverflowError at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210) [...]
Exception in thread defaultRpcServer.handler=5,queue=2,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=18,queue=0,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=23,queue=2,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=24,queue=0,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=2,queue=2,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=11,queue=2,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=25,queue=1,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=20,queue=2,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=19,queue=1,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=15,queue=0,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=1,queue=1,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=7,queue=1,port=60020 java.lang.StackOverflowError Exception in thread defaultRpcServer.handler=4,queue=1,port=60020 java.lang.StackOverflowError {noformat} That is the anonymous CellScanner instance we create from CellUtil#createCellScanner: {code} return new CellScanner() { private final Iterator<? extends CellScannable> iterator = cellScannerables.iterator(); private CellScanner cellScanner = null; @Override public Cell current() { return this.cellScanner != null ? this.cellScanner.current() : null; } @Override public boolean advance() throws IOException { if (this.cellScanner == null) { if (!this.iterator.hasNext()) return false; this.cellScanner = this.iterator.next().cellScanner(); } if (this.cellScanner.advance()) return true;
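To make the trigger in the map-based createCellScanner quoted above concrete: every empty per-family list costs one stack frame in the recursive advance(), so a payload carrying many empty lists can overflow the stack. A hypothetical repro sketch follows; the sizes and family names are illustrative, not from the reported workload:
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellScanner;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.util.Bytes;

public class AdvanceRecursionRepro {
  public static void main(String[] args) throws Exception {
    NavigableMap<byte[], List<Cell>> map =
        new TreeMap<byte[], List<Cell>>(Bytes.BYTES_COMPARATOR);
    for (int i = 0; i < 1000000; i++) {
      map.put(Bytes.toBytes("f" + i), new ArrayList<Cell>()); // empty cell list
    }
    CellScanner scanner = CellUtil.createCellScanner(map);
    scanner.advance(); // expected to blow the stack on unfixed code
  }
}
{code}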
[jira] [Commented] (HBASE-11812) duplicate code in SimpleRpcScheduler#SimpleRpcScheduler
[ https://issues.apache.org/jira/browse/HBASE-11812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108253#comment-14108253 ] Qiang Tian commented on HBASE-11812: oops, thanks Matteo! duplicate code in SimpleRpcScheduler#SimpleRpcScheduler --- Key: HBASE-11812 URL: https://issues.apache.org/jira/browse/HBASE-11812 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.5 Reporter: Qiang Tian Priority: Minor Attachments: HBASE-11812-0.98.patch in 0.98.5 SimpleRpcScheduler#SimpleRpcScheduler: {code} int maxQueueLength = conf.getInt(CALL_QUEUE_MAX_LENGTH_CONF_KEY, conf.getInt("ipc.server.max.callqueue.length", handlerCount * RpcServer.DEFAULT_MAX_CALLQUEUE_LENGTH_PER_HANDLER)); float callqReadShare = conf.getFloat(CALL_QUEUE_READ_SHARE_CONF_KEY, conf.getFloat("ipc.server.callqueue.read.share", 0)); float callQueuesHandlersFactor = conf.getFloat(CALL_QUEUE_HANDLER_FACTOR_CONF_KEY, conf.getFloat("ipc.server.callqueue.handler.factor", 0)); {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11673) TestIOFencing#testFencingAroundCompactionAfterWALSync fails
[ https://issues.apache.org/jira/browse/HBASE-11673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095066#comment-14095066 ] Qiang Tian commented on HBASE-11673: Hi [~mantonov], I just ran it and it passed. Thanks. TestIOFencing#testFencingAroundCompactionAfterWALSync fails --- Key: HBASE-11673 URL: https://issues.apache.org/jira/browse/HBASE-11673 Project: HBase Issue Type: Test Reporter: Qiang Tian Assignee: Sergey Soldatov Fix For: 2.0.0 Attachments: HBASE_11673-v1.patch, testFencingAroundCompactionAfterWALSync.tar.gz got several test failures on the latest build: {quote} [tianq@bdvm101 surefire-reports]$ ls -1t|grep "Tests run" * |grep FAILURE org.apache.hadoop.hbase.client.TestReplicasClient.txt:Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 38.706 sec FAILURE! org.apache.hadoop.hbase.master.TestMasterOperationsForRegionReplicas.txt:Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 30.669 sec FAILURE! org.apache.hadoop.hbase.regionserver.TestRegionReplicas.txt:Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 39.113 sec FAILURE! org.apache.hadoop.hbase.TestIOFencing.txt:Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 177.071 sec FAILURE! {quote} the first one: {quote} failure message="Timed out waiting for the region to flush" type="java.lang.AssertionError" java.lang.AssertionError: Timed out waiting for the region to flush at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.assertTrue(Assert.java:41) at org.apache.hadoop.hbase.TestIOFencing.doTest(TestIOFencing.java:291) at org.apache.hadoop.hbase.TestIOFencing.testFencingAroundCompactionAfterWALSync(TestIOFencing.java:236) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) {quote} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
[ https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092574#comment-14092574 ] Qiang Tian commented on HBASE-11714: Thanks [~nkeywal]! Good to know. :-) I did not find it... perhaps pasting the error info helps people google it? RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly - Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.3 Reporter: Qiang Tian Assignee: Qiang Tian Attachments: hbase-11714-0.98.patch Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) {quote} Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: {quote} client log: {quote} 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. {quote} analysis: there are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket. When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally exposes the problem in HBASE-11705 -- This message was sent by Atlassian JIRA (v6.2#6252)
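To spell out the arithmetic in the analysis: with callTimeout left at zero, the "remaining time" computed before each call is negative and gets clamped up to the 2-second floor, and that floor becomes the socket timeout. A self-contained sketch follows; the method bodies paraphrase the issue text and are not the 0.98 source:
{code}
public class TimeoutClampSketch {
  static final int MIN_RPC_TIMEOUT = 2000; // the 2-second floor from the issue

  static int effectiveRpcTimeout(long callTimeoutMillis, long elapsedMillis) {
    long remaining = callTimeoutMillis - elapsedMillis;
    return (int) Math.max(remaining, MIN_RPC_TIMEOUT);
  }

  public static void main(String[] args) {
    // callWithoutRetries path: no timeout was ever set, so callTimeout == 0.
    System.out.println(effectiveRpcTimeout(0, 5));     // 2000 -> 2s socket timeout
    // callWithRetries path with a real budget behaves as expected.
    System.out.println(effectiveRpcTimeout(60000, 5)); // 59995
  }
}
{code}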
[jira] [Resolved] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
[ https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian resolved HBASE-11714. Resolution: Duplicate Fix Version/s: 0.98.4 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly - Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.3 Reporter: Qiang Tian Assignee: Qiang Tian Fix For: 0.98.4 Attachments: hbase-11714-0.98.patch Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) {quote} Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: {quote} client log: {quote} 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. {quote} analysis: there are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket. When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally exposes the problem in HBASE-11705 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
[ https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092410#comment-14092410 ] Qiang Tian commented on HBASE-11714: ping [~apurtell], thanks! RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly - Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.3 Reporter: Qiang Tian Assignee: Qiang Tian Attachments: hbase-11714-0.98.patch Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. There are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket. When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally exposes the problem in HBASE-11705 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
[ https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092416#comment-14092416 ] Qiang Tian commented on HBASE-11714: Search hit: http://stackoverflow.com/questions/25038030/sockettimeoutexception-emerges-when-insert-millions-of-records-to-hbase I'd think the one below is also a hit: https://issues.apache.org/jira/browse/PHOENIX-998 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly - Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.3 Reporter: Qiang Tian Assignee: Qiang Tian Attachments: hbase-11714-0.98.patch Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. There are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket. When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally exposes the problem in HBASE-11705 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
[ https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11714: --- Description: Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) {quote} Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: {quote} client log: {quote} 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. {quote} analysis: there are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket. When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally exposes the problem in HBASE-11705 was: Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. There are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket.
When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally exposes the problem in HBASE-11705 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly - Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.3 Reporter: Qiang Tian Assignee: Qiang Tian Attachments: hbase-11714-0.98.patch Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) {quote} Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: {quote} client log: {quote} 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. {quote} analysis: there are 2 methods in RpcRetryingCaller:
[jira] [Created] (HBASE-11714) batch mutationRpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
Qiang Tian created HBASE-11714: -- Summary: batch mutationRpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Reporter: Qiang Tian -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
[ https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11714: --- Summary: RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly (was: batch mutationRpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly - Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Reporter: Qiang Tian -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
[ https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11714: --- Description: Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. There are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket. When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally causes the problem in HBASE-11705 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly - Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Reporter: Qiang Tian Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. There are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket.
When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally causes the problem in HBASE-11705 -- This message was sent by Atlassian JIRA (v6.2#6252)
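One plausible shape for the fix sketched in the attached 0.98 patch is to let the no-retry path take an explicit timeout and apply it as-is, instead of reusing beforeCall's retry bookkeeping that bottoms out at the 2s floor. The following is a hypothetical sketch of that direction, not the committed diff; the helper names are stand-ins for the RpcClient.setRpcTimeout thread-local mentioned in the analysis:
{code}
public class CallWithoutRetriesSketch {
  interface RetryingCallable<T> {
    T call() throws Exception;
  }

  private static final ThreadLocal<Integer> RPC_TIMEOUT = new ThreadLocal<Integer>();

  // Stand-ins for the thread-local rpc timeout described in the analysis.
  static void setThreadLocalRpcTimeout(int millis) { RPC_TIMEOUT.set(millis); }
  static void clearThreadLocalRpcTimeout() { RPC_TIMEOUT.remove(); }

  static <T> T callWithoutRetries(RetryingCallable<T> callable, int callTimeoutMillis)
      throws Exception {
    setThreadLocalRpcTimeout(callTimeoutMillis); // caller-supplied, no 2s floor
    try {
      return callable.call();
    } finally {
      clearThreadLocalRpcTimeout();
    }
  }
}
{code}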
[jira] [Assigned] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
[ https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian reassigned HBASE-11714: -- Assignee: Qiang Tian RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly - Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Reporter: Qiang Tian Assignee: Qiang Tian Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. There are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket. When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally causes the problem in HBASE-11705 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
[ https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11714: --- Affects Version/s: 0.98.3 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly - Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.3 Reporter: Qiang Tian Assignee: Qiang Tian Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. There are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket. When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally causes the problem in HBASE-11705 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
[ https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11714: --- Component/s: IPC/RPC RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly - Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.3 Reporter: Qiang Tian Assignee: Qiang Tian Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. There are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket. When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally causes the problem in HBASE-11705 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
[ https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091707#comment-14091707 ] Qiang Tian commented on HBASE-11714: clicked the wrong button before filling in the info... just added more info. RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly - Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.3 Reporter: Qiang Tian Assignee: Qiang Tian Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. There are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket. When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally causes the problem in HBASE-11705 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
[ https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11714: --- Attachment: hbase-11714-0.98.patch RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly - Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.3 Reporter: Qiang Tian Assignee: Qiang Tian Attachments: hbase-11714-0.98.patch Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. There are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket. When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally causes the problem in HBASE-11705 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
[ https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qiang Tian updated HBASE-11714: --- Description: Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. There are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket. When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally exposes the problem in HBASE-11705 was: Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. There are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while callWithoutRetries is wrong (the multi RPC in this user's case): the caller cannot specify a valid timeout, but callWithoutRetries still calls beforeCall, which looks like a method for callWithRetries only, to set the timeout. Since RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final pingInterval set on the socket.
When there is a heavy write workload and the RPC cannot complete in 2s, the client closes the connection, so the server-side connection is reset, which finally causes the problem in HBASE-11705 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly - Key: HBASE-11714 URL: https://issues.apache.org/jira/browse/HBASE-11714 Project: HBase Issue Type: Bug Components: IPC/RPC Affects Versions: 0.98.3 Reporter: Qiang Tian Assignee: Qiang Tian Attachments: hbase-11714-0.98.patch Discussed on the user@hbase mailing list (http://markmail.org/thread/w3cqjxwo2smkn2jw) Recently switched from 0.94 to 0.98, and finding that periodically things are having issues - lots of retry exceptions: 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158, table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops, last exception: java.net.SocketTimeoutException: Call to ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed because java.net.SocketTimeoutException: 2000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops. There are 2 methods in RpcRetryingCaller: callWithRetries and callWithoutRetries. It looks like the timeout setup of callWithRetries is good, while