[jira] [Commented] (HBASE-12742) ClusterStatusPublisher crashes with an IPv6 network interface.

2014-12-28 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259874#comment-14259874
 ] 

Qiang Tian commented on HBASE-12742:


ah...an IPv4 address..
thanks [~stack] for telling me the progress!

 ClusterStatusPublisher crashes with an IPv6 network interface.
 -

 Key: HBASE-12742
 URL: https://issues.apache.org/jira/browse/HBASE-12742
 Project: HBase
  Issue Type: Bug
Reporter: Jurriaan Mous
Assignee: Jurriaan Mous
 Fix For: 1.0.0, 2.0.0, 1.1.0

 Attachments: HBASE-12742-v1.patch, HBASE-12742.patch


 On my dev machine the first network interface is an IPv6 tunnel. HBase works 
 internally with IPv4 addresses. Addressing selects this first tunnel to use. 
 This causes the ClusterStatusPublisher's DatagramChannel group join to 
 crash. 
 Stack trace:
 {code}
 java.io.IOException: Shutting down
   at 
 org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:231)
   at 
 org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:93)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:976)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:936)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:810)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:792)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:779)
   at 
 org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:140)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
   at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
   at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
   at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
   at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
   at 
 com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
   at 
 com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:211)
   at 
 com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
 Caused by: java.lang.RuntimeException: Failed construction of Master: class 
 org.apache.hadoop.hbase.master.HMaster
   at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:143)
   at 
 org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
   at 
 org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:153)
   at 
 org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:213)
   ... 26 more
 Caused by: java.io.IOException: Network interface not configured for IPv4
   at 
 sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:860)
   at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:885)
   at 
 io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:409)
   at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:286)
   at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher.<init>(ClusterStatusPublisher.java:129)
   at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:379)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
   at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:139)
   ... 29 more
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12657) The Region is not being split and far exceeds the desired maximum size.

2014-12-11 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243652#comment-14243652
 ] 

Qiang Tian commented on HBASE-12657:


Hi [~lhofhansl],
sorry, I do not quite understand that part. we expect to see that filesCompacting is 
empty in getCurrentEligibleFiles, right?
regarding HStore#filesCompacting: it looks like we always remove the files in a 
CompactionRequest via finishCompactionRequest, so I am not sure why we need to 
maintain it..
 


 The Region is not being split and far exceeds the desired maximum size.
 ---

 Key: HBASE-12657
 URL: https://issues.apache.org/jira/browse/HBASE-12657
 Project: HBase
  Issue Type: Bug
  Components: Compaction
Affects Versions: 0.94.25
Reporter: Vladimir Rodionov
Assignee: Vladimir Rodionov
 Fix For: 0.94.26

 Attachments: HBASE-12657-0.94.patch, HBASE-12657-0.94.patch.2


 We are seeing this behavior when creating indexes in one of our environment.
 When an index is being created, most of the requests go into a single 
 region.  The amount of time to create an index seems to take longer than 
 usual and it can take days for the regions to compact and split after the 
 index is created.
 Here is a du of the HBase index table:
 {code}
 -bash-4.1$ sudo -su hdfs hadoop fs -du /hbase/43681
 705          /hbase/43681/.tableinfo.01
 0            /hbase/43681/.tmp
 27981697293  /hbase/43681/0492e22092e21d35fca8e779b21ec797
 539687093    /hbase/43681/832298c4e975fc47210feb6bac3d2f71
 560660531    /hbase/43681/be9bdb3bdf9365afe5fe90db4247d82c
 7081938297   /hbase/43681/cd440e524f96fbe0719b2fe969848560
 6297860287   /hbase/43681/dc893a2d8daa08c689dc69e6bb2c5b50
 7189607722   /hbase/43681/ffbceaea5e2f142dbe6cd4cbeacc00e8
 ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12657) The Region is not being split and far exceeds the desired maximum size.

2014-12-09 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14240783#comment-14240783
 ] 

Qiang Tian commented on HBASE-12657:


RatioBasedCompactionPolicy.getCurrentEligibleFiles does look suspicious (still 
not sure when it is triggered..)



 The Region is not being split and far exceeds the desired maximum size.
 ---

 Key: HBASE-12657
 URL: https://issues.apache.org/jira/browse/HBASE-12657
 Project: HBase
  Issue Type: Bug
  Components: Compaction
Affects Versions: 0.94.25
Reporter: Vladimir Rodionov
Assignee: Vladimir Rodionov
 Fix For: 0.94.26

 Attachments: HBASE-12657-0.94.patch


 We are seeing this behavior when creating indexes in one of our environment.
 When an index is being created, most of the requests go into a single 
 region.  The amount of time to create an index seems to take longer than 
 usual and it can take days for the regions to compact and split after the 
 index is created.
 Here is a du of the HBase index table:
 {code}
 -bash-4.1$ sudo -su hdfs hadoop fs -du /hbase/43681
 705          /hbase/43681/.tableinfo.01
 0            /hbase/43681/.tmp
 27981697293  /hbase/43681/0492e22092e21d35fca8e779b21ec797
 539687093    /hbase/43681/832298c4e975fc47210feb6bac3d2f71
 560660531    /hbase/43681/be9bdb3bdf9365afe5fe90db4247d82c
 7081938297   /hbase/43681/cd440e524f96fbe0719b2fe969848560
 6297860287   /hbase/43681/dc893a2d8daa08c689dc69e6bb2c5b50
 7189607722   /hbase/43681/ffbceaea5e2f142dbe6cd4cbeacc00e8
 ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting

2014-11-27 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11902:
---
Status: Open  (was: Patch Available)

 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
Assignee: Qiang Tian
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 hbase11902-master.patch, hbase11902-master_v2.patch, jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealth() returns false. 
 But it sometimes gets blocked while aborting. I saved the jstack and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very frequently, but when it happens, it always leads 
 to a huge amount of request failures. The only thing to do is KILL -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting

2014-11-27 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11902:
---
Attachment: hbase11902-master_v3.patch

Local run passed.
Based on HBASE-8208 (that is why wal.sync was introduced in the memstore flush), we 
also need to fail the flush when hitting an HDFS failure.



 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
Assignee: Qiang Tian
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 hbase11902-master.patch, hbase11902-master_v2.patch, 
 hbase11902-master_v3.patch, jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealth() returns false. 
 But it sometimes gets blocked while aborting. I saved the jstack and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very frequently, but when it happens, it always leads 
 to a huge amount of request failures. The only thing to do is KILL -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting

2014-11-27 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11902:
---
Status: Patch Available  (was: Open)

 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
Assignee: Qiang Tian
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 hbase11902-master.patch, hbase11902-master_v2.patch, 
 hbase11902-master_v3.patch, jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealth() returns false. 
 But it sometimes gets blocked while aborting. I saved the jstack and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very frequently, but when it happens, it always leads 
 to a huge amount of request failures. The only thing to do is KILL -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12558) TestHCM.testClusterStatus Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError>

2014-11-27 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228022#comment-14228022
 ] 

Qiang Tian commented on HBASE-12558:


ah..it has been there for quite a bit of time [~stack]?  interesting...why can't I 
hit it? is there any pattern in those hits you got? e.g. is there 
any particular machine or OS? 

 TestHCM.testClusterStatus Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
 -

 Key: HBASE-12558
 URL: https://issues.apache.org/jira/browse/HBASE-12558
 Project: HBase
  Issue Type: Bug
  Components: test
Reporter: stack
Assignee: stack
Priority: Critical
 Fix For: 2.0.0, 0.99.2

 Attachments: 12558-master.patch, 12558.ignore.txt


 Happens for me reliably on Mac OS X. I looked at fixing it. The listener is 
 not noticing the publish for whatever reason.  That's where I stopped.
 {code}
 java.lang.Exception: Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
   at junit.framework.Assert.fail(Assert.java:57)
   at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
   at 
 org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-11-26 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14225962#comment-14225962
 ] 

Qiang Tian commented on HBASE-11902:


TestLogRolling creates an error scenario similar to this case. 

the testcase failure is because of the code below:
{code}
  // verify the written rows are there
  assertTrue(loggedRows.contains("row1002"));
  assertTrue(loggedRows.contains("row1003"));
  assertTrue(loggedRows.contains("row1004"));
  assertTrue(loggedRows.contains("row1005"));
  // flush all regions
  List<HRegion> regions = new ArrayList<HRegion>(server.getOnlineRegionsLocalContext());
  for (HRegion r: regions) {
    r.flushcache();    // <=== the re-thrown exception will end the testcase
  }
{code}

adding a try/catch around the flushcache call makes it pass.
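
For reference, a minimal sketch (hypothetical helper names, not the actual 
TestLogRolling diff) of that guard, treating the IOException as the expected 
outcome since the test has deliberately broken the WAL/HDFS layer:
{code}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.regionserver.HRegion;

class FlushGuardSketch {
  // Flush every online region, but swallow the IOException that the broken
  // WAL layer is expected to re-throw, so the testcase can keep running.
  static void flushAllIgnoringExpectedFailure(List<HRegion> regions) {
    for (HRegion r : regions) {
      try {
        r.flushcache();
      } catch (IOException expected) {
        // the re-thrown WAL sync failure lands here instead of ending the test
      }
    }
  }
}
{code}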


 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
Assignee: Qiang Tian
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 hbase11902-master.patch, jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealth() returns false. 
 But it sometimes gets blocked while aborting. I saved the jstack and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very frequently, but when it happens, it always leads 
 to a huge amount of request failures. The only thing to do is KILL -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting

2014-11-26 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11902:
---
Status: Open  (was: Patch Available)

 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
Assignee: Qiang Tian
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 hbase11902-master.patch, jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealth() returns false. 
 But it sometimes gets blocked while aborting. I saved the jstack and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very frequently, but when it happens, it always leads 
 to a huge amount of request failures. The only thing to do is KILL -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting

2014-11-26 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11902:
---
Attachment: hbase11902-master_v2.patch

There are many differences in the WAL between master and 0.98, such as the thread 
model and error handling. The RS is aborted in case of HDFS failure in 0.98, but the 
master branch ignores it and the write/sync thread continues.. so the patch can be 
made simpler.




 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
Assignee: Qiang Tian
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 hbase11902-master.patch, hbase11902-master_v2.patch, jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealth() returns false. 
 But it sometimes gets blocked while aborting. I saved the jstack and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very frequently, but when it happens, it always leads 
 to a huge amount of request failures. The only thing to do is KILL -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting

2014-11-26 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11902:
---
Status: Patch Available  (was: Open)

 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
Assignee: Qiang Tian
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 hbase11902-master.patch, hbase11902-master_v2.patch, jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealth() returns false. 
 But it sometimes gets blocked while aborting. I saved the jstack and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very frequently, but when it happens, it always leads 
 to a huge amount of request failures. The only thing to do is KILL -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-11-26 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227322#comment-14227322
 ] 

Qiang Tian commented on HBASE-11902:


ok. the latest failure is because, in the testcase, only the WAL write fails; if we 
hide the exception (just decrement the counter) and continue, the data flush 
will succeed, so the completeCacheFlush call decrements it again!
to preserve the counter semantics, simple is the best: return right 
away (the original patch).
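
To spell out that double decrement, a toy illustration (hypothetical names, 
not HBase code):
{code}
// Hiding the WAL sync failure by decrementing the flush counter means the
// still-succeeding data flush decrements it a second time in completeCacheFlush.
class CounterSketch {
  private int outstandingFlushes = 1;  // startCacheFlush incremented it

  void hideSyncFailure()    { outstandingFlushes--; }  // the "hidden exception" path
  void completeCacheFlush() { outstandingFlushes--; }  // normal completion path

  public static void main(String[] args) {
    CounterSketch s = new CounterSketch();
    s.hideSyncFailure();       // 1 -> 0
    s.completeCacheFlush();    // 0 -> -1: counter semantics broken
    System.out.println(s.outstandingFlushes);  // prints -1
  }
}
{code}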

 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
Assignee: Qiang Tian
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 hbase11902-master.patch, hbase11902-master_v2.patch, jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealth() returns false. 
 But it sometimes gets blocked while aborting. I saved the jstack and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very frequently, but when it happens, it always leads 
 to a huge amount of request failures. The only thing to do is KILL -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12558) TestHCM.testClusterStatus Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError>

2014-11-26 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227328#comment-14227328
 ] 

Qiang Tian commented on HBASE-12558:


Thanks [~stack].
so the publisher does send but the listener does not get it. perhaps netty logging 
could help show what is going on. unfortunately I cannot reproduce the problem 
in my env...if everything was fine before HBASE-12359, maybe we need to revert 
the patch. sorry for that..

 TestHCM.testClusterStatus Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
 -

 Key: HBASE-12558
 URL: https://issues.apache.org/jira/browse/HBASE-12558
 Project: HBase
  Issue Type: Bug
  Components: test
Reporter: stack
Assignee: stack
Priority: Critical
 Fix For: 2.0.0, 0.99.2

 Attachments: 12558-master.patch, 12558.ignore.txt


 Happens for me reliably on Mac OS X. I looked at fixing it. The listener is 
 not noticing the publish for whatever reason.  That's where I stopped.
 {code}
 java.lang.Exception: Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
   at junit.framework.Assert.fail(Assert.java:57)
   at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
   at 
 org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12558) TestHCM.testClusterStatus Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError>

2014-11-25 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224301#comment-14224301
 ] 

Qiang Tian commented on HBASE-12558:


still looking..(and the UT failure in 11902)
thanks.

 TestHCM.testClusterStatus Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
 -

 Key: HBASE-12558
 URL: https://issues.apache.org/jira/browse/HBASE-12558
 Project: HBase
  Issue Type: Bug
  Components: test
Reporter: stack
Assignee: stack
Priority: Critical
 Fix For: 2.0.0, 0.99.2


 Happens for me reliably on Mac OS X. I looked at fixing it. The listener is 
 not noticing the publish for whatever reason.  That's where I stopped.
 {code}
 java.lang.Exception: Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
   at junit.framework.Assert.fail(Assert.java:57)
   at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
   at 
 org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-12558) TestHCM.testClusterStatus Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError>

2014-11-25 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-12558:
---
Attachment: 12558-master.patch

As you mentioned, in the good case the listener receives the notification and writes 
"There is a new dead server" in the log. 
2 possible cases for a bad run:
1) the publisher does not collect data (looks not possible, but just in case, log a 
debug message for it)
2) the publisher fails to send to the client..(things handled by netty; add the same 
change to the listener code..)

let's see what will happen..




 TestHCM.testClusterStatus Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
 -

 Key: HBASE-12558
 URL: https://issues.apache.org/jira/browse/HBASE-12558
 Project: HBase
  Issue Type: Bug
  Components: test
Reporter: stack
Assignee: stack
Priority: Critical
 Fix For: 2.0.0, 0.99.2

 Attachments: 12558-master.patch


 Happens for me reliably on Mac OS X. I looked at fixing it. The listener is 
 not noticing the publish for whatever reason.  That's where I stopped.
 {code}
 java.lang.Exception: Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
   at junit.framework.Assert.fail(Assert.java:57)
   at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
   at 
 org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-11-24 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223976#comment-14223976
 ] 

Qiang Tian commented on HBASE-11902:


from the stacktrace, DrainBarrier.stopAndDrainOps is waiting, so 
DrainBarrier#endOp does not notify it.

looking at class DrainBarrier, it is expected that beginOp and endOp are called 
in pairs. the initial value of {{valueAndFlags}} is 2, incremented by 2 in 
beginOp; decremented by 2 in endOp.

in stopAndDrainOps, if getValue(oldValAndFlags) == 1, it means oldValAndFlags == 2 
and all ops completed in pairs; otherwise, it needs to wait for the last endOp to 
notify it:

{code}
if (getValue(oldValAndFlags) == 1) return; // There were no operations outstanding.
synchronized (this) { this.wait(); }
{code}

so the problem could be that beginOp/endOp are not called in pairs; the hole looks 
to be here:

HRegion#internalFlushcache
{code}
// sync unflushed WAL changes when deferred log sync is enabled
// see HBASE-8208 for details
if (wal != null && !shouldSyncLog()) {
  wal.sync();
}
{code} 

at that point, wal.startCacheFlush->closeBarrier.beginOp has been called, but 
completeCacheFlush->closeBarrier.endOp() is not protected by a try block..so if 
the WAL/HDFS layer throws an exception, endOp will not be called.
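
A minimal sketch of the pairing invariant (assuming DrainBarrier's 
beginOp()/endOp() API as FSHLog uses it; the flush body is a placeholder):
{code}
import java.io.IOException;
import org.apache.hadoop.hbase.util.DrainBarrier;

class BarrierPairingSketch {
  // endOp must run even when the flush body throws, otherwise the waiter
  // in stopAndDrainOps() is never notified and blocks forever.
  static void flushWithBarrier(DrainBarrier closeBarrier) throws IOException {
    if (!closeBarrier.beginOp()) {
      return;  // barrier is already draining; do not start a new op
    }
    try {
      // ... flush body, which may throw on WAL/HDFS failure ...
    } finally {
      closeBarrier.endOp();  // guaranteed matching call for beginOp
    }
  }
}
{code}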

related info in log:

  
{quote}
2014-09-03 13:38:03,789 ERROR org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
Error while AsyncWriter write, request close of hlog
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
  at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
  at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
  at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,789 FATAL org.apache.hadoop.hbase.regionserver.wal.FSHLog: 
Error while AsyncSyncer sync, request close of hlog
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
  at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1127)
  at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:924)
  at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:486)
2014-09-03 13:38:03,799 ERROR 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flush failed for 
region page_content_queue,00166,1408946731655.8671b8a0f82565f88eb2ab8a5b53e84c. 
 //<=== MemStoreFlusher#flushRegion
java.io.IOException: All datanodes 10.246.2.103:50010 are bad. Aborting...
{quote}


the exception thrown to the caller should be here:
{code}
FSHLog#syncer:

if (txid <= this.failedTxid.get()) {
  assert asyncIOE != null :
    "current txid is among(under) failed txids, but asyncIOE is null!";
  throw asyncIOE;
}
{code}

the master branch can catch the hdfs exception, but it just ignores it, which 
looks incorrect:
{code}
  if (wal != null) {
    try {
      wal.sync(); // ensure that flush marker is sync'ed
    } catch (IOException ioe) {
      LOG.warn("Unexpected exception while wal.sync(), ignoring. Exception: "
          + StringUtils.stringifyException(ioe));
    }
  }
{code}

Personally I think the exception should not be ignored since it is a severe HDFS error.


 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealth() returns false. 
 But it sometimes gets blocked while aborting. I saved the jstack and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very frequently, but when it happens, it always leads 
 to a huge amount of request failures. The only thing to do is KILL -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HBASE-11902) RegionServer was blocked while aborting

2014-11-24 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian reassigned HBASE-11902:
--

Assignee: Qiang Tian

 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
Assignee: Qiang Tian
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealth() returns false. 
 But it sometimes gets blocked while aborting. I saved the jstack and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very frequently, but when it happens, it always leads 
 to a huge amount of request failures. The only thing to do is KILL -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11902) RegionServer was blocked while aborting

2014-11-24 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223989#comment-14223989
 ] 

Qiang Tian commented on HBASE-11902:


proposed fix for 0.98:

{code}
--- hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
+++ hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
@@ -1760,7 +1760,13 @@ public class HRegion implements HeapSize { // , Writable{
     // sync unflushed WAL changes when deferred log sync is enabled
     // see HBASE-8208 for details
     if (wal != null && !shouldSyncLog()) {
-      wal.sync();
+      try {
+        wal.sync();
+      } catch (IOException e) {
+        wal.abortCacheFlush(this.getRegionInfo().getEncodedNameAsBytes());
+        LOG.warn("Unexpected exception while wal.sync(), re-throw");
+        throw e;
+      }
     }
{code}

the master branch code writes an ABORT_FLUSH log entry before we call 
wal.abortCacheFlush. so is it also needed if wal.sync aborts?

also I am thinking about whether we could build an error injection test for this 
kind of failure, which could easily happen in a real env but would not happen in UT?


 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
Assignee: Qiang Tian
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealth() returns false. 
 But it sometimes gets blocked while aborting. I saved the jstack and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very frequently, but when it happens, it always leads 
 to a huge amount of request failures. The only thing to do is KILL -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting

2014-11-24 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11902:
---
Attachment: hbase11902-master.patch

bq. the master branch code writes an ABORT_FLUSH log entry before we call 
wal.abortCacheFlush. so is it also needed if wal.sync aborts?

the wal write will fail in this case since it is a severe HDFS failure..

 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
Assignee: Qiang Tian
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 hbase11902-master.patch, jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealth() returns false. 
 But it sometimes gets blocked while aborting. I saved the jstack and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very frequently, but when it happens, it always leads 
 to a huge amount of request failures. The only thing to do is KILL -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11902) RegionServer was blocked while aborting

2014-11-24 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11902:
---
Status: Patch Available  (was: Open)

 RegionServer was blocked while aborting
 ---

 Key: HBASE-11902
 URL: https://issues.apache.org/jira/browse/HBASE-11902
 Project: HBase
  Issue Type: Bug
  Components: regionserver, wal
Affects Versions: 0.98.4
 Environment: hbase-0.98.4, hadoop-2.3.0-cdh5.1, jdk1.7
Reporter: Victor Xu
Assignee: Qiang Tian
 Attachments: hbase-hadoop-regionserver-hadoop461.cm6.log, 
 hbase11902-master.patch, jstack_hadoop461.cm6.log


 Generally, the regionserver automatically aborts when isHealth() returns false. 
 But it sometimes gets blocked while aborting. I saved the jstack and logs, and 
 found out that it was caused by datanode failures. The regionserver60020 
 thread was blocked while closing the WAL. 
 This issue doesn't happen very frequently, but when it happens, it always leads 
 to a huge amount of request failures. The only thing to do is KILL -9.
 I think it's a bug, but I haven't found a decent solution. Does anyone have 
 the same problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12558) TestHCM.testClusterStatus Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError>

2014-11-23 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222589#comment-14222589
 ] 

Qiang Tian commented on HBASE-12558:


Hi Stack,
it did not fail after HBASE-12359 went in, but fails now?


 TestHCM.testClusterStatus Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
 -

 Key: HBASE-12558
 URL: https://issues.apache.org/jira/browse/HBASE-12558
 Project: HBase
  Issue Type: Bug
  Components: test
Reporter: stack
Assignee: stack
Priority: Critical
 Fix For: 2.0.0, 0.99.2


 Happens for me reliably on Mac OS X. I looked at fixing it. The listener is 
 not noticing the publish for whatever reason.  That's where I stopped.
 {code}
 java.lang.Exception: Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
   at junit.framework.Assert.fail(Assert.java:57)
   at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
   at 
 org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12558) TestHCM.testClusterStatus Unexpected exception, expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but was<junit.framework.AssertionFailedError>

2014-11-23 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222705#comment-14222705
 ] 

Qiang Tian commented on HBASE-12558:


interesting...:-)
a timing issue? perhaps 10s is not long enough?
I can take a look if it happens again..(log, stacktrace etc.)


 TestHCM.testClusterStatus Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
 -

 Key: HBASE-12558
 URL: https://issues.apache.org/jira/browse/HBASE-12558
 Project: HBase
  Issue Type: Bug
  Components: test
Reporter: stack
Assignee: stack
Priority: Critical
 Fix For: 2.0.0, 0.99.2


 Happens for me reliably on Mac OS X. I looked at fixing it. The listener is 
 not noticing the publish for whatever reason.  That's where I stopped.
 {code}
 java.lang.Exception: Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
   at junit.framework.Assert.fail(Assert.java:57)
   at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
   at 
 org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12451) IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster

2014-11-14 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212007#comment-14212007
 ] 

Qiang Tian commented on HBASE-12451:


Shaohui boss, this does not look like a small change? :-)
Even if the average region count is used, the code below will use the max file size in 
most cases - e.g. when tableRegionsCount = 4, the value is 16384M, which is 
bigger than DEFAULT_MAX_FILE_SIZE..
{code}
return tableRegionsCount == 0 || tableRegionsCount > 100 ? getDesiredMaxFileSize() :
    Math.min(getDesiredMaxFileSize(),
        this.initialSize * tableRegionsCount * tableRegionsCount * tableRegionsCount);
{code}
Personally I like KISS - a simple configuration parameter can resolve this case - 
if we have many complaints about the existing region split policy, that is another 
story...my 2 cents.
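
To make the saturation point concrete, a quick standalone check (assuming the 
default 128M flush size, so initialSize = 2 x 128M = 256M, and the 10G 
DEFAULT_MAX_FILE_SIZE; the names here are illustrative, not the policy code):
{code}
public class SplitSizeCheck {
  public static void main(String[] args) {
    long initialSize = 2L * 128 * 1024 * 1024;           // 2x the region flush size
    long desiredMaxFileSize = 10L * 1024 * 1024 * 1024;  // 10G default max file size
    for (int count = 1; count <= 5; count++) {
      long raw = initialSize * count * count * count;    // cubic growth
      long effective = Math.min(desiredMaxFileSize, raw);
      System.out.println(count + " regions -> raw " + (raw >> 20)
          + "M, effective split size " + (effective >> 20) + "M");
    }
  }
}
{code}
Already at count = 4 the raw value is 16384M, so the min() clamps it to the max 
file size from then on.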


 IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits 
 in rolling update of cluster
 

 Key: HBASE-12451
 URL: https://issues.apache.org/jira/browse/HBASE-12451
 Project: HBase
  Issue Type: Bug
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 2.0.0

 Attachments: HBASE-12451-v1.diff


 Currently IncreasingToUpperBoundRegionSplitPolicy is the default region split 
 policy. In this policy, the split size is the number of regions that are on this 
 server that all are of the same table, cubed, times 2x the region flush size.
 But when unloading regions of a regionserver in a cluster using 
 region_mover.rb, the number of regions that are on this server that all are 
 of the same table will decrease, and the split size will decrease too, which 
 may cause the remaining regions to split in the regionserver. Region splits also 
 happen when loading regions of a regionserver in a cluster. 
 An improvement may be to set a minimum split size in 
 IncreasingToUpperBoundRegionSplitPolicy.
 Suggestions are welcomed. Thanks~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12451) IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster

2014-11-14 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212065#comment-14212065
 ] 

Qiang Tian commented on HBASE-12451:


bq. Add a minimum split size will break the first goal, right? 
it is configurable; we can set the default minimum size to 0, and the calculation 
will ignore the minimum size in that case: (Pseudocode)
{code}
long splitSize = tableRegionsCount == 0 || tableRegionsCount > 100 ? getDesiredMaxFileSize() :
    Math.min(getDesiredMaxFileSize(),
        this.initialSize * tableRegionsCount * tableRegionsCount * tableRegionsCount);

long minimumSize = conf.getLong("hbase.hregion.split.minimum.size", 0);
return minimumSize > 0 ? Math.max(splitSize, minimumSize) : splitSize;
{code}

bq. And actually, there is a hbase.increasing.policy.initial.size 
configuration in IncreasingToUpperBoundRegionSplitPolicy which specify the 
initial(minimum) split size. 

besides initial.size, tableRegionsCount is also a variable.

bq. And if you do not like the curve of split size, I think we should introduce 
a new split policy instead?
I do not mean I do not like it; it looks to me that the case described in the jira 
is not a common case? so a simple fix can resolve it quickly...if there are 
other issues I am open to ideas :-)




 IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits 
 in rolling update of cluster
 

 Key: HBASE-12451
 URL: https://issues.apache.org/jira/browse/HBASE-12451
 Project: HBase
  Issue Type: Bug
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 2.0.0

 Attachments: HBASE-12451-v1.diff


 Currently IncreasingToUpperBoundRegionSplitPolicy is the default region split 
 policy. In this policy, the split size is the number of regions that are on this 
 server that all are of the same table, cubed, times 2x the region flush size.
 But when unloading regions of a regionserver in a cluster using 
 region_mover.rb, the number of regions that are on this server that all are 
 of the same table will decrease, and the split size will decrease too, which 
 may cause the remaining regions to split in the regionserver. Region splits also 
 happen when loading regions of a regionserver in a cluster. 
 An improvement may be to set a minimum split size in 
 IncreasingToUpperBoundRegionSplitPolicy.
 Suggestions are welcomed. Thanks~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12451) IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster

2014-11-14 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212154#comment-14212154
 ] 

Qiang Tian commented on HBASE-12451:



Had an offline discussion with Duo and got more details about the scenario. Now I 
understand the requirement. thanks


 IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits 
 in rolling update of cluster
 

 Key: HBASE-12451
 URL: https://issues.apache.org/jira/browse/HBASE-12451
 Project: HBase
  Issue Type: Bug
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 2.0.0

 Attachments: HBASE-12451-v1.diff


 Currently IncreasingToUpperBoundRegionSplitPolicy is the default region split 
 policy. In this policy, the split size is the number of regions that are on this 
 server that all are of the same table, cubed, times 2x the region flush size.
 But when unloading regions of a regionserver in a cluster using 
 region_mover.rb, the number of regions that are on this server that all are 
 of the same table will decrease, and the split size will decrease too, which 
 may cause the remaining regions to split in the regionserver. Region splits also 
 happen when loading regions of a regionserver in a cluster. 
 An improvement may be to set a minimum split size in 
 IncreasingToUpperBoundRegionSplitPolicy.
 Suggestions are welcomed. Thanks~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel

2014-11-14 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213383#comment-14213383
 ] 

Qiang Tian commented on HBASE-12359:


Hi Stack,
that is strange. the Jenkins run and the local run were both fine. My local run 
against the latest master branch on a linux VM is also fine..
from the OpenJDK code (presumably similar to the Oracle JDK code), we hit the null below:

{code}
/**
 * Returns any IPv4 address of the given network interface, or
 * null if the interface does not have any IPv4 addresses.
 */
static Inet4Address anyInet4Address(final NetworkInterface interf) {
    return AccessController.doPrivileged(new PrivilegedAction<Inet4Address>() {
        public Inet4Address run() {
            Enumeration<InetAddress> addrs = interf.getInetAddresses();
            while (addrs.hasMoreElements()) {
                InetAddress addr = addrs.nextElement();
                if (addr instanceof Inet4Address) {
                    return (Inet4Address)addr;
                }
            }
            return null;  // <=== here

{code}

and

{code}
/**
 * Joins channel's socket to the given group/interface and
 * optional source address.
 */
private MembershipKey innerJoin(InetAddress group,
                                NetworkInterface interf,
                                InetAddress source)
...
    Inet4Address target = Net.anyInet4Address(interf);
    if (target == null)
        throw new IOException("Network interface not configured for IPv4");
{code}

the interf comes from the MulticastPublisher code below:

{code}
public void connect(Configuration conf) throws IOException {
  NetworkInterface ni = NetworkInterface.getByInetAddress(Addressing.getIpAddress());
{code}

but I do not change it... Where did you run it? which OS?
thanks.
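
For what it's worth, a sketch (illustration only, not the HBASE-12359/12742 
patch) of one way to avoid handing innerJoin an interface with no IPv4 address: 
scan all interfaces and pick the first one that actually carries an Inet4Address:
{code}
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Enumeration;

public class Ipv4InterfaceFinder {
  // Returns the first interface with an IPv4 address, or null if none exists;
  // such an interface can join an IPv4 multicast group without hitting the
  // "Network interface not configured for IPv4" IOException.
  static NetworkInterface findIpv4Interface() throws SocketException {
    Enumeration<NetworkInterface> nis = NetworkInterface.getNetworkInterfaces();
    while (nis.hasMoreElements()) {
      NetworkInterface ni = nis.nextElement();
      Enumeration<InetAddress> addrs = ni.getInetAddresses();
      while (addrs.hasMoreElements()) {
        if (addrs.nextElement() instanceof Inet4Address) {
          return ni;
        }
      }
    }
    return null;
  }
}
{code}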



 MulticastPublisher should specify IPv4/v6 protocol family when creating 
 multicast channel
 -

 Key: HBASE-12359
 URL: https://issues.apache.org/jira/browse/HBASE-12359
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.0.0
Reporter: Qiang Tian
Assignee: Qiang Tian
Priority: Minor
 Attachments: hbase-12359-master.patch


 see http://osdir.com/ml/general/2014-10/msg56689.html
 {code}
 2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error 
 starting cluster
 java.lang.RuntimeException: Failed construction of Master: class 
 org.apache.hadoop.hbase.master.HMaster
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:153)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:94)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
at 
 org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
 

[jira] [Updated] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel

2014-11-14 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-12359:
---
Status: Open  (was: Patch Available)

 MulticastPublisher should specify IPv4/v6 protocol family when creating 
 multicast channel
 -

 Key: HBASE-12359
 URL: https://issues.apache.org/jira/browse/HBASE-12359
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.0.0
Reporter: Qiang Tian
Assignee: Qiang Tian
Priority: Minor
 Attachments: hbase-12359-master.patch


 see http://osdir.com/ml/general/2014-10/msg56689.html
 {code}
 2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error 
 starting cluster
 java.lang.RuntimeException: Failed construction of Master: class 
 org.apache.hadoop.hbase.master.HMaster
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.<init>(LocalHBaseCluster.java:153)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:94)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
at 
 org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 
 multicast group
at 
 sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
at 
 io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher.<init>(ClusterStatusPublisher.java:121)
   at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:307)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142)
... 26 more
 {code}
 the exception comes from below JDK code:
 {code}
 if (group instanceof Inet4Address) {
     if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
         throw new 

[jira] [Updated] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel

2014-11-14 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-12359:
---
Status: Patch Available  (was: Open)

try Jenkins again.

 MulticastPublisher should specify IPv4/v6 protocol family when creating 
 multicast channel
 -

 Key: HBASE-12359
 URL: https://issues.apache.org/jira/browse/HBASE-12359
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.0.0
Reporter: Qiang Tian
Assignee: Qiang Tian
Priority: Minor
 Attachments: hbase-12359-master.patch


 see http://osdir.com/ml/general/2014-10/msg56689.html
 {code}
 2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error 
 starting cluster
 java.lang.RuntimeException: Failed construction of Master: class 
 org.apache.hadoop.hbase.master.HMaster
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
at 
 org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 
 multicast group
at 
 sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
at 
 io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142)
... 26 more
 {code}
 the exception comes from the JDK code below:
 {code}
  765   if (group instanceof Inet4Address) {
  766       if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
  767           throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
 {code}

[jira] [Commented] (HBASE-12451) IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster

2014-11-10 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14205828#comment-14205828
 ] 

Qiang Tian commented on HBASE-12451:


A minimum split size is good and simple enough for me: users could trade off 
automatic tuning against customization based on knowledge of their workload. 
(Many times we do not want to expose too many configuration parameters, but it 
looks really useful in some cases. :-))

Basing it on the total region count looks hard to control. For example, if a 
user pre-splits many regions, as in 
http://search-hadoop.com/m/DHED4aS08G1 with 240 regions, the computed size will 
be quite big unless hbase.increasing.policy.initial.size is also configured. A 
sketch of the size computation is below.
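
A minimal sketch of the split-size rule described in this issue, assuming the 
default 128 MB memstore flush size and 10 GB max region size (the class and 
method names are illustrative, not the actual HBase source):

{code}
// Illustrative reconstruction of IncreasingToUpperBoundRegionSplitPolicy's rule:
// split size = (count of this table's regions on this server)^3 * 2 * flush size,
// capped at the configured maximum region size.
public class SplitSizeSketch {
  static final long FLUSH_SIZE = 128L * 1024 * 1024;          // assumed default flush size
  static final long MAX_FILE_SIZE = 10L * 1024 * 1024 * 1024; // assumed default max size

  static long sizeToCheck(int tableRegionsOnThisServer) {
    long initialSize = 2 * FLUSH_SIZE; // hbase.increasing.policy.initial.size would override this
    long cubed = (long) tableRegionsOnThisServer * tableRegionsOnThisServer
        * tableRegionsOnThisServer;
    return Math.min(MAX_FILE_SIZE, initialSize * cubed);
  }

  public static void main(String[] args) {
    for (int n = 1; n <= 5; n++) {
      System.out.println(n + " region(s) -> " + sizeToCheck(n) / (1024 * 1024) + " MB");
    }
    // Prints 256, 2048, 6912, then the 10240 MB cap from 4 regions on. Unloading
    // regions shrinks the count, so the threshold drops and splits can trigger.
  }
}
{code}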



 IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits 
 in rolling update of cluster
 

 Key: HBASE-12451
 URL: https://issues.apache.org/jira/browse/HBASE-12451
 Project: HBase
  Issue Type: Bug
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 2.0.0


 Currently IncreasingToUpperBoundRegionSplitPolicy is the default region split 
 policy. In this policy, the split size is the number of regions of the same 
 table on this server, cubed, times 2x the region flush size.
 But when unloading regions from a regionserver in a cluster using 
 region_mover.rb, the number of regions of the same table on this server will 
 decrease, and the split size will decrease too, which may cause the remaining 
 regions on that regionserver to split. Region splits also happen when loading 
 regions onto a regionserver in a cluster. 
 An improvement may be to set a minimum split size in 
 IncreasingToUpperBoundRegionSplitPolicy.
 Suggestions are welcomed. Thanks~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12451) IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits in rolling update of cluster

2014-11-10 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14206040#comment-14206040
 ] 

Qiang Tian commented on HBASE-12451:


Thanks Duo, I forgot that. 
With default values, it looks like the region split size will hit the upper 
limit after 3 regions.
According to http://hbase.apache.org/book/ops.capacity.html, region count and 
region size are the most important factors, but there is no clear answer for 
region count.

bq. If we already have 240 regions of a table, and there is only one region of 
this table on a regionserver, should the region have a small split size?

the regions should be evenly spread across the regionservers (8 RS in that case)






 IncreasingToUpperBoundRegionSplitPolicy may cause unnecessary region splits 
 in rolling update of cluster
 

 Key: HBASE-12451
 URL: https://issues.apache.org/jira/browse/HBASE-12451
 Project: HBase
  Issue Type: Bug
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 2.0.0


 Currently IncreasingToUpperBoundRegionSplitPolicy is the default region split 
 policy. In this policy, the split size is the number of regions of the same 
 table on this server, cubed, times 2x the region flush size.
 But when unloading regions from a regionserver in a cluster using 
 region_mover.rb, the number of regions of the same table on this server will 
 decrease, and the split size will decrease too, which may cause the remaining 
 regions on that regionserver to split. Region splits also happen when loading 
 regions onto a regionserver in a cluster. 
 An improvement may be to set a minimum split size in 
 IncreasingToUpperBoundRegionSplitPolicy.
 Suggestions are welcomed. Thanks~



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel

2014-11-05 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14199630#comment-14199630
 ] 

Qiang Tian commented on HBASE-12359:


Hi [~stack], 
running it on Linux is fine.
thanks.


 MulticastPublisher should specify IPv4/v6 protocol family when creating 
 multicast channel
 -

 Key: HBASE-12359
 URL: https://issues.apache.org/jira/browse/HBASE-12359
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.0.0
Reporter: Qiang Tian
Assignee: Qiang Tian
Priority: Minor
 Attachments: hbase-12359-master.patch


 see http://osdir.com/ml/general/2014-10/msg56689.html
 {code}
 2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error 
 starting cluster
 java.lang.RuntimeException: Failed construction of Master: class 
 org.apache.hadoop.hbase.master.HMaster
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
at 
 org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 
 multicast group
at 
 sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
at 
 io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142)
... 26 more
 {code}
 the exception comes from the JDK code below:
 {code}
  765   if (group instanceof Inet4Address) {
  766       if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
  767           throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
 {code}

[jira] [Commented] (HBASE-12336) RegionServer failed to shutdown for NodeFailoverWorker thread

2014-11-05 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14199654#comment-14199654
 ] 

Qiang Tian commented on HBASE-12336:


Hi [~stack], 
As I understand it, ZOOKEEPER-2012 could apply to this issue as well.
The root of the problem is that zk uses 2 queues for request handling. When a 
packet is not on one of the 2 queues, the exception handling in the send thread 
(in this case, possibly because the cluster restarted?) will just ignore the 
packet, so the main thread will never get a response and will hang there. But 
we need more data for proof (so far the occurrence is rare).
thanks.


 RegionServer failed to shutdown for NodeFailoverWorker thread
 -

 Key: HBASE-12336
 URL: https://issues.apache.org/jira/browse/HBASE-12336
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.11
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 2.0.0, 0.94.26, 0.98.9, 0.99.2

 Attachments: HBASE-12336-trunk-v1.diff, stack


 After enabling hbase.zookeeper.useMulti in hbase cluster, we found that 
 regionserver failed to shutdown. Other threads have exited except a 
 NodeFailoverWorker thread.
 {code}
 ReplicationExecutor-0 prio=10 tid=0x7f0d40195ad0 nid=0x73a in 
 Object.wait() [0x7f0dc8fe6000]
java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 at java.lang.Object.wait(Object.java:485)
 at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
 - locked 0x0005a16df080 (a 
 org.apache.zookeeper.ClientCnxn$Packet)
 at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:930)
 at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:912)
 at 
 org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:531)
 at 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1518)
 at 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:804)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:612)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 {code}
 It is certain that the shutdown method of the executor is called in 
 ReplicationSourceManager#join.
  
 I am looking for the root cause and suggestions are welcomed. Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel

2014-11-03 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194370#comment-14194370
 ] 

Qiang Tian commented on HBASE-12359:


Thanks [~ashish singhi],
just noticed there are only 2 watchers! :-) ping [~stack]

Considering the description in the JDK documentation:
bq. The creation of the channel should specify the ProtocolFamily that 
corresponds to the address type of the multicast groups that the channel will 
join.

I'd think putting the new factory code into netty is better. A sketch of what 
that could look like is below.
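
A minimal sketch of such a factory, assuming Netty 4's NioDatagramChannel 
constructor that takes an InternetProtocolFamily (the factory class itself is 
illustrative, not the actual patch):

{code}
import io.netty.channel.ChannelFactory;
import io.netty.channel.socket.InternetProtocolFamily;
import io.netty.channel.socket.nio.NioDatagramChannel;

import java.net.Inet6Address;
import java.net.InetAddress;

// Open the datagram channel with the protocol family matching the multicast
// group's address type, instead of inheriting the platform default family.
public final class MulticastChannelFactory implements ChannelFactory<NioDatagramChannel> {
  private final InternetProtocolFamily family;

  public MulticastChannelFactory(InetAddress group) {
    this.family = (group instanceof Inet6Address)
        ? InternetProtocolFamily.IPv6
        : InternetProtocolFamily.IPv4;
  }

  @Override
  public NioDatagramChannel newChannel() {
    return new NioDatagramChannel(family);
  }
}
{code}

A Bootstrap configured with this factory could then join an IPv4 group on an 
IPv4 socket (and an IPv6 group on an IPv6 socket) and avoid the mismatch above.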



 MulticastPublisher should specify IPv4/v6 protocol family when creating 
 multicast channel
 -

 Key: HBASE-12359
 URL: https://issues.apache.org/jira/browse/HBASE-12359
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.0.0
Reporter: Qiang Tian
Assignee: Qiang Tian
Priority: Minor
 Attachments: hbase-12359-master.patch


 see http://osdir.com/ml/general/2014-10/msg56689.html
 {code}
 2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error 
 starting cluster
 java.lang.RuntimeException: Failed construction of Master: class 
 org.apache.hadoop.hbase.master.HMaster
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
at 
 org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 
 multicast group
at 
 sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
at 
 io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at 
 

[jira] [Updated] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel

2014-11-02 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-12359:
---
Summary: MulticastPublisher should specify IPv4/v6 protocol family when 
creating multicast channel  (was: TestHCM.java failed on windows )

 MulticastPublisher should specify IPv4/v6 protocol family when creating 
 multicast channel
 -

 Key: HBASE-12359
 URL: https://issues.apache.org/jira/browse/HBASE-12359
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.0.0
Reporter: Qiang Tian
Assignee: Qiang Tian
Priority: Minor

 see http://osdir.com/ml/general/2014-10/msg56689.html
 {code}
 2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error 
 starting cluster
 java.lang.RuntimeException: Failed construction of Master: class 
 org.apache.hadoop.hbase.master.HMaster
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
at 
 org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 
 multicast group
at 
 sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
at 
 io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142)
... 26 more
 {code}
 the exception comes from the JDK code below:
 {code}
  765   if (group instanceof Inet4Address) {
  766       if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
  767           throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
 {code}

[jira] [Updated] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel

2014-11-02 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-12359:
---
Attachment: hbase-12359-master.patch

uploading the master patch

 MulticastPublisher should specify IPv4/v6 protocol family when creating 
 multicast channel
 -

 Key: HBASE-12359
 URL: https://issues.apache.org/jira/browse/HBASE-12359
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.0.0
Reporter: Qiang Tian
Assignee: Qiang Tian
Priority: Minor
 Attachments: hbase-12359-master.patch


 see http://osdir.com/ml/general/2014-10/msg56689.html
 {code}
 2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error 
 starting cluster
 java.lang.RuntimeException: Failed construction of Master: class 
 org.apache.hadoop.hbase.master.HMaster
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
at 
 org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 
 multicast group
at 
 sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
at 
 io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142)
... 26 more
 {code}
 the exception comes from the JDK code below:
 {code}
  765   if (group instanceof Inet4Address) {
  766       if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
  767           throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
 {code}

[jira] [Updated] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel

2014-11-02 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-12359:
---
Status: Patch Available  (was: Open)

 MulticastPublisher should specify IPv4/v6 protocol family when creating 
 multicast channel
 -

 Key: HBASE-12359
 URL: https://issues.apache.org/jira/browse/HBASE-12359
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.0.0
Reporter: Qiang Tian
Assignee: Qiang Tian
Priority: Minor
 Attachments: hbase-12359-master.patch


 see http://osdir.com/ml/general/2014-10/msg56689.html
 {code}
 2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error 
 starting cluster
 java.lang.RuntimeException: Failed construction of Master: class 
 org.apache.hadoop.hbase.master.HMaster
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
at 
 org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 
 multicast group
at 
 sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
at 
 io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142)
... 26 more
 {code}
 the exception comes from the JDK code below:
 {code}
  765   if (group instanceof Inet4Address) {
  766       if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
  767           throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
 {code}

[jira] [Commented] (HBASE-12359) MulticastPublisher should specify IPv4/v6 protocol family when creating multicast channel

2014-11-02 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194325#comment-14194325
 ] 

Qiang Tian commented on HBASE-12359:


Hi [~ashish singhi],
could you try the patch in your Windows environment?


 MulticastPublisher should specify IPv4/v6 protocol family when creating 
 multicast channel
 -

 Key: HBASE-12359
 URL: https://issues.apache.org/jira/browse/HBASE-12359
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.0.0
Reporter: Qiang Tian
Assignee: Qiang Tian
Priority: Minor
 Attachments: hbase-12359-master.patch


 see http://osdir.com/ml/general/2014-10/msg56689.html
 {code}
 2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error 
 starting cluster
 java.lang.RuntimeException: Failed construction of Master: class 
 org.apache.hadoop.hbase.master.HMaster
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
at 
 org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 
 multicast group
at 
 sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
at 
 io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142)
... 26 more
 {code}
 the exception comes from the JDK code below:
 {code}
  765   if (group instanceof Inet4Address) {
  766       if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
  767           throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
 {code}

[jira] [Commented] (HBASE-12336) RegionServer failed to shutdown for NodeFailoverWorker thread

2014-11-02 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14194334#comment-14194334
 ] 

Qiang Tian commented on HBASE-12336:


bq. Maybe we need to look at server.isStopped inside ReplicationSourceManager 
more often than we do?

[~stack], I think it is a zookeeper issue - the stacktrace shows the zk client 
thread is stuck
{code}
Object.wait() [0x7f0dc8fe6000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
{code}

so it cannot shut down. 

As shown in ZOOKEEPER-2012, there is a timing hole between the zk client 'main' 
thread and the 'main-SendThread' thread. A sketch of the wait the caller is 
stuck in is below.
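
A minimal sketch of the wait pattern behind ClientCnxn.submitRequest, 
reconstructed from the stack trace above (queuePacket and the field names are 
assumptions, not the exact ZooKeeper source):

{code}
// The caller queues a Packet and then blocks on it until the send thread marks
// it finished. If the send thread hits an exception and drops the packet
// without notifying (the timing hole above), this wait never returns and the
// NodeFailoverWorker thread hangs, blocking regionserver shutdown.
public ReplyHeader submitRequest(RequestHeader h, Record request, Record response)
    throws InterruptedException {
  Packet packet = queuePacket(h, request, response); // assumed helper
  synchronized (packet) {
    while (!packet.finished) {
      packet.wait(); // ClientCnxn.java:1309 in the stack trace above
    }
  }
  return packet.replyHeader;
}
{code}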


 RegionServer failed to shutdown for NodeFailoverWorker thread
 -

 Key: HBASE-12336
 URL: https://issues.apache.org/jira/browse/HBASE-12336
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.11
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Fix For: 2.0.0, 0.94.26, 0.98.9, 0.99.2

 Attachments: HBASE-12336-trunk-v1.diff, stack


 After enabling hbase.zookeeper.useMulti in hbase cluster, we found that 
 regionserver failed to shutdown. Other threads have exited except a 
 NodeFailoverWorker thread.
 {code}
 ReplicationExecutor-0 prio=10 tid=0x7f0d40195ad0 nid=0x73a in 
 Object.wait() [0x7f0dc8fe6000]
java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 at java.lang.Object.wait(Object.java:485)
 at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
 - locked 0x0005a16df080 (a 
 org.apache.zookeeper.ClientCnxn$Packet)
 at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:930)
 at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:912)
 at 
 org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:531)
 at 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1518)
 at 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:804)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:612)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 {code}
 It is certain that the shutdown method of the executor is called in 
 ReplicationSourceManager#join.
  
 I am looking for the root cause and suggestions are welcomed. Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-28 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11368:
---
Status: Patch Available  (was: Open)

 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Qiang Tian
 Attachments: hbase-11368-0.98.5.patch, hbase11368-master.patch, 
 key_stacktrace_hbase10882.TXT, performance_improvement_verification_98.5.patch


 Compactions take a read lock.  If the region has multiple column families, we 
 want to take a write lock on the region before bulk loading.  If the 
 compaction takes too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a workaround.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when there are multiple column families?  Can we fail more 
 gracefully at least? A sketch of the lock interaction is below.
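
A minimal sketch of the contention described above, using a plain 
ReentrantReadWriteLock as a stand-in for the region lock (the timeout and 
messages are illustrative):

{code}
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class BulkLoadLockSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Compaction thread: holds the read lock for the whole, possibly long, run.
  void compact() {
    lock.readLock().lock();
    try {
      // rewrite store files ...
    } finally {
      lock.readLock().unlock();
    }
  }

  // Bulk load thread: wants the write lock across all column families, and
  // fails if a long compaction keeps the read lock past the timeout.
  void bulkLoad() throws IOException, InterruptedException {
    if (!lock.writeLock().tryLock(60, TimeUnit.SECONDS)) {
      throw new IOException("failed to get region write lock for bulk load");
    }
    try {
      // move the prepared HFiles into place ...
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}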



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-28 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11368:
---
Attachment: hbase11368-master.patch

patch for master branch

 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Qiang Tian
 Attachments: hbase-11368-0.98.5.patch, hbase11368-master.patch, 
 key_stacktrace_hbase10882.TXT, performance_improvement_verification_98.5.patch


 Compactions take a read lock.  If the region has multiple column families, we 
 want to take a write lock on the region before bulk loading.  If the 
 compaction takes too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a workaround.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when there are multiple column families?  Can we fail more 
 gracefully at least?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12357) TestHCM#testClusterStatus is continuously failing in jenkins

2014-10-28 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186563#comment-14186563
 ] 

Qiang Tian commented on HBASE-12357:


it looks like it is here :-)
{quote}
Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 
multicast group
   at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
   at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
   at 
io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
   at 
org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
   at 
org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121)
   at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
   at 
org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142)
{quote}

{code}
 756   private MembershipKey innerJoin(InetAddress group,
 757                                   NetworkInterface interf,
 758                                   InetAddress source)
 759       throws IOException
 760   {
 761       if (!group.isMulticastAddress())
 762           throw new IllegalArgumentException("Group not a multicast address");
 763
 764       // check multicast address is compatible with this socket
 765       if (group instanceof Inet4Address) {
 766           if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
 767               throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
 768       } else if (group instanceof Inet6Address) {
 769           if (family != StandardProtocolFamily.INET6)
 770               throw new IllegalArgumentException("Only IPv6 sockets can join IPv6 multicast group");
 771       } else {
 772           throw new IllegalArgumentException("Address type not supported");
 773       }
{code}

it looks like we need to specify {{StandardProtocolFamily.INET6}} for an IPv6 
address? A sketch of matching the family to the group address is below.
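
A minimal sketch of matching the protocol family to the group address with 
plain NIO (the group address, NIC name, and port are made up for illustration):

{code}
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.StandardProtocolFamily;
import java.net.StandardSocketOptions;
import java.nio.channels.DatagramChannel;
import java.nio.channels.MembershipKey;

public class JoinGroupSketch {
  public static void main(String[] args) throws Exception {
    InetAddress group = InetAddress.getByName("226.1.1.3");    // illustrative group
    NetworkInterface nic = NetworkInterface.getByName("eth0"); // illustrative NIC

    // Match the channel's protocol family to the group's address type, so the
    // family check in innerJoin above passes for both IPv4 and IPv6 groups.
    StandardProtocolFamily family = (group instanceof Inet6Address)
        ? StandardProtocolFamily.INET6
        : StandardProtocolFamily.INET;

    try (DatagramChannel channel = DatagramChannel.open(family)) {
      channel.setOption(StandardSocketOptions.SO_REUSEADDR, true)
             .bind(new InetSocketAddress(16100)) // illustrative port
             .setOption(StandardSocketOptions.IP_MULTICAST_IF, nic);
      MembershipKey key = channel.join(group, nic);
      System.out.println("joined: " + key);
    }
  }
}
{code}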



 TestHCM#testClusterStatus is continuously failing in jenkins
 ---

 Key: HBASE-12357
 URL: https://issues.apache.org/jira/browse/HBASE-12357
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 1.0.0
Reporter: Ashish Singhi
Assignee: Dima Spivak

 {noformat}Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 
 187.475 sec <<< FAILURE! - in org.apache.hadoop.hbase.client.TestHCM
 testClusterStatus(org.apache.hadoop.hbase.client.TestHCM)  Time elapsed: 
 41.477 sec <<< ERROR!
 java.lang.Exception: Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
   at junit.framework.Assert.fail(Assert.java:57)
   at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
   at 
 org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-12357) TestHCM#testClusterStatus is continuously failing in jenkins

2014-10-28 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-12357:
---
Description: 
{}


{noformat}Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 
187.475 sec <<< FAILURE! - in org.apache.hadoop.hbase.client.TestHCM
testClusterStatus(org.apache.hadoop.hbase.client.TestHCM)  Time elapsed: 41.477 
sec <<< ERROR!
java.lang.Exception: Unexpected exception, 
expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but 
was<junit.framework.AssertionFailedError>
at junit.framework.Assert.fail(Assert.java:57)
at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
at 
org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
at 
org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
{noformat}

  was:
{noformat}Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 
187.475 sec <<< FAILURE! - in org.apache.hadoop.hbase.client.TestHCM
testClusterStatus(org.apache.hadoop.hbase.client.TestHCM)  Time elapsed: 41.477 
sec <<< ERROR!
java.lang.Exception: Unexpected exception, 
expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but 
was<junit.framework.AssertionFailedError>
at junit.framework.Assert.fail(Assert.java:57)
at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
at 
org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
at 
org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
{noformat}


 TestHCM#testClusterStatus is continuously failing in jenkins
 ---

 Key: HBASE-12357
 URL: https://issues.apache.org/jira/browse/HBASE-12357
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 1.0.0
Reporter: Ashish Singhi
Assignee: Dima Spivak

 {}
 {noformat}Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 
 187.475 sec <<< FAILURE! - in org.apache.hadoop.hbase.client.TestHCM
 testClusterStatus(org.apache.hadoop.hbase.client.TestHCM)  Time elapsed: 
 41.477 sec <<< ERROR!
 java.lang.Exception: Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
   at junit.framework.Assert.fail(Assert.java:57)
   at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
   at 
 org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-12357) TestHCM#testClusterStatus is continuously failing in jenkins

2014-10-28 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-12357:
---
Description: 
{noformat}
2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting 
cluster
java.lang.RuntimeException: Failed construction of Master: class 
org.apache.hadoop.hbase.master.HMaster
   at 
org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
   at 
org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
   at 
org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153)
   at 
org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215)
   at 
org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94)
   at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
   at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
   at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
   at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
   at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
   at 
org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:601)
   at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
   at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
   at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
   at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
   at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
   at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
   at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
   at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
   at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
   at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 
multicast group
   at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
   at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
   at 
io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
   at 
org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
   at 
org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121)
   at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
   at 
org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142)
   ... 26 more


Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 187.475 sec 
<<< FAILURE! - in org.apache.hadoop.hbase.client.TestHCM
testClusterStatus(org.apache.hadoop.hbase.client.TestHCM)  Time elapsed: 41.477 
sec <<< ERROR!
java.lang.Exception: Unexpected exception, 
expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> but 
was<junit.framework.AssertionFailedError>
at junit.framework.Assert.fail(Assert.java:57)
at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
at 
org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
at 
org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
{noformat}

  was:
{}


{noformat}Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 
187.475 sec <<< FAILURE! - in org.apache.hadoop.hbase.client.TestHCM
testClusterStatus(org.apache.hadoop.hbase.client.TestHCM)  Time elapsed: 41.477 
sec <<< ERROR!

[jira] [Commented] (HBASE-12357) TestHCM#testClusterStatus is continuously failing in jenkins

2014-10-28 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186580#comment-14186580
 ] 

Qiang Tian commented on HBASE-12357:


oops, the stacktrace comes from Ashish's email.
the code is the JDK code behind:
bq.  java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 multicast 
group


see the error here:
http://docs.oracle.com/javase/7/docs/api/java/nio/channels/MulticastChannel.html

we are hitting here:
{code}
 765   if (group instanceof Inet4Address) {
 766       if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
 767           throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
{code}

the parameter group is IPv4, and family is IPv6 (StandardProtocolFamily.INET6), 
but I do not see the family being specified as IPv6 in 
ClusterStatusPublisher.java. strange..




 TestHCM#testClusterStatus is continuously failing in jenkins
 ---

 Key: HBASE-12357
 URL: https://issues.apache.org/jira/browse/HBASE-12357
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 1.0.0
Reporter: Ashish Singhi
Assignee: Dima Spivak

 {noformat}
 Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 187.475 sec 
  FAILURE! - in org.apache.hadoop.hbase.client.TestHCM
 testClusterStatus(org.apache.hadoop.hbase.client.TestHCM)  Time elapsed: 
 41.477 sec   ERROR!
 java.lang.Exception: Unexpected exception, 
 expectedorg.apache.hadoop.hbase.regionserver.RegionServerStoppedException 
 but wasjunit.framework.AssertionFailedError
   at junit.framework.Assert.fail(Assert.java:57)
   at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
   at 
 org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12357) TestHCM#testClusterStatus is continuously failing in jenkins

2014-10-28 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186584#comment-14186584
 ] 

Qiang Tian commented on HBASE-12357:


[~ashish singhi],
Can you confirm whether it is the same problem?

 TestHCM#testClusterStatus is continuously failing in jenkins
 ---

 Key: HBASE-12357
 URL: https://issues.apache.org/jira/browse/HBASE-12357
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 1.0.0
Reporter: Ashish Singhi
Assignee: Dima Spivak

 {noformat}
 Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 187.475 sec 
 <<< FAILURE! - in org.apache.hadoop.hbase.client.TestHCM
 testClusterStatus(org.apache.hadoop.hbase.client.TestHCM)  Time elapsed: 
 41.477 sec  <<< ERROR!
 java.lang.Exception: Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
   at junit.framework.Assert.fail(Assert.java:57)
   at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
   at 
 org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12357) TestHCM#testClusterStatus is continuously failing in jenkins

2014-10-28 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186602#comment-14186602
 ] 

Qiang Tian commented on HBASE-12357:


ah.. sorry for that, [~dimaspivak].
that looks like a bug too. will open a new jira.



 TestHCM#testClusterStatus is continuously failing in jenkins
 ---

 Key: HBASE-12357
 URL: https://issues.apache.org/jira/browse/HBASE-12357
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 1.0.0
Reporter: Ashish Singhi
Assignee: Dima Spivak

 {noformat}
 Tests run: 21, Failures: 0, Errors: 1, Skipped: 2, Time elapsed: 187.475 sec 
 <<< FAILURE! - in org.apache.hadoop.hbase.client.TestHCM
 testClusterStatus(org.apache.hadoop.hbase.client.TestHCM)  Time elapsed: 
 41.477 sec  <<< ERROR!
 java.lang.Exception: Unexpected exception, 
 expected<org.apache.hadoop.hbase.regionserver.RegionServerStoppedException> 
 but was<junit.framework.AssertionFailedError>
   at junit.framework.Assert.fail(Assert.java:57)
   at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:193)
   at 
 org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3537)
   at 
 org.apache.hadoop.hbase.client.TestHCM.testClusterStatus(TestHCM.java:273)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-12359) TestHCM.java failed on windows

2014-10-28 Thread Qiang Tian (JIRA)
Qiang Tian created HBASE-12359:
--

 Summary: TestHCM.java failed on windows 
 Key: HBASE-12359
 URL: https://issues.apache.org/jira/browse/HBASE-12359
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.0.0
Reporter: Qiang Tian
Priority: Minor


see http://osdir.com/ml/general/2014-10/msg56689.html
{code}
2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error starting 
cluster
java.lang.RuntimeException: Failed construction of Master: class 
org.apache.hadoop.hbase.master.HMaster
   at 
org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
   at 
org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
   at 
org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153)
   at 
org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215)
   at 
org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94)
   at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
   at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
   at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
   at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
   at 
org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
   at 
org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:601)
   at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
   at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
   at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
   at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
   at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
   at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
   at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
   at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
   at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
   at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
   at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
   at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 
multicast group
   at sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
   at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
   at 
io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
   at 
org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
   at 
org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121)
   at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
   at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
   at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
   at 
org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142)
   ... 26 more
{code}

the exception comes from the JDK code below:
{code}
 765   if (group instanceof Inet4Address) {
 766       if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
 767           throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
{code}


according to the documentation 
(http://docs.oracle.com/javase/7/docs/api/java/nio/channels/MulticastChannel.html):

{quote}
The multicast implementation is intended to map directly to the native 
multicasting facility. Consequently, the following items should be considered 
when developing an application that receives IP multicast datagrams:

The creation of the channel should specify the ProtocolFamily that 
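(the quote is truncated in the archive.) following that advice, a minimal sketch 
of a join that succeeds by opening the channel with an explicit IPv4 family; the 
group, port, and NIC name are illustrative, not the actual ClusterStatusPublisher 
code:
{code}
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.StandardProtocolFamily;
import java.net.StandardSocketOptions;
import java.nio.channels.DatagramChannel;
import java.nio.channels.MembershipKey;

public class JoinFix {
  public static void main(String[] args) throws Exception {
    InetAddress group = InetAddress.getByName("226.1.1.3");   // IPv4 group
    NetworkInterface ni = NetworkInterface.getByName("eth0"); // hypothetical NIC
    DatagramChannel ch = DatagramChannel.open(StandardProtocolFamily.INET)
        .setOption(StandardSocketOptions.SO_REUSEADDR, true)
        .bind(new InetSocketAddress(16100))
        .setOption(StandardSocketOptions.IP_MULTICAST_IF, ni);
    MembershipKey key = ch.join(group, ni); // INET socket joins INET group: ok
    System.out.println("joined " + key.group());
  }
}
{code}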

[jira] [Assigned] (HBASE-12359) TestHCM.java failed on windows

2014-10-28 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian reassigned HBASE-12359:
--

Assignee: Qiang Tian

 TestHCM.java failed on windows 
 ---

 Key: HBASE-12359
 URL: https://issues.apache.org/jira/browse/HBASE-12359
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.0.0
Reporter: Qiang Tian
Assignee: Qiang Tian
Priority: Minor

 see http://osdir.com/ml/general/2014-10/msg56689.html
 {code}
 2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error 
 starting cluster
 java.lang.RuntimeException: Failed construction of Master: class 
 org.apache.hadoop.hbase.master.HMaster
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
at 
 org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 
 multicast group
at 
 sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
at 
 io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142)
... 26 more
 {code}
 the exception comes from the JDK code below:
 {code}
  765   if (group instanceof Inet4Address) {
  766       if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
  767           throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
 {code}
 according to the documentation 
 (http://docs.oracle.com/javase/7/docs/api/java/nio/channels/MulticastChannel.html)
 {quote}

[jira] [Commented] (HBASE-12359) TestHCM.java failed on windows

2014-10-28 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14186632#comment-14186632
 ] 

Qiang Tian commented on HBASE-12359:


we should support both the IPv4 and IPv6 families, right?
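
for illustration, a sketch (names are mine, not the patch) that derives the 
protocol family from the configured group address so either family works:
{code}
import java.io.IOException;
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.ProtocolFamily;
import java.net.StandardProtocolFamily;
import java.nio.channels.DatagramChannel;

public final class FamilyPick {
  // choose the channel's protocol family to match the multicast group
  static DatagramChannel openFor(InetAddress group) throws IOException {
    ProtocolFamily family = (group instanceof Inet6Address)
        ? StandardProtocolFamily.INET6
        : StandardProtocolFamily.INET;
    return DatagramChannel.open(family);
  }
}
{code}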


 TestHCM.java failed on windows 
 ---

 Key: HBASE-12359
 URL: https://issues.apache.org/jira/browse/HBASE-12359
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 2.0.0
Reporter: Qiang Tian
Assignee: Qiang Tian
Priority: Minor

 see http://osdir.com/ml/general/2014-10/msg56689.html
 {code}
 2014-10-28 12:21:47,337 ERROR [main] hbase.MiniHBaseCluster(230): Error 
 starting cluster
 java.lang.RuntimeException: Failed construction of Master: class 
 org.apache.hadoop.hbase.master.HMaster
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:145)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.addMaster(LocalHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.LocalHBaseCluster.init(LocalHBaseCluster.java:153)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:215)
at 
 org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:94)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:914)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:877)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:794)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:765)
at 
 org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:752)
at 
 org.apache.hadoop.hbase.client.TestHCM.setUpBeforeClass(TestHCM.java:138)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at 
 org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
 Caused by: java.lang.IllegalArgumentException: IPv6 socket cannot join IPv4 
 multicast group
at 
 sun.nio.ch.DatagramChannelImpl.innerJoin(DatagramChannelImpl.java:779)
at sun.nio.ch.DatagramChannelImpl.join(DatagramChannelImpl.java:865)
at 
 io.netty.channel.socket.nio.NioDatagramChannel.joinGroup(NioDatagramChannel.java:394)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher.connect(ClusterStatusPublisher.java:273)
at 
 org.apache.hadoop.hbase.master.ClusterStatusPublisher.init(ClusterStatusPublisher.java:121)
at org.apache.hadoop.hbase.master.HMaster.init(HMaster.java:307)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
at 
 org.apache.hadoop.hbase.util.JVMClusterUtil.createMasterThread(JVMClusterUtil.java:142)
... 26 more
 {code}
 the exception comes from the JDK code below:
 {code}
  765   if (group instanceof Inet4Address) {
  766       if (family == StandardProtocolFamily.INET6 && !Net.canIPv6SocketJoinIPv4Group())
  767           throw new IllegalArgumentException("IPv6 socket cannot join IPv4 multicast group");
 {code}
 according to 
 

[jira] [Commented] (HBASE-12336) RegionServer failed to shutdown for NodeFailoverWorker thread

2014-10-27 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14184939#comment-14184939
 ] 

Qiang Tian commented on HBASE-12336:


the zookeeper stacktrace looks similar to ZOOKEEPER-2012 - 
ClientCnxn.submitRequest never returns.
http://pastebin.com/xU4MSq9k. 
is there any zookeeper error message in the RS log?



 RegionServer failed to shutdown for NodeFailoverWorker thread
 -

 Key: HBASE-12336
 URL: https://issues.apache.org/jira/browse/HBASE-12336
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.94.11
Reporter: Liu Shaohui
Assignee: Liu Shaohui
Priority: Minor
 Attachments: stack


 After enabling hbase.zookeeper.useMulti in the hbase cluster, we found that the 
 regionserver failed to shut down. Other threads had exited except a 
 NodeFailoverWorker thread.
 {code}
 ReplicationExecutor-0 prio=10 tid=0x7f0d40195ad0 nid=0x73a in 
 Object.wait() [0x7f0dc8fe6000]
java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 at java.lang.Object.wait(Object.java:485)
 at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1309)
 - locked 0x0005a16df080 (a 
 org.apache.zookeeper.ClientCnxn$Packet)
 at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:930)
 at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:912)
 at 
 org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.multi(RecoverableZooKeeper.java:531)
 at 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.multiOrSequential(ZKUtil.java:1518)
 at 
 org.apache.hadoop.hbase.replication.ReplicationZookeeper.copyQueuesFromRSUsingMulti(ReplicationZookeeper.java:804)
 at 
 org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager$NodeFailoverWorker.run(ReplicationSourceManager.java:612)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 {code}
 The shutdown method of the executor is definitely called in 
 ReplicationSourceManager#join.
  
 I am looking for the root cause and suggestions are welcomed. Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-25 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11368:
---
Attachment: key_stacktrace_hbase10882.TXT

Hi [~stack],
Sorry for the confusion. let me explain from scratch:
1) the root cause of the problem - HRegion#lock.
From the stacktrace in HBASE-10882 (also see key_stacktrace_hbase10882.TXT 
attached), the event sequence is: 
1.1) the compaction acquires the readlock of HRegion#lock, 
1.2) the bulkload tries to acquire the writelock of HRegion#lock if there are 
multiple CFs. it has to wait for the compaction to release the readlock.
1.3) scanners try to acquire the readlock of HRegion#lock. they have to wait for 
the bulkload to release the writelock.
so both the bulkload and the scanners are blocked on HRegion#lock by the 
compaction.

2) what is HRegion#lock used for?
Investigation of HRegion#lock shows it was originally designed to protect 
region close ONLY. if someone, such as a region split, wants to close the 
region, it needs to wait for others to release the readlock.  
Then HBASE-4552 used the lock to solve the multi-CF bulkload consistency issue. 
now we see it is too heavy.

3) can we avoid using HRegion#lock in bulkload?
the answer is yes. 
Internally, HStore#DefaultStoreFileManager#storefiles keeps track of the 
on-disk HFiles for a CF. the bulkload takes the steps below:
3.1) move HFiles directly to the region directory
3.2) add them to the {{storefiles}} list
3.3) notify StoreScanner that the HFile list has changed, which is done by 
resetting the StoreScanner#heap to null. this forces existing StoreScanner 
instances to reinitialize based on the new HFiles seen on disk in the next 
scan/read request.
steps 3.2 and 3.3 are synchronized by HStore#lock, so we have CF-level 
scan-bulkload consistency.
 
To achieve multi-CF scan-bulkload consistency, if we do not use HRegion#lock, 
we still need another region-level lock --- a RegionScanner is composed of 
multiple StoreScanners, and a StoreScanner (a CF scanner) is composed of a 
MemStoreScanner and multiple StoreFileScanners.

the RegionScannerImpl#storeHeap (and joinedHeap) is just the entry point into 
the multiple StoreScanners. to have multi-CF consistency, we need 
synchronization here - a lock is needed, but it is used only between scan and 
bulkload.
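
a rough sketch of that lock (illustrative names, not the attached patch):
{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// sketch: a region-level lock touched only by the scan and bulkload paths
class ScanBulkLoadGate {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // called from the RegionScannerImpl entry point around one heap advance
  void runScanStep(Runnable advanceStoreHeap) {
    lock.readLock().lock();
    try {
      advanceStoreHeap.run(); // pop/peek across all StoreScanners for a row
    } finally {
      lock.readLock().unlock();
    }
  }

  // called from HRegion#bulkLoadHFiles in the multi-CF case
  void runBulkLoad(Runnable loadAllFamilies) {
    lock.writeLock().lock();
    try {
      loadAllFamilies.run(); // move HFiles, update storefiles, notify readers
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}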



Regarding the code change you referenced, 
performance_improvement_verification_98.5.patch is to simulate the event 
sequence described in #1, for testing purposes only.

I currently use 0.98.5 for testing since it is stable and makes it easy to 
evaluate the effect of the change.
thanks.









 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Qiang Tian
 Attachments: hbase-11368-0.98.5.patch, key_stacktrace_hbase10882.TXT, 
 performance_improvement_verification_98.5.patch


 Compactions take a read lock.  If a multi-column family region, before bulk 
 loading, we want to take a write lock on the region.  If the compaction takes 
 too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a work around.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when multiple column families?  Can we fail more gracefully 
 at least?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-25 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14184066#comment-14184066
 ] 

Qiang Tian commented on HBASE-11368:


the attachments:
{{key_stacktrace_hbase10882.TXT}} : the problem stacktrace
{{hbase-11368-0.98.5.patch}} : the fix
{{performance_improvement_verification_98.5.patch}}: the testcase to verify 
performance improvement




 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Qiang Tian
 Attachments: hbase-11368-0.98.5.patch, key_stacktrace_hbase10882.TXT, 
 performance_improvement_verification_98.5.patch


 Compactions take a read lock.  If a multi-column family region, before bulk 
 loading, we want to take a write lock on the region.  If the compaction takes 
 too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a work around.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when multiple column families?  Can we fail more gracefully 
 at least?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-24 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11368:
---
Attachment: performance_improvement_verification_98.5.patch

A simple comparison test using an updated TestHRegionServerBulkLoad.java. The 
numbers are just for reference; the real perf improvement might depend on a 
combination of factors, such as compaction time, bulkload time, scan/read 
workload type, request concurrency, etc.


98.5:
---
2014-10-24 02:30:03,399 INFO  [main] 
regionserver.TestHRegionServerBulkLoad(345):   loaded 16
2014-10-24 02:30:03,399 INFO  [main] 
regionserver.TestHRegionServerBulkLoad(346):   compations 16

2014-10-24 02:30:03,399 INFO  [main] 
regionserver.TestHRegionServerBulkLoad(348): Scanners:
//average # with 50 scanners
2014-10-24 02:30:03,399 INFO  [main] 
regionserver.TestHRegionServerBulkLoad(350):   scanned 73
2014-10-24 02:30:03,400 INFO  [main] 
regionserver.TestHRegionServerBulkLoad(351):   verified 18000 rows


98.5+patch

//since bulkload conflicts less with compaction, we get more 
bulkload/compaction requests in the fixed test cycle (5 minutes)
2014-10-24 02:41:19,071 INFO  [main] 
regionserver.TestHRegionServerBulkLoad(344): Loaders:
2014-10-24 02:41:19,072 INFO  [main] 
regionserver.TestHRegionServerBulkLoad(345):   loaded 43
2014-10-24 02:41:19,072 INFO  [main] 
regionserver.TestHRegionServerBulkLoad(346):   compations 43

2014-10-24 02:41:19,073 INFO  [main] 
regionserver.TestHRegionServerBulkLoad(348): Scanners:
 //since bulkload conflicts less with scan, we get more scans in the fixed 
test cycle (5 minutes)
//average # for 50 scanners
2014-10-24 02:41:19,073 INFO  [main] 
regionserver.TestHRegionServerBulkLoad(350):   scanned 92  
2014-10-24 02:41:19,073 INFO  [main] 
regionserver.TestHRegionServerBulkLoad(351):   verified 25000 rows



 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Qiang Tian
 Attachments: hbase-11368-0.98.5.patch, 
 performance_improvement_verification_98.5.patch


 Compactions take a read lock.  If a multi-column family region, before bulk 
 loading, we want to take a write lock on the region.  If the compaction takes 
 too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a work around.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when multiple column families?  Can we fail more gracefully 
 at least?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-24 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182647#comment-14182647
 ] 

Qiang Tian commented on HBASE-11368:


Hi [~stack], [~apurtell],
any comments?
thanks!


 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Qiang Tian
 Attachments: hbase-11368-0.98.5.patch, 
 performance_improvement_verification_98.5.patch


 Compactions take a read lock.  If a multi-column family region, before bulk 
 loading, we want to take a write lock on the region.  If the compaction takes 
 too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a work around.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when multiple column families?  Can we fail more gracefully 
 at least?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-23 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181168#comment-14181168
 ] 

Qiang Tian commented on HBASE-11368:


initial YCSB test:

Env:
---
hadoop 2.2.0
YCSB 1.0.4(Andrew's branch)
3 nodes, 1 master, 2 RS  //ignore cluster details since this is just to 
evaluate the new lock

Steps:
---
Followed Andrew's steps(see http://search-hadoop.com/m/DHED4hl7pC/)
the seed table has 3 CFs, pre-split to 20 regions
load 1 million rows to CF 'f1', using workloada
run 3 iterations of workloadc and workloada respectively. the parameters in 
each run:
bq. -p columnfamily=f1 -p operationcount=100 -s -threads 10


Results:
---
0.98.5:
workload c:
[READ], AverageLatency(us), 496.225811
[READ], AverageLatency(us), 510.206831
[READ], AverageLatency(us), 501.256123

workload a:
[READ], AverageLatency(us), 676.4527555821747
[READ], AverageLatency(us), 622.5544771452717
[READ], AverageLatency(us), 628.1365657163067


0.98.5+patch:
workload c:
[READ], AverageLatency(us), 536.334437
[READ], AverageLatency(us), 508.40
[READ], AverageLatency(us), 491.416182


workload a:
[READ], AverageLatency(us), 640.3625218319231
[READ], AverageLatency(us), 642.9719823488798
[READ], AverageLatency(us), 631.7491770928287

it looks like there is little performance penalty.

I also ran PE in the cluster; since the test table has only 1 CF, the new lock 
is actually not used. interestingly, with the patch the performance is even a 
bit better...

 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Qiang Tian
 Attachments: hbase-11368-0.98.5.patch


 Compactions take a read lock.  If a multi-column family region, before bulk 
 loading, we want to take a write lock on the region.  If the compaction takes 
 too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a work around.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when multiple column families?  Can we fail more gracefully 
 at least?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12274) Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() may produce null pointer exception

2014-10-16 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174650#comment-14174650
 ] 

Qiang Tian commented on HBASE-12274:


Hi Ted,
I also ran mvn test with 0.98.6. I did not hit the scanner error, but did get 
some other strange failures. the UT does not look very clean.

in the RS log, the lease failure looks unexpected as well.

{code}
org.apache.hadoop.hbase.regionserver.LeaseException: lease '8' does not exist
at 
org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:221)
at 
org.apache.hadoop.hbase.regionserver.Leases.cancelLease(Leases.java:206)
at 
org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3305)
{code}
it is from a different rpc handler, just before the NPE. 
did we get a NotServingRegionException? do we have more logs?
thanks.




 Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() 
 may produce null pointer exception
 --

 Key: HBASE-12274
 URL: https://issues.apache.org/jira/browse/HBASE-12274
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.6.1
Reporter: Ted Yu
Assignee: Ted Yu
 Fix For: 2.0.0, 0.98.8, 0.99.2

 Attachments: 12274-region-server.log, 12274-v2.txt, 12274-v2.txt, 
 12274-v3.txt


 I saw the following in region server log:
 {code}
 2014-10-15 03:28:36,976 ERROR 
 [B.DefaultRpcServer.handler=0,queue=0,port=60020] ipc.RpcServer: Unexpected 
 throwable object
 java.lang.NullPointerException
   at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5023)
   at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4932)
   at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4923)
   at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3245)
   at 
 org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29994)
   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078)
   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
   at 
 org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 This is where the NPE happened:
 {code}
 // Let's see what we have in the storeHeap.
 KeyValue current = this.storeHeap.peek();
 {code}
 The cause was race between nextInternal(called through nextRaw) and close 
 methods.
 nextRaw() is not synchronized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12274) Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() may produce null pointer exception

2014-10-16 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174674#comment-14174674
 ] 

Qiang Tian commented on HBASE-12274:


Hi Ted,
perhaps I misunderstood. sorry for that. please go ahead.
thanks.


 Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() 
 may produce null pointer exception
 --

 Key: HBASE-12274
 URL: https://issues.apache.org/jira/browse/HBASE-12274
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.6.1
Reporter: Ted Yu
Assignee: Ted Yu
 Fix For: 2.0.0, 0.98.8, 0.99.2

 Attachments: 12274-region-server.log, 12274-v2.txt, 12274-v2.txt, 
 12274-v3.txt


 I saw the following in region server log:
 {code}
 2014-10-15 03:28:36,976 ERROR 
 [B.DefaultRpcServer.handler=0,queue=0,port=60020] ipc.RpcServer: Unexpected 
 throwable object
 java.lang.NullPointerException
   at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5023)
   at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4932)
   at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4923)
   at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3245)
   at 
 org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29994)
   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078)
   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
   at 
 org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 This is where the NPE happened:
 {code}
 // Let's see what we have in the storeHeap.
 KeyValue current = this.storeHeap.peek();
 {code}
 The cause was race between nextInternal(called through nextRaw) and close 
 methods.
 nextRaw() is not synchronized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HBASE-12266) Slow Scan can cause dead loop in ClientScanner

2014-10-15 Thread Qiang Tian (JIRA)
Qiang Tian created HBASE-12266:
--

 Summary: Slow Scan can cause dead loop in ClientScanner 
 Key: HBASE-12266
 URL: https://issues.apache.org/jira/browse/HBASE-12266
 Project: HBase
  Issue Type: Bug
  Components: Scanners
Affects Versions: 0.96.0
Reporter: Qiang Tian
Priority: Minor


see http://search-hadoop.com/m/DHED45SVsC1.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-12266) Slow Scan can cause dead loop in ClientScanner

2014-10-15 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-12266:
---
Attachment: HBASE-12266-master.patch

is there any particular purpose in setting it to true there?
thanks.

 Slow Scan can cause dead loop in ClientScanner 
 ---

 Key: HBASE-12266
 URL: https://issues.apache.org/jira/browse/HBASE-12266
 Project: HBase
  Issue Type: Bug
  Components: Scanners
Affects Versions: 0.96.0
Reporter: Qiang Tian
Priority: Minor
 Attachments: HBASE-12266-master.patch


 see http://search-hadoop.com/m/DHED45SVsC1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-12266) Slow Scan can cause dead loop in ClientScanner

2014-10-15 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-12266:
---
Status: Patch Available  (was: Open)

 Slow Scan can cause dead loop in ClientScanner 
 ---

 Key: HBASE-12266
 URL: https://issues.apache.org/jira/browse/HBASE-12266
 Project: HBase
  Issue Type: Bug
  Components: Scanners
Affects Versions: 0.96.0
Reporter: Qiang Tian
Priority: Minor
 Attachments: HBASE-12266-master.patch


 see http://search-hadoop.com/m/DHED45SVsC1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12266) Slow Scan can cause dead loop in ClientScanner

2014-10-15 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173259#comment-14173259
 ] 

Qiang Tian commented on HBASE-12266:


Thanks Guys,
frankly, it looks to me that such retries (including HBASE-7070) just make the 
code more complicated to read and easy to grow new bugs in a complex system. 
and they are hard to cover by tests.

As mentioned in that jira: 
bq. 1.A next request is very large, so first time it is failed because of 
timeout

since it is caused by a client-side timeout, why not just throw an exception so 
that the user (or app-layer code) knows about it and sets a bigger value, as in 
the sketch below. the timeout value is case by case; that is why we make it 
configurable, right?  
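
for example (a sketch; this property name is an assumption and may differ 
across versions):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ScannerTimeoutExample {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // raise the client-side scanner timeout instead of relying on retries
    conf.setInt("hbase.client.scanner.timeout.period", 120000); // 2 minutes
  }
}
{code}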







 Slow Scan can cause dead loop in ClientScanner 
 ---

 Key: HBASE-12266
 URL: https://issues.apache.org/jira/browse/HBASE-12266
 Project: HBase
  Issue Type: Bug
  Components: Scanners
Affects Versions: 0.96.0
Reporter: Qiang Tian
Priority: Minor
 Attachments: 12266-v2.txt, HBASE-12266-master.patch


 see http://search-hadoop.com/m/DHED45SVsC1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-12274) Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() may produce null pointer exception

2014-10-15 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173292#comment-14173292
 ] 

Qiang Tian commented on HBASE-12274:


Hi Ted,
is it worth finding out who closes the scanner (is it possible it is closed due 
to a bug)? I ask this because I have the same question for StoreScanner; it 
looks like only the scanner itself can call close.

Regarding synchronized: I did not test it myself, and the cost of a 
synchronized method may be big, but from what I googled a synchronized block is 
fine, even better than a lock.
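
e.g. a sketch of what a synchronized-block approach could look like 
(illustrative names, not the actual patch):
{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// sketch: serialize next/close on one monitor so close() cannot tear down the
// heap under a concurrent next()
class SketchScanner {
  private final Object lock = new Object();
  private List<String> storeHeap = new ArrayList<>(); // stand-in for the KV heap
  private boolean closed;

  boolean nextRaw(List<String> out) throws IOException {
    synchronized (lock) {
      if (closed) {
        throw new IOException("scanner is closed");
      }
      if (storeHeap.isEmpty()) {
        return false;
      }
      out.add(storeHeap.get(0)); // safe: close() cannot null the heap here
      return true;
    }
  }

  void close() {
    synchronized (lock) {
      storeHeap = null;
      closed = true;
    }
  }
}
{code}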


 Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() 
 may produce null pointer exception
 --

 Key: HBASE-12274
 URL: https://issues.apache.org/jira/browse/HBASE-12274
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.6.1
Reporter: Ted Yu
Assignee: Ted Yu
 Fix For: 2.0.0, 0.98.8, 0.99.2

 Attachments: 12274-v2.txt, 12274-v2.txt, 12274-v3.txt


 I saw the following in region server log:
 {code}
 2014-10-15 03:28:36,976 ERROR 
 [B.DefaultRpcServer.handler=0,queue=0,port=60020] ipc.RpcServer: Unexpected 
 throwable object
 java.lang.NullPointerException
   at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5023)
   at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4932)
   at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4923)
   at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3245)
   at 
 org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29994)
   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078)
   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
   at 
 org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 This is where the NPE happened:
 {code}
 // Let's see what we have in the storeHeap.
 KeyValue current = this.storeHeap.peek();
 {code}
 The cause was race between nextInternal(called through nextRaw) and close 
 methods.
 nextRaw() is not synchronized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HBASE-12274) Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() may produce null pointer exception

2014-10-15 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-12274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14173292#comment-14173292
 ] 

Qiang Tian edited comment on HBASE-12274 at 10/16/14 3:27 AM:
--

Hi Ted,
is it worth finding out who closes the scanner (is it possible it is closed due 
to a bug)? I ask this because I have the same question for StoreScanner; it 
looks like only the scanner itself can call close.

Regarding synchronized: I did not test it myself, and the cost of a 
synchronized method may be big, but from what I googled a synchronized block is 
fine, even better than a lock. http://t.cn/R7zVKKB,http://t.cn/R7zVKK1



was (Author: tianq):
Hi Ted,
is it worth to find who close the scanner(is it possible to close it due to 
bug)? I asked this because I have the same question for StoreScanner, it looks 
only itself can call close.

Regarding synchronized, I did not test it myself, synchronized method cost may 
be big, but googled  synchronized block is fine, even better than lock.


 Race between RegionScannerImpl#nextInternal() and RegionScannerImpl#close() 
 may produce null pointer exception
 --

 Key: HBASE-12274
 URL: https://issues.apache.org/jira/browse/HBASE-12274
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.98.6.1
Reporter: Ted Yu
Assignee: Ted Yu
 Fix For: 2.0.0, 0.98.8, 0.99.2

 Attachments: 12274-v2.txt, 12274-v2.txt, 12274-v3.txt


 I saw the following in region server log:
 {code}
 2014-10-15 03:28:36,976 ERROR 
 [B.DefaultRpcServer.handler=0,queue=0,port=60020] ipc.RpcServer: Unexpected 
 throwable object
 java.lang.NullPointerException
   at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5023)
   at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4932)
   at 
 org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:4923)
   at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3245)
   at 
 org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29994)
   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2078)
   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:108)
   at 
 org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:114)
   at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:94)
   at java.lang.Thread.run(Thread.java:745)
 {code}
 This is where the NPE happened:
 {code}
 // Let's see what we have in the storeHeap.
 KeyValue current = this.storeHeap.peek();
 {code}
 The cause was race between nextInternal(called through nextRaw) and close 
 methods.
 nextRaw() is not synchronized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-14 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11368:
---
Attachment: hbase-11368-0.98.5.patch

I forgot StoreScanner is per-CF.. the earlier analysis was wrong:
{quote}
After DefaultStoreFileManager#storefiles is updated in HStore#bulkLoadHFile, 
notifyChangedReadersObservers is called to reset the StoreScanner#heap, so 
checkReseek-resetScannerStack will be triggered in next scan/read to recreate 
store scanners based on new storefiles.

so we could introduce a new region level rwlock multiCFLock, 
HRegion#bulkLoadHFiles acquires the writelock before multi-CF 
HStore.bulkLoadHFile call. and StoreScanner#resetScannerStack acquires the 
readlock. this way the scanners are recreated after all CFs' store files are 
populated.
{quote}

instead, the new lock should be put at the RegionScanner layer. see the patch 
attached.

the mvn test and TestHRegionServerBulkLoad (the large test for atomic bulkload) 
passed; I still need to run the large tests and a performance test (any 
suggestions for it? YCSB?).

the lock can be further limited to a smaller scope by splitting 
HStore#bulkLoadHFile into 2 parts: 1) rename the bulkload files and put the new 
files into the store files list; 2) notifyChangedReadersObservers. only #2 
needs the lock. 
if HDFS file rename is fast, the split may not be needed.



 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Qiang Tian
 Attachments: hbase-11368-0.98.5.patch


 Compactions take a read lock.  If a multi-column family region, before bulk 
 loading, we want to take a write lock on the region.  If the compaction takes 
 too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a work around.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when multiple column families?  Can we fail more gracefully 
 at least?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-14 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14170649#comment-14170649
 ] 

Qiang Tian commented on HBASE-11368:


it looks to me that the patch only shows its value when there is a long 
compaction plus gets/scans; not sure if [~victorunique] wants to try it in some 
test env?
thanks.


 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Qiang Tian
 Attachments: hbase-11368-0.98.5.patch


 Compactions take a read lock.  If a multi-column family region, before bulk 
 loading, we want to take a write lock on the region.  If the compaction takes 
 too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a work around.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when multiple column families?  Can we fail more gracefully 
 at least?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11890) HBase REST Client is hard coded to http protocol

2014-10-11 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168065#comment-14168065
 ] 

Qiang Tian commented on HBASE-11890:


Hi [~stack], there is also a doc update :-)

HBase-11890-doc.patch


 HBase REST Client is hard coded to http protocol
 

 Key: HBASE-11890
 URL: https://issues.apache.org/jira/browse/HBASE-11890
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 0.96.2
Reporter: Eric Yang
Assignee: Qiang Tian
 Fix For: 2.0.0, 0.98.7, 0.99.1

 Attachments: HBase-11890-doc.patch, HBase-11890-master-v1.patch, 
 HBase-11890-master.patch


 HBase REST Client executePathOnly only supports http.  It would be nice if 
 there is a option to enable REST API client to connect through SSL.  
 org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which 
 protocol can be used, we can either set flag in Cluster class or introduce a 
 parameter in Client class to toggle SSL.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-10 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14166466#comment-14166466
 ] 

Qiang Tian commented on HBASE-11368:


update: 
the idea would cause a deadlock, since bulkload and scanner follow different 
orders when acquiring the bulkload lock and StoreScanner.lock. will look at 
whether we could lower the granularity of the StoreScanner lock.


 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Qiang Tian

 Compactions take a read lock.  If a multi-column family region, before bulk 
 loading, we want to take a write lock on the region.  If the compaction takes 
 too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a work around.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when multiple column families?  Can we fail more gracefully 
 at least?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-09 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian reassigned HBASE-11368:
--

Assignee: Qiang Tian

 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Qiang Tian

 Compactions take a read lock.  If a multi-column family region, before bulk 
 loading, we want to take a write lock on the region.  If the compaction takes 
 too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a work around.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when multiple column families?  Can we fail more gracefully 
 at least?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-09 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14164957#comment-14164957
 ] 

Qiang Tian commented on HBASE-11368:


Thanks [~jinghe],
is this the right way to run the bulkload test? {{mvn test 
-Dtest=TestHRegionServerBulkLoad}}
the test is supposed to run for 5 minutes, but it exits after only about 1 
minute. is that expected?

 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: Qiang Tian

 Compactions take a read lock.  If a multi-column family region, before bulk 
 loading, we want to take a write lock on the region.  If the compaction takes 
 too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a work around.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when multiple column families?  Can we fail more gracefully 
 at least?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-08 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163162#comment-14163162
 ] 

Qiang Tian commented on HBASE-11368:


ideas for lowering the lock granularity (based on the 0.98.5 code base):
1) read/scan 
is it the primary goal of atomic multi-CF bulkload in HBASE-4552?

After DefaultStoreFileManager#storefiles is updated in HStore#bulkLoadHFile, 
notifyChangedReadersObservers is called to reset the StoreScanner#heap, so 
checkReseek->resetScannerStack will be triggered in the next scan/read to 
recreate store scanners based on the new storefiles.

so we could introduce a new region-level rwlock multiCFLock: 
HRegion#bulkLoadHFiles acquires the writelock before the multi-CF 
HStore.bulkLoadHFile calls, and StoreScanner#resetScannerStack acquires the 
readlock. this way the scanners are recreated only after all CFs' store files 
are populated.

2) split region.
the region will be closed in SplitTransaction#stepsBeforePONR, which falls 
under the HRegion#lock protection. bulk load would still need to acquire its 
readlock at the start.

3) memstore flush.
we flush to a new file which is not related to the loaded files.

4) compaction.
compaction is performed store by store. if bulkload inserts new files into 
{{storefiles}} during the selectCompaction process, the file list to be 
compacted might be impacted. e.g., the compaction for some CFs might not 
include the newly loaded files, while for others it might. but does this impact 
data integrity or read behavior?
at the end of compaction, {{storefiles}} access is still protected by 
HStore#lock if there is a bulk load change to the same CF.

comments?
thanks















 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack

 Compactions take a read lock.  If a multi-column family region, before bulk 
 loading, we want to take a write lock on the region.  If the compaction takes 
 too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a work around.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when multiple column families?  Can we fail more gracefully 
 at least?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11368) Multi-column family BulkLoad fails if compactions go on too long

2014-10-07 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14163024#comment-14163024
 ] 

Qiang Tian commented on HBASE-11368:


As [~stack] mentioned in http://search-hadoop.com/m/DHED4NR0wT, the 
HRegion#lock is there to protect region close. The comments in HRegion.java, 
and the fact that only HRegion#doClose takes the write lock (if we do not 
count HRegion#startBulkRegionOperation), also show that.

So using HRegion#lock to protect multi-CF bulk load in HBASE-4552 looks too 
heavy-weight? From the stack traces of HBASE-10882, all reads/scans are 
blocked because bulk load is waiting for lock.writeLock while compaction has 
already acquired lock.readLock and is reading data, a time-consuming 
operation.

A related topic is discussed again in http://search-hadoop.com/m/DHED4I11p31; 
perhaps we need another region-level lock.
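
A small self-contained demo of that blocking pattern, using a plain 
ReentrantReadWriteLock the way HRegion#lock does; the thread names are just 
labels for the scenario:
{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RegionLockDemo {
  public static void main(String[] args) throws Exception {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    lock.readLock().lock();  // "compaction" (main thread) holds the read lock

    Thread bulkload = new Thread(new Runnable() {
      public void run() {
        lock.writeLock().lock();   // "bulkload" queues for the write lock
        lock.writeLock().unlock();
      }
    });
    bulkload.start();
    Thread.sleep(200);  // let the bulkload thread park in the wait queue

    Thread scan = new Thread(new Runnable() {
      public void run() {
        try {
          // fails: a writer is first in the wait queue, so a *new* reader
          // must wait even though only a read lock is held -- prints false
          System.out.println("scan got read lock: "
              + lock.readLock().tryLock(0, TimeUnit.MILLISECONDS));
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    scan.start();
    scan.join();

    lock.readLock().unlock();  // "compaction" finishes; bulkload proceeds
    bulkload.join();
  }
}
{code}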











 Multi-column family BulkLoad fails if compactions go on too long
 

 Key: HBASE-11368
 URL: https://issues.apache.org/jira/browse/HBASE-11368
 Project: HBase
  Issue Type: Bug
Reporter: stack

 Compactions take a read lock.  If a multi-column family region, before bulk 
 loading, we want to take a write lock on the region.  If the compaction takes 
 too long, the bulk load fails.
 Various recipes include:
 + Making smaller regions (lame)
 + [~victorunique] suggests major compacting just before bulk loading over in 
 HBASE-10882 as a work around.
 Does the compaction need a read lock for that long?  Does the bulk load need 
 a full write lock when multiple column families?  Can we fail more gracefully 
 at least?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11890) HBase REST Client is hard coded to http protocol

2014-09-24 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11890:
---
Attachment: HBase-11890-doc.patch

 HBase REST Client is hard coded to http protocol
 

 Key: HBASE-11890
 URL: https://issues.apache.org/jira/browse/HBASE-11890
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 0.96.2
Reporter: Eric Yang
Assignee: Qiang Tian
 Attachments: HBase-11890-doc.patch, HBase-11890-master.patch


 HBase REST Client executePathOnly only supports http.  It would be nice if 
 there is a option to enable REST API client to connect through SSL.  
 org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which 
 protocol can be used, we can either set flag in Cluster class or introduce a 
 parameter in Client class to toggle SSL.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11890) HBase REST Client is hard coded to http protocol

2014-09-24 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11890:
---
Status: Patch Available  (was: Open)

 HBase REST Client is hard coded to http protocol
 

 Key: HBASE-11890
 URL: https://issues.apache.org/jira/browse/HBASE-11890
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 0.96.2
Reporter: Eric Yang
Assignee: Qiang Tian
 Attachments: HBase-11890-doc.patch, HBase-11890-master.patch


 HBase REST Client executePathOnly only supports http.  It would be nice if 
 there is a option to enable REST API client to connect through SSL.  
 org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which 
 protocol can be used, we can either set flag in Cluster class or introduce a 
 parameter in Client class to toggle SSL.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11890) HBase REST Client is hard coded to http protocol

2014-09-24 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11890:
---
Attachment: HBase-11890-master.patch

 HBase REST Client is hard coded to http protocol
 

 Key: HBASE-11890
 URL: https://issues.apache.org/jira/browse/HBASE-11890
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 0.96.2
Reporter: Eric Yang
Assignee: Qiang Tian
 Attachments: HBase-11890-doc.patch, HBase-11890-master.patch


 HBase REST Client executePathOnly only supports http.  It would be nice if 
 there is a option to enable REST API client to connect through SSL.  
 org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which 
 protocol can be used, we can either set flag in Cluster class or introduce a 
 parameter in Client class to toggle SSL.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11890) HBase REST Client is hard coded to http protocol

2014-09-24 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14146180#comment-14146180
 ] 

Qiang Tian commented on HBASE-11890:


Hi [~eyang] [~apurtell],
Sorry for the late response. Attached the master patch (tested that it 
applies directly to 0.98) and a doc update.
Tested manually (small change, mostly related to environment configuration).

 HBase REST Client is hard coded to http protocol
 

 Key: HBASE-11890
 URL: https://issues.apache.org/jira/browse/HBASE-11890
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 0.96.2
Reporter: Eric Yang
Assignee: Qiang Tian
 Attachments: HBase-11890-doc.patch, HBase-11890-master.patch


 HBase REST Client executePathOnly only supports http.  It would be nice if 
 there is a option to enable REST API client to connect through SSL.  
 org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which 
 protocol can be used, we can either set flag in Cluster class or introduce a 
 parameter in Client class to toggle SSL.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11890) HBase REST Client is hard coded to http protocol

2014-09-24 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11890:
---
Status: Open  (was: Patch Available)

 HBase REST Client is hard coded to http protocol
 

 Key: HBASE-11890
 URL: https://issues.apache.org/jira/browse/HBASE-11890
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 0.96.2
Reporter: Eric Yang
Assignee: Qiang Tian
 Attachments: HBase-11890-doc.patch, HBase-11890-master.patch


 HBase REST Client executePathOnly only supports http.  It would be nice if 
 there is a option to enable REST API client to connect through SSL.  
 org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which 
 protocol can be used, we can either set flag in Cluster class or introduce a 
 parameter in Client class to toggle SSL.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11890) HBase REST Client is hard coded to http protocol

2014-09-24 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11890:
---
Status: Patch Available  (was: Open)

 HBase REST Client is hard coded to http protocol
 

 Key: HBASE-11890
 URL: https://issues.apache.org/jira/browse/HBASE-11890
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 0.96.2
Reporter: Eric Yang
Assignee: Qiang Tian
 Attachments: HBase-11890-doc.patch, HBase-11890-master-v1.patch, 
 HBase-11890-master.patch


 HBase REST Client executePathOnly only supports http.  It would be nice if 
 there is a option to enable REST API client to connect through SSL.  
 org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which 
 protocol can be used, we can either set flag in Cluster class or introduce a 
 parameter in Client class to toggle SSL.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HBASE-11890) HBase REST Client is hard coded to http protocol

2014-09-24 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11890:
---
Attachment: HBase-11890-master-v1.patch

add the javadoc

 HBase REST Client is hard coded to http protocol
 

 Key: HBASE-11890
 URL: https://issues.apache.org/jira/browse/HBASE-11890
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 0.96.2
Reporter: Eric Yang
Assignee: Qiang Tian
 Attachments: HBase-11890-doc.patch, HBase-11890-master-v1.patch, 
 HBase-11890-master.patch


 HBase REST Client executePathOnly only supports http.  It would be nice if 
 there is a option to enable REST API client to connect through SSL.  
 org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which 
 protocol can be used, we can either set flag in Cluster class or introduce a 
 parameter in Client class to toggle SSL.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (HBASE-11890) HBase REST Client is hard coded to http protocol

2014-09-10 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian reassigned HBASE-11890:
--

Assignee: Qiang Tian

 HBase REST Client is hard coded to http protocol
 

 Key: HBASE-11890
 URL: https://issues.apache.org/jira/browse/HBASE-11890
 Project: HBase
  Issue Type: Bug
  Components: Client
Affects Versions: 0.96.2
Reporter: Eric Yang
Assignee: Qiang Tian

 HBase REST Client executePathOnly only supports http.  It would be nice if 
 there is a option to enable REST API client to connect through SSL.  
 org.apache.hadoop.hbase.rest.client.Cluster class does not indicate which 
 protocol can be used, we can either set flag in Cluster class or introduce a 
 parameter in Client class to toggle SSL.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HBASE-11812) duplicate code in SimpleRpcScheduler#SimpleRpcScheduler

2014-08-26 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14111696#comment-14111696
 ] 

Qiang Tian commented on HBASE-11812:


Hi [~stack],
no, thanks! :-)

 duplicate code in SimpleRpcScheduler#SimpleRpcScheduler
 ---

 Key: HBASE-11812
 URL: https://issues.apache.org/jira/browse/HBASE-11812
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.5
Reporter: Qiang Tian
Priority: Minor
 Attachments: HBASE-11812-0.98.patch


 in 0.98.5 SimpleRpcScheduler#SimpleRpcScheduler:
 {code}
  int maxQueueLength = conf.getInt(CALL_QUEUE_MAX_LENGTH_CONF_KEY,
      conf.getInt("ipc.server.max.callqueue.length",
          handlerCount * RpcServer.DEFAULT_MAX_CALLQUEUE_LENGTH_PER_HANDLER));

  float callqReadShare = conf.getFloat(CALL_QUEUE_READ_SHARE_CONF_KEY,
      conf.getFloat("ipc.server.callqueue.read.share", 0));
  float callQueuesHandlersFactor =
      conf.getFloat(CALL_QUEUE_HANDLER_FACTOR_CONF_KEY,
          conf.getFloat("ipc.server.callqueue.handler.factor", 0));
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse

2014-08-24 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108382#comment-14108382
 ] 

Qiang Tian commented on HBASE-11813:



oops.. it already points to line 210 (got a fever, brain is not so clear).
Thanks Stack

 CellScanner#advance may infinitely recurse
 --

 Key: HBASE-11813
 URL: https://issues.apache.org/jira/browse/HBASE-11813
 Project: HBase
  Issue Type: Bug
Reporter: Andrew Purtell
Assignee: stack
Priority: Blocker
 Fix For: 0.99.0, 2.0.0, 0.98.6

 Attachments: 11813.098.txt, 11813.098.txt, 11813.master.txt


 On user@hbase, johannes.schab...@visual-meta.com reported:
 {quote}
 we face a serious issue with our HBase production cluster for two days now. 
 Every couple minutes, a random RegionServer gets stuck and does not process 
 any requests. In addition this causes the other RegionServers to freeze 
 within a minute which brings down the entire cluster. Stopping the affected 
 RegionServer unblocks the cluster and everything comes back to normal.
 {quote}
 Subsequent troubleshooting reveals that RPC is getting stuck because we are 
 losing RPC handlers. In the .out files we have this:
 {noformat}
 Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
 java.lang.StackOverflowError
 at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
 at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
 at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
 at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
 [...]
 Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020"
 java.lang.StackOverflowError
 {noformat}
 That is the anonymous CellScanner instance we create from 
 CellUtil#createCellScanner:
 {code}
 return new CellScanner() {
   private final Iterator<? extends CellScannable> iterator =
       cellScannerables.iterator();
   private CellScanner cellScanner = null;
   @Override
   public Cell current() {
     return this.cellScanner != null ? this.cellScanner.current() : null;
   }
   @Override
   public boolean advance() throws IOException {
     if (this.cellScanner == null) {
       if (!this.iterator.hasNext()) return false;
       this.cellScanner = this.iterator.next().cellScanner();
     }
     if (this.cellScanner.advance()) return true;
     this.cellScanner = null;
     return advance();
   }
 };
 {code}
 That final return statement is the immediate problem.
 We should also fix this so the RegionServer aborts if it loses a handler to 
 an Error. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (HBASE-11812) duplicate code in SimpleRpcScheduler#SimpleRpcScheduler

2014-08-23 Thread Qiang Tian (JIRA)
Qiang Tian created HBASE-11812:
--

 Summary: duplicate code in SimpleRpcScheduler#SimpleRpcScheduler
 Key: HBASE-11812
 URL: https://issues.apache.org/jira/browse/HBASE-11812
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.5
Reporter: Qiang Tian
Priority: Minor



in 0.98.5 SimpleRpcScheduler#SimpleRpcScheduler:
{code}
int maxQueueLength = conf.getInt(CALL_QUEUE_MAX_LENGTH_CONF_KEY,
    conf.getInt("ipc.server.max.callqueue.length",
        handlerCount * RpcServer.DEFAULT_MAX_CALLQUEUE_LENGTH_PER_HANDLER));

float callqReadShare = conf.getFloat(CALL_QUEUE_READ_SHARE_CONF_KEY,
    conf.getFloat("ipc.server.callqueue.read.share", 0));

float callQueuesHandlersFactor =
    conf.getFloat(CALL_QUEUE_HANDLER_FACTOR_CONF_KEY,
        conf.getFloat("ipc.server.callqueue.handler.factor", 0));
{code}




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-11812) duplicate code in SimpleRpcScheduler#SimpleRpcScheduler

2014-08-23 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11812:
---

Attachment: HBASE-11812-0.98.patch

master branch is fine.

 duplicate code in SimpleRpcScheduler#SimpleRpcScheduler
 ---

 Key: HBASE-11812
 URL: https://issues.apache.org/jira/browse/HBASE-11812
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.5
Reporter: Qiang Tian
Priority: Minor
 Attachments: HBASE-11812-0.98.patch


 in 0.98.5 SimpleRpcScheduler#SimpleRpcScheduler:
 {code}
  int maxQueueLength = conf.getInt(CALL_QUEUE_MAX_LENGTH_CONF_KEY,
      conf.getInt("ipc.server.max.callqueue.length",
          handlerCount * RpcServer.DEFAULT_MAX_CALLQUEUE_LENGTH_PER_HANDLER));

  float callqReadShare = conf.getFloat(CALL_QUEUE_READ_SHARE_CONF_KEY,
      conf.getFloat("ipc.server.callqueue.read.share", 0));
  float callQueuesHandlersFactor =
      conf.getFloat(CALL_QUEUE_HANDLER_FACTOR_CONF_KEY,
          conf.getFloat("ipc.server.callqueue.handler.factor", 0));
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-11813) CellScanner#advance may infinitely recurse

2014-08-23 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108251#comment-14108251
 ] 

Qiang Tian commented on HBASE-11813:


I'd suspect this one:

{code}
  /**
   * Flatten the map of cells out under the CellScanner
   * @param map Map of Cell Lists; for example, the map of families to Cells
   *   that is used inside Put, etc., keeping Cells organized by family.
   * @return CellScanner interface over <code>cellIterable</code>
   */
  public static CellScanner createCellScanner(final NavigableMap<byte[],
      List<Cell>> map) {
    return new CellScanner() {
      private final Iterator<Entry<byte[], List<Cell>>> entries =
          map.entrySet().iterator();
      private Iterator<Cell> currentIterator = null;
      private Cell currentCell;

      @Override
      public Cell current() {
        return this.currentCell;
      }

      @Override
      public boolean advance() {
        if (this.currentIterator == null) {
          if (!this.entries.hasNext()) return false;
          this.currentIterator = this.entries.next().getValue().iterator();
        }
        if (this.currentIterator.hasNext()) {
          this.currentCell = this.currentIterator.next();
          return true;
        }
        this.currentCell = null;
        this.currentIterator = null;
        return advance();
      }
    };
  }
{code}
It looks like the one Andrew mentioned would not trigger the advance method 
on the server side... while this other one is widely used in server-side 
code paths.. coprocessor or endpoint related..
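
Whichever instance is to blame, the straightforward fix is to turn the tail 
call into a loop; a sketch (not the committed patch):
{code}
// iterative advance(): the same logic, but a while loop instead of
// recursion, so a long run of empty CellScannables cannot overflow the stack
@Override
public boolean advance() throws IOException {
  while (true) {
    if (this.cellScanner == null) {
      if (!this.iterator.hasNext()) return false;
      this.cellScanner = this.iterator.next().cellScanner();
    }
    if (this.cellScanner.advance()) return true;
    this.cellScanner = null;  // exhausted; loop on to the next scannable
  }
}
{code}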

 CellScanner#advance may infinitely recurse
 --

 Key: HBASE-11813
 URL: https://issues.apache.org/jira/browse/HBASE-11813
 Project: HBase
  Issue Type: Bug
Reporter: Andrew Purtell
Priority: Blocker
 Fix For: 0.99.0, 2.0.0, 0.98.6


 On user@hbase, johannes.schab...@visual-meta.com reported:
 {quote}
 we face a serious issue with our HBase production cluster for two days now. 
 Every couple minutes, a random RegionServer gets stuck and does not process 
 any requests. In addition this causes the other RegionServers to freeze 
 within a minute which brings down the entire cluster. Stopping the affected 
 RegionServer unblocks the cluster and everything comes back to normal.
 {quote}
 Subsequent troubleshooting reveals that RPC is getting stuck because we are 
 losing RPC handlers. In the .out files we have this:
 {noformat}
 Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
 java.lang.StackOverflowError
 at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
 at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
 at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
 at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
 [...]
 Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020"
 java.lang.StackOverflowError
 Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020"
 java.lang.StackOverflowError
 {noformat}
 That is the anonymous CellScanner instance we create from 
 CellUtil#createCellScanner:
 {code}
 return new CellScanner() {
   private final Iterator<? extends CellScannable> iterator =
       cellScannerables.iterator();
   private CellScanner cellScanner = null;
   @Override
   public Cell current() {
     return this.cellScanner != null ? this.cellScanner.current() : null;
   }
   @Override
   public boolean advance() throws IOException {
     if (this.cellScanner == null) {
       if (!this.iterator.hasNext()) return false;
       this.cellScanner = this.iterator.next().cellScanner();
     }
     if (this.cellScanner.advance()) return true;
 

[jira] [Commented] (HBASE-11812) duplicate code in SimpleRpcScheduler#SimpleRpcScheduler

2014-08-23 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14108253#comment-14108253
 ] 

Qiang Tian commented on HBASE-11812:


oops, thanks Matteo!

 duplicate code in SimpleRpcScheduler#SimpleRpcScheduler
 ---

 Key: HBASE-11812
 URL: https://issues.apache.org/jira/browse/HBASE-11812
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.5
Reporter: Qiang Tian
Priority: Minor
 Attachments: HBASE-11812-0.98.patch


 in 0.98.5 SimpleRpcScheduler#SimpleRpcScheduler:
 {code}
  int maxQueueLength = conf.getInt(CALL_QUEUE_MAX_LENGTH_CONF_KEY,
      conf.getInt("ipc.server.max.callqueue.length",
          handlerCount * RpcServer.DEFAULT_MAX_CALLQUEUE_LENGTH_PER_HANDLER));

  float callqReadShare = conf.getFloat(CALL_QUEUE_READ_SHARE_CONF_KEY,
      conf.getFloat("ipc.server.callqueue.read.share", 0));
  float callQueuesHandlersFactor =
      conf.getFloat(CALL_QUEUE_HANDLER_FACTOR_CONF_KEY,
          conf.getFloat("ipc.server.callqueue.handler.factor", 0));
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-11673) TestIOFencing#testFencingAroundCompactionAfterWALSync fails

2014-08-12 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14095066#comment-14095066
 ] 

Qiang Tian commented on HBASE-11673:


Hi [~mantonov], I just ran it, it passed.
thanks.

 TestIOFencing#testFencingAroundCompactionAfterWALSync fails
 ---

 Key: HBASE-11673
 URL: https://issues.apache.org/jira/browse/HBASE-11673
 Project: HBase
  Issue Type: Test
Reporter: Qiang Tian
Assignee: Sergey Soldatov
 Fix For: 2.0.0

 Attachments: HBASE_11673-v1.patch, 
 testFencingAroundCompactionAfterWALSync.tar.gz


 got several test failures on the latest build:
 {quote}
 [tianq@bdvm101 surefire-reports]$ ls -1t | grep "Tests run" * | grep "FAILURE"
 org.apache.hadoop.hbase.client.TestReplicasClient.txt:Tests run: 1, Failures: 
 0, Errors: 1, Skipped: 0, Time elapsed: 38.706 sec <<< FAILURE!
 org.apache.hadoop.hbase.master.TestMasterOperationsForRegionReplicas.txt:Tests
  run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 30.669 sec <<< 
 FAILURE!
 org.apache.hadoop.hbase.regionserver.TestRegionReplicas.txt:Tests run: 1, 
 Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 39.113 sec <<< FAILURE!
 org.apache.hadoop.hbase.TestIOFencing.txt:Tests run: 2, Failures: 1, Errors: 
 0, Skipped: 0, Time elapsed: 177.071 sec <<< FAILURE!
 {quote} 
 the first one:
 {quote} 
 <failure message="Timed out waiting for the region to flush" 
 type="java.lang.AssertionError">java.lang.AssertionError: Timed out waiting 
 for the region to flush
 at org.junit.Assert.fail(Assert.java:88)
 at org.junit.Assert.assertTrue(Assert.java:41)
 at org.apache.hadoop.hbase.TestIOFencing.doTest(TestIOFencing.java:291)
 at 
 org.apache.hadoop.hbase.TestIOFencing.testFencingAroundCompactionAfterWALSync(TestIOFencing.java:236)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 {quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-11 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092574#comment-14092574
 ] 

Qiang Tian commented on HBASE-11714:


Thanks [~nkeywal]! Good to know. :-)
I did not find it.. perhaps pasting the error info helps people google it?



 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
 -

 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.3
Reporter: Qiang Tian
Assignee: Qiang Tian
 Attachments: hbase-11714-0.98.patch


 Discussed on the user@hbase mailing list 
 (http://markmail.org/thread/w3cqjxwo2smkn2jw)
 {quote}
 Recently switched from 0.94 and 0.98, and finding that periodically things
 are having issues - lots of retry exceptions :
 {quote}
 client log:
 {quote}
 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
 table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
 last exception: java.net.SocketTimeoutException: Call to
 ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
 because java.net.SocketTimeoutException: 2000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
 ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
 started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
 ops.
 {quote}
 analysis:
 there are 2 methods in RpcRetryingCaller: callWithRetries and 
 callWithoutRetries.
 it looks the timeout setup of callWithRetries is good, while 
 callWithoutRetries is wrong(multi RPC for this user): caller cannot specify a 
 valid timeout, but callWithoutRetries still calls beforeCall, which looks a 
 method for callWithRetries only,  to set timeout. since 
 RpcRetryingCaller#callTimeout  is not set, thread local timeout is set to 
 2s(MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final 
 pinginterval set to the socket.
 when there are heavy write workload and the rpc cannot complete in 2s, the 
 client close the connection, so the server side connection is reset and 
 finally exposes the problem in HBASE-11705



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-11 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian resolved HBASE-11714.


   Resolution: Duplicate
Fix Version/s: 0.98.4

 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
 -

 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.3
Reporter: Qiang Tian
Assignee: Qiang Tian
 Fix For: 0.98.4

 Attachments: hbase-11714-0.98.patch


 Discussed on the user@hbase mailing list 
 (http://markmail.org/thread/w3cqjxwo2smkn2jw)
 {quote}
 Recently switched from 0.94 and 0.98, and finding that periodically things
 are having issues - lots of retry exceptions :
 {quote}
 client log:
 {quote}
 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
 table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
 last exception: java.net.SocketTimeoutException: Call to
 ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
 because java.net.SocketTimeoutException: 2000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
 ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
 started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
 ops.
 {quote}
 analysis:
 there are 2 methods in RpcRetryingCaller: callWithRetries and 
 callWithoutRetries.
 it looks the timeout setup of callWithRetries is good, while 
 callWithoutRetries is wrong(multi RPC for this user): caller cannot specify a 
 valid timeout, but callWithoutRetries still calls beforeCall, which looks a 
 method for callWithRetries only,  to set timeout. since 
 RpcRetryingCaller#callTimeout  is not set, thread local timeout is set to 
 2s(MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final 
 pinginterval set to the socket.
 when there are heavy write workload and the rpc cannot complete in 2s, the 
 client close the connection, so the server side connection is reset and 
 finally exposes the problem in HBASE-11705



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-10 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092410#comment-14092410
 ] 

Qiang Tian commented on HBASE-11714:


ping [~apurtell], thanks!

 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
 -

 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.3
Reporter: Qiang Tian
Assignee: Qiang Tian
 Attachments: hbase-11714-0.98.patch


 Discussed on the user@hbase mailing list 
 (http://markmail.org/thread/w3cqjxwo2smkn2jw)
 Recently switched from 0.94 and 0.98, and finding that periodically things
 are having issues - lots of retry exceptions :
 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
 table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
 last exception: java.net.SocketTimeoutException: Call to
 ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
 because java.net.SocketTimeoutException: 2000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
 ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
 started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
 ops.
 there are 2 methods in RpcRetryingCaller: callWithRetries and 
 callWithoutRetries.
 it looks the timeout setup of callWithRetries is good, while 
 callWithoutRetries is wrong(multi RPC for this user): caller cannot specify a 
 valid timeout, but callWithoutRetries still calls beforeCall, which looks a 
 method for callWithRetries only,  to set timeout. since 
 RpcRetryingCaller#callTimeout  is not set, thread local timeout is set to 
 2s(MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final 
 pinginterval set to the socket.
 when there are heavy write workload and the rpc cannot complete in 2s, the 
 client close the connection, so the server side connection is reset and 
 finally exposes the problem in HBASE-11705



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-10 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14092416#comment-14092416
 ] 

Qiang Tian commented on HBASE-11714:


Search hit:
http://stackoverflow.com/questions/25038030/sockettimeoutexception-emerges-when-insert-millions-of-records-to-hbase
I'd think the one below is also a hit:
https://issues.apache.org/jira/browse/PHOENIX-998


 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
 -

 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.3
Reporter: Qiang Tian
Assignee: Qiang Tian
 Attachments: hbase-11714-0.98.patch


 Discussed on the user@hbase mailing list 
 (http://markmail.org/thread/w3cqjxwo2smkn2jw)
 Recently switched from 0.94 and 0.98, and finding that periodically things
 are having issues - lots of retry exceptions :
 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
 table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
 last exception: java.net.SocketTimeoutException: Call to
 ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
 because java.net.SocketTimeoutException: 2000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
 ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
 started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
 ops.
 there are 2 methods in RpcRetryingCaller: callWithRetries and 
 callWithoutRetries.
 it looks the timeout setup of callWithRetries is good, while 
 callWithoutRetries is wrong(multi RPC for this user): caller cannot specify a 
 valid timeout, but callWithoutRetries still calls beforeCall, which looks a 
 method for callWithRetries only,  to set timeout. since 
 RpcRetryingCaller#callTimeout  is not set, thread local timeout is set to 
 2s(MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final 
 pinginterval set to the socket.
 when there are heavy write workload and the rpc cannot complete in 2s, the 
 client close the connection, so the server side connection is reset and 
 finally exposes the problem in HBASE-11705



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-10 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11714:
---

Description: 
Discussed on the user@hbase mailing list 
(http://markmail.org/thread/w3cqjxwo2smkn2jw)
{quote}
Recently switched from 0.94 and 0.98, and finding that periodically things
are having issues - lots of retry exceptions :
{quote}

client log:
{quote}
2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
last exception: java.net.SocketTimeoutException: Call to
ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
because java.net.SocketTimeoutException: 2000 millis timeout while waiting
for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
ops.
{quote}

analysis:
There are 2 methods in RpcRetryingCaller: callWithRetries and 
callWithoutRetries.
The timeout setup of callWithRetries looks good, while callWithoutRetries is 
wrong (the multi RPC for this user): the caller cannot specify a valid 
timeout, but callWithoutRetries still calls beforeCall, which looks like a 
method intended for callWithRetries only, to set the timeout. Since 
RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to 
2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which becomes the final 
ping interval set on the socket (sketched below).

When there is a heavy write workload and the rpc cannot complete in 2s, the 
client closes the connection, so the server-side connection is reset, which 
finally exposes the problem in HBASE-11705

  was:
Discussed on the user@hbase mailing list 
(http://markmail.org/thread/w3cqjxwo2smkn2jw)

Recently switched from 0.94 and 0.98, and finding that periodically things
are having issues - lots of retry exceptions :

2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
last exception: java.net.SocketTimeoutException: Call to
ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
because java.net.SocketTimeoutException: 2000 millis timeout while waiting
for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
ops.

there are 2 methods in RpcRetryingCaller: callWithRetries and 
callWithoutRetries.
it looks the timeout setup of callWithRetries is good, while callWithoutRetries 
is wrong(multi RPC for this user): caller cannot specify a valid timeout, but 
callWithoutRetries still calls beforeCall, which looks a method for 
callWithRetries only,  to set timeout. since RpcRetryingCaller#callTimeout  is 
not set, thread local timeout is set to 2s(MIN_RPC_TIMEOUT) via 
RpcClient.setRpcTimeout, which is the final pinginterval set to the socket.

when there are heavy write workload and the rpc cannot complete in 2s, the 
client close the connection, so the server side connection is reset and finally 
exposes the problem in HBASE-11705
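
Sketching the analysis above: a simplified view of the suspected flow, not 
the actual 0.98 source; the method bodies and the guard are assumptions for 
illustration:
{code}
// simplified sketch of RpcRetryingCaller#beforeCall (MIN_RPC_TIMEOUT = 2000 ms)
void beforeCall() {
  // callTimeout is 0 when the caller came in via callWithoutRetries, so
  // 'remaining' is negative and the 2s floor always wins
  int remaining = (int) (callTimeout
      - (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime));
  RpcClient.setRpcTimeout(Math.max(remaining, MIN_RPC_TIMEOUT));
}

// one possible guard (an assumption, not necessarily the committed fix):
void beforeCallGuarded() {
  if (this.callTimeout <= 0) return;  // no timeout requested by the caller;
                                      // leave the thread-local value alone
  int remaining = (int) (callTimeout
      - (EnvironmentEdgeManager.currentTimeMillis() - this.globalStartTime));
  RpcClient.setRpcTimeout(Math.max(remaining, MIN_RPC_TIMEOUT));
}
{code}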


 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
 -

 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.3
Reporter: Qiang Tian
Assignee: Qiang Tian
 Attachments: hbase-11714-0.98.patch


 Discussed on the user@hbase mailing list 
 (http://markmail.org/thread/w3cqjxwo2smkn2jw)
 {quote}
 Recently switched from 0.94 and 0.98, and finding that periodically things
 are having issues - lots of retry exceptions :
 {quote}
 client log:
 {quote}
 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
 table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
 last exception: java.net.SocketTimeoutException: Call to
 ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
 because java.net.SocketTimeoutException: 2000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
 ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
 started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
 ops.
 {quote}
 analysis:
 there are 2 methods in RpcRetryingCaller: 

[jira] [Created] (HBASE-11714) batch mutationRpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-09 Thread Qiang Tian (JIRA)
Qiang Tian created HBASE-11714:
--

 Summary: batch mutationRpcRetryingCaller#callWithoutRetries set 
rpc timeout to 2 seconds incorrectly
 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
Reporter: Qiang Tian






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-09 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11714:
---

Summary: RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds 
incorrectly  (was: batch mutationRpcRetryingCaller#callWithoutRetries set rpc 
timeout to 2 seconds incorrectly)

 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
 -

 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
Reporter: Qiang Tian





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-09 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11714:
---

Description: 
Discussed on the user@hbase mailing list 
(http://markmail.org/thread/w3cqjxwo2smkn2jw)

Recently switched from 0.94 and 0.98, and finding that periodically things
are having issues - lots of retry exceptions :

2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
last exception: java.net.SocketTimeoutException: Call to
ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
because java.net.SocketTimeoutException: 2000 millis timeout while waiting
for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
ops.

there are 2 methods in RpcRetryingCaller: callWithRetries and 
callWithoutRetries.
it looks the timeout setup of callWithRetries is good, while callWithoutRetries 
is wrong(multi RPC for this user): caller cannot specify a valid timeout, but 
callWithoutRetries still calls beforeCall, which looks a method for 
callWithRetries only,  to set timeout. since RpcRetryingCaller#callTimeout  is 
not set, thread local timeout is set to 2s(MIN_RPC_TIMEOUT) via 
RpcClient.setRpcTimeout, which is the final pinginterval set to the socket.

when there are heavy write workload and the rpc cannot complete in 2s, the 
client close the connection, so the server side connection is reset and finally 
cause problem in HBASE-11705

 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
 -

 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
Reporter: Qiang Tian

 Discussed on the user@hbase mailing list 
 (http://markmail.org/thread/w3cqjxwo2smkn2jw)
 Recently switched from 0.94 and 0.98, and finding that periodically things
 are having issues - lots of retry exceptions :
 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
 table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
 last exception: java.net.SocketTimeoutException: Call to
 ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
 because java.net.SocketTimeoutException: 2000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
 ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
 started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
 ops.
 there are 2 methods in RpcRetryingCaller: callWithRetries and 
 callWithoutRetries.
 it looks the timeout setup of callWithRetries is good, while 
 callWithoutRetries is wrong(multi RPC for this user): caller cannot specify a 
 valid timeout, but callWithoutRetries still calls beforeCall, which looks a 
 method for callWithRetries only,  to set timeout. since 
 RpcRetryingCaller#callTimeout  is not set, thread local timeout is set to 
 2s(MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final 
 pinginterval set to the socket.
 when there are heavy write workload and the rpc cannot complete in 2s, the 
 client close the connection, so the server side connection is reset and 
 finally cause problem in HBASE-11705



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-09 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian reassigned HBASE-11714:
--

Assignee: Qiang Tian

 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
 -

 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
Reporter: Qiang Tian
Assignee: Qiang Tian

 Discussed on the user@hbase mailing list 
 (http://markmail.org/thread/w3cqjxwo2smkn2jw)
 Recently switched from 0.94 and 0.98, and finding that periodically things
 are having issues - lots of retry exceptions :
 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
 table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
 last exception: java.net.SocketTimeoutException: Call to
 ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
 because java.net.SocketTimeoutException: 2000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
 ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
 started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
 ops.
 there are 2 methods in RpcRetryingCaller: callWithRetries and 
 callWithoutRetries.
 it looks the timeout setup of callWithRetries is good, while 
 callWithoutRetries is wrong(multi RPC for this user): caller cannot specify a 
 valid timeout, but callWithoutRetries still calls beforeCall, which looks a 
 method for callWithRetries only,  to set timeout. since 
 RpcRetryingCaller#callTimeout  is not set, thread local timeout is set to 
 2s(MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final 
 pinginterval set to the socket.
 when there are heavy write workload and the rpc cannot complete in 2s, the 
 client close the connection, so the server side connection is reset and 
 finally cause problem in HBASE-11705



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-09 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11714:
---

Affects Version/s: 0.98.3

 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
 -

 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.3
Reporter: Qiang Tian
Assignee: Qiang Tian

 Discussed on the user@hbase mailing list 
 (http://markmail.org/thread/w3cqjxwo2smkn2jw)
 Recently switched from 0.94 and 0.98, and finding that periodically things
 are having issues - lots of retry exceptions :
 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
 table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
 last exception: java.net.SocketTimeoutException: Call to
 ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
 because java.net.SocketTimeoutException: 2000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
 ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
 started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
 ops.
 there are 2 methods in RpcRetryingCaller: callWithRetries and 
 callWithoutRetries.
 it looks the timeout setup of callWithRetries is good, while 
 callWithoutRetries is wrong(multi RPC for this user): caller cannot specify a 
 valid timeout, but callWithoutRetries still calls beforeCall, which looks a 
 method for callWithRetries only,  to set timeout. since 
 RpcRetryingCaller#callTimeout  is not set, thread local timeout is set to 
 2s(MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final 
 pinginterval set to the socket.
 when there are heavy write workload and the rpc cannot complete in 2s, the 
 client close the connection, so the server side connection is reset and 
 finally cause problem in HBASE-11705



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-09 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11714:
---

Component/s: IPC/RPC

 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
 -

 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.3
Reporter: Qiang Tian
Assignee: Qiang Tian

 Discussed on the user@hbase mailing list 
 (http://markmail.org/thread/w3cqjxwo2smkn2jw)
 Recently switched from 0.94 and 0.98, and finding that periodically things
 are having issues - lots of retry exceptions :
 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
 table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
 last exception: java.net.SocketTimeoutException: Call to
 ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
 because java.net.SocketTimeoutException: 2000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
 ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
 started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
 ops.
 there are 2 methods in RpcRetryingCaller: callWithRetries and 
 callWithoutRetries.
 it looks the timeout setup of callWithRetries is good, while 
 callWithoutRetries is wrong(multi RPC for this user): caller cannot specify a 
 valid timeout, but callWithoutRetries still calls beforeCall, which looks a 
 method for callWithRetries only,  to set timeout. since 
 RpcRetryingCaller#callTimeout  is not set, thread local timeout is set to 
 2s(MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final 
 pinginterval set to the socket.
 when there are heavy write workload and the rpc cannot complete in 2s, the 
 client close the connection, so the server side connection is reset and 
 finally cause problem in HBASE-11705



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-09 Thread Qiang Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14091707#comment-14091707
 ] 

Qiang Tian commented on HBASE-11714:


Clicked the wrong button before filling in the info... just added more info.


 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
 -

 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.3
Reporter: Qiang Tian
Assignee: Qiang Tian

 Discussed on the user@hbase mailing list 
 (http://markmail.org/thread/w3cqjxwo2smkn2jw)
 Recently switched from 0.94 and 0.98, and finding that periodically things
 are having issues - lots of retry exceptions :
 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
 table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
 last exception: java.net.SocketTimeoutException: Call to
 ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
 because java.net.SocketTimeoutException: 2000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
 ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
 started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
 ops.
 there are 2 methods in RpcRetryingCaller: callWithRetries and 
 callWithoutRetries.
 it looks the timeout setup of callWithRetries is good, while 
 callWithoutRetries is wrong(multi RPC for this user): caller cannot specify a 
 valid timeout, but callWithoutRetries still calls beforeCall, which looks a 
 method for callWithRetries only,  to set timeout. since 
 RpcRetryingCaller#callTimeout  is not set, thread local timeout is set to 
 2s(MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final 
 pinginterval set to the socket.
 when there are heavy write workload and the rpc cannot complete in 2s, the 
 client close the connection, so the server side connection is reset and 
 finally cause problem in HBASE-11705



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-09 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11714:
---

Attachment: hbase-11714-0.98.patch

 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
 -

 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.3
Reporter: Qiang Tian
Assignee: Qiang Tian
 Attachments: hbase-11714-0.98.patch


 Discussed on the user@hbase mailing list 
 (http://markmail.org/thread/w3cqjxwo2smkn2jw)
 Recently switched from 0.94 and 0.98, and finding that periodically things
 are having issues - lots of retry exceptions :
 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
 table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
 last exception: java.net.SocketTimeoutException: Call to
 ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
 because java.net.SocketTimeoutException: 2000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
 ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
 started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
 ops.
 there are 2 methods in RpcRetryingCaller: callWithRetries and 
 callWithoutRetries.
 it looks the timeout setup of callWithRetries is good, while 
 callWithoutRetries is wrong(multi RPC for this user): caller cannot specify a 
 valid timeout, but callWithoutRetries still calls beforeCall, which looks a 
 method for callWithRetries only,  to set timeout. since 
 RpcRetryingCaller#callTimeout  is not set, thread local timeout is set to 
 2s(MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which is the final 
 pinginterval set to the socket.
 when there are heavy write workload and the rpc cannot complete in 2s, the 
 client close the connection, so the server side connection is reset and 
 finally cause problem in HBASE-11705



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (HBASE-11714) RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly

2014-08-09 Thread Qiang Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-11714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiang Tian updated HBASE-11714:
---

Description: 
Discussed on the user@hbase mailing list 
(http://markmail.org/thread/w3cqjxwo2smkn2jw)

Recently switched from 0.94 to 0.98, and finding that periodically things
are having issues - lots of retry exceptions:

2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
last exception: java.net.SocketTimeoutException: Call to
ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
because java.net.SocketTimeoutException: 2000 millis timeout while waiting
for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
ops.

There are two methods in RpcRetryingCaller: callWithRetries and
callWithoutRetries.
The timeout setup of callWithRetries looks good, but callWithoutRetries (the
multi RPC path in this user's case) is wrong: the caller cannot specify a valid
timeout, yet callWithoutRetries still calls beforeCall, a method that appears
intended for callWithRetries only, to set the timeout. Since
RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to
2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which ends up as the ping
interval applied to the socket.

Under a heavy write workload, when an RPC cannot complete within 2s, the
client closes the connection; the server-side connection is then reset, which
finally exposes the problem in HBASE-11705.

  was:
Discussed on the user@hbase mailing list 
(http://markmail.org/thread/w3cqjxwo2smkn2jw)

Recently switched from 0.94 to 0.98, and finding that periodically things
are having issues - lots of retry exceptions:

2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
last exception: java.net.SocketTimeoutException: Call to
ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
because java.net.SocketTimeoutException: 2000 millis timeout while waiting
for channel to be ready for read. ch :
java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
ops.

There are two methods in RpcRetryingCaller: callWithRetries and
callWithoutRetries.
The timeout setup of callWithRetries looks good, but callWithoutRetries (the
multi RPC path in this user's case) is wrong: the caller cannot specify a valid
timeout, yet callWithoutRetries still calls beforeCall, a method that appears
intended for callWithRetries only, to set the timeout. Since
RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to
2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which ends up as the ping
interval applied to the socket.

Under a heavy write workload, when an RPC cannot complete within 2s, the
client closes the connection; the server-side connection is then reset, which
finally causes the problem in HBASE-11705.


 RpcRetryingCaller#callWithoutRetries set rpc timeout to 2 seconds incorrectly
 -

 Key: HBASE-11714
 URL: https://issues.apache.org/jira/browse/HBASE-11714
 Project: HBase
  Issue Type: Bug
  Components: IPC/RPC
Affects Versions: 0.98.3
Reporter: Qiang Tian
Assignee: Qiang Tian
 Attachments: hbase-11714-0.98.patch


 Discussed on the user@hbase mailing list 
 (http://markmail.org/thread/w3cqjxwo2smkn2jw)
 Recently switched from 0.94 to 0.98, and finding that periodically things
 are having issues - lots of retry exceptions:
 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
 table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
 last exception: java.net.SocketTimeoutException: Call to
 ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
 because java.net.SocketTimeoutException: 2000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
 remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
 ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
 started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
 ops.
 There are two methods in RpcRetryingCaller: callWithRetries and
 callWithoutRetries.
 The timeout setup of callWithRetries looks good, but callWithoutRetries (the
 multi RPC path in this user's case) is wrong: the caller cannot specify a
 valid timeout, yet callWithoutRetries still calls beforeCall, a method that
 appears intended for callWithRetries only, to set the timeout. Since
 RpcRetryingCaller#callTimeout is not set, the thread-local timeout is set to
 2s (MIN_RPC_TIMEOUT) via RpcClient.setRpcTimeout, which ends up as the ping
 interval applied to the socket.
 Under a heavy write workload, when an RPC cannot complete within 2s, the
 client closes the connection; the server-side connection is then reset, which
 finally exposes the problem in HBASE-11705.
