[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-20 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13258137#comment-13258137
 ] 

nkeywal commented on HBASE-5844:


Patch v1. Tested on a local cluster  pseudo distributed, by stopping the 
server by kill -9. I will do some minor improvements on the java code and then 
test on a real cluster, but I'm interested by a feedback on the script.


 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5729) Jenkins build failing; failsafe NPE'ing

2012-04-06 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248156#comment-13248156
 ] 

nkeywal commented on HBASE-5729:


For future reference if it happens again:

root exception was
{noformat}
Caused by: java.lang.NullPointerException
at 
org.apache.maven.surefire.util.DefaultDirectoryScanner.processIncludesExcludes(DefaultDirectoryScanner.java:136)
at 
org.apache.maven.surefire.util.DefaultDirectoryScanner.collectTests(DefaultDirectoryScanner.java:113)
at 
org.apache.maven.surefire.util.DefaultDirectoryScanner.locateTestClasses(DefaultDirectoryScanner.java:62)
at 
org.apache.maven.surefire.junitcore.JUnitCoreProvider.scanClassPath(JUnitCoreProvider.java:188)
at 
org.apache.maven.surefire.junitcore.JUnitCoreProvider.getSuitesAsList(JUnitCoreProvider.java:145)
at 
org.apache.maven.surefire.junitcore.JUnitCoreProvider.getSuites(JUnitCoreProvider.java:103)
{noformat}

Initial error was:

{noformat}
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-failsafe-plugin:2.12-TRUNK-HBASE-2:integration-test
 (integration-test) on project hbase: Execution integration-test of goal 
org.apache.maven.plugins:maven-failsafe-plugin:2.12-TRUNK-HBASE-2:integration-test
 failed: java.lang.reflect.InvocationTargetException; nested exception is 
java.lang.reflect.InvocationTargetException: null: NullPointerException - 
[Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
org.apache.maven.plugins:maven-failsafe-plugin:2.12-TRUNK-HBASE-2:integration-test
 (integration-test) on project hbase: Execution integration-test of goal 
org.apache.maven.plugins:maven-failsafe-plugin:2.12-TRUNK-HBASE-2:integration-test
 failed: java.lang.reflect.InvocationTargetException; nested exception is 
java.lang.reflect.InvocationTargetException: null
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:225)
{noformat}

 Jenkins build failing; failsafe NPE'ing
 ---

 Key: HBASE-5729
 URL: https://issues.apache.org/jira/browse/HBASE-5729
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: stack
Priority: Blocker

 Builds up on jenkins have been failing over the last few days.  Looking at it 
 w/ nkeyway, its kinda odd.  I ran exact command locally as did N and it works 
 fine.  I removed all of my repo and still works.  N looked at surefire 
 source.  Its the includes that is coming back empty causing the NPE we see up 
 on jenkins.  Extra odd is that it does not seem like it a checkin of ours 
 that brought this on.  See here where its 'working' on 0.94 branch: 
 https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.94/76/  Then a 
 little later Ted triggers a build w/ no changes made: 
 https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.94/77/console
 Its failing running the integration test phase.  Let me mess around and try 
 and get it going again.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5729) Jenkins build failing; failsafe NPE'ing

2012-04-06 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13248243#comment-13248243
 ] 

nkeywal commented on HBASE-5729:


{noformat}
mvn -e  site install assembly:single  -Dtest.exclude.pattern=whatever
{noformat}

works as well. It seems that surefire wants the variables to be defined in
{noformat}
excludes
  exlude${unittest.include}/exlude
  exclude**/*$*/exclude
  exclude${test.exclude.pattern}/exclude
/excludes
{noformat}

So suppressing the line in the .pom sould be ok, even if it does not explain 
why it was working before.


 Jenkins build failing; failsafe NPE'ing
 ---

 Key: HBASE-5729
 URL: https://issues.apache.org/jira/browse/HBASE-5729
 Project: HBase
  Issue Type: Bug
Reporter: stack
Assignee: stack
Priority: Blocker

 Builds up on jenkins have been failing over the last few days.  Looking at it 
 w/ nkeyway, its kinda odd.  I ran exact command locally as did N and it works 
 fine.  I removed all of my repo and still works.  N looked at surefire 
 source.  Its the includes that is coming back empty causing the NPE we see up 
 on jenkins.  Extra odd is that it does not seem like it a checkin of ours 
 that brought this on.  See here where its 'working' on 0.94 branch: 
 https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.94/76/  Then a 
 little later Ted triggers a build w/ no changes made: 
 https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.94/77/console
 Its failing running the integration test phase.  Let me mess around and try 
 and get it going again.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5729) Jenkins build failing; failsafe NPE'ing

2012-04-05 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13247414#comment-13247414
 ] 

nkeywal commented on HBASE-5729:


locally, I reproduce the issue with 
{noformat}
mvn -e -X  site install assembly:single
{noformat}

 Jenkins build failing; failsafe NPE'ing
 ---

 Key: HBASE-5729
 URL: https://issues.apache.org/jira/browse/HBASE-5729
 Project: HBase
  Issue Type: Bug
Reporter: stack
Priority: Blocker

 Builds up on jenkins have been failing over the last few days.  Looking at it 
 w/ nkeyway, its kinda odd.  I ran exact command locally as did N and it works 
 fine.  I removed all of my repo and still works.  N looked at surefire 
 source.  Its the includes that is coming back empty causing the NPE we see up 
 on jenkins.  Extra odd is that it does not seem like it a checkin of ours 
 that brought this on.  See here where its 'working' on 0.94 branch: 
 https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.94/76/  Then a 
 little later Ted triggers a build w/ no changes made: 
 https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.94/77/console
 Its failing running the integration test phase.  Let me mess around and try 
 and get it going again.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)

2012-04-02 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1320#comment-1320
 ] 

nkeywal commented on HBASE-5682:


.bq none of ZKUtil methods actually throw exceptions
From what is see on 0.96 it should, as the return is not reached: the pattern 
is too call keeperException, and keeperException throws an exception.
{noformat}
  public void keeperException(KeeperException ke)
  throws KeeperException {
LOG.error(prefix(Received unexpected KeeperException, re-throwing 
exception), ke);
throw ke;
  }
{noformat}


 Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 
 only)
 --

 Key: HBASE-5682
 URL: https://issues.apache.org/jira/browse/HBASE-5682
 Project: HBase
  Issue Type: Improvement
  Components: client
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
Priority: Critical
 Fix For: 0.94.0

 Attachments: 5682-all-v2.txt, 5682-all-v3.txt, 5682-all.txt, 
 5682-v2.txt, 5682.txt


 Just realized that without this HBASE-4805 is broken.
 I.e. there's no point keeping a persistent HConnection around if it can be 
 rendered permanently unusable if the ZK connection is lost temporarily.
 Note that this is fixed in 0.96 with HBASE-5399 (but that seems to big to 
 backport)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5682) Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 only)

2012-04-02 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13244484#comment-13244484
 ] 

nkeywal commented on HBASE-5682:


In 0.96 this should work, with the restriction that the logic is that you can 
get a non working connection, that will get fixed when you try to use it. It's 
a different mechanism than the one for HBaseAdmin, as HBaseAdmin first check 
the connection. Thz ZK mechanism is more efficient (you save a remote call to 
check that the connection is really working), but is more complex.

However it seems it does not work at the end:
bq. What I saw in 0.96 is that the client was blocked for a very long time 
(gave up after a few minutes), even though I had set all timeouts to low 
values. This is also deadly in an app server setting. Might be a simple fix 
there, didn't dig deeper.

@lars What did you exactly do? I can do the fix it on 0.96.

 Allow HConnectionImplementation to recover from ZK connection loss (for 0.94 
 only)
 --

 Key: HBASE-5682
 URL: https://issues.apache.org/jira/browse/HBASE-5682
 Project: HBase
  Issue Type: Improvement
  Components: client
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
Priority: Critical
 Fix For: 0.94.0

 Attachments: 5682-all-v2.txt, 5682-all-v3.txt, 5682-all-v4.txt, 
 5682-all.txt, 5682-v2.txt, 5682.txt


 Just realized that without this HBASE-4805 is broken.
 I.e. there's no point keeping a persistent HConnection around if it can be 
 rendered permanently unusable if the ZK connection is lost temporarily.
 Note that this is fixed in 0.96 with HBASE-5399 (but that seems to big to 
 backport)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5693) When creating a region, the master initializes it and creates a memstore within the master server

2012-04-02 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13244519#comment-13244519
 ] 

nkeywal commented on HBASE-5693:


I don't think so. I didn't see them locally.




 When creating a region, the master initializes it and creates a memstore 
 within the master server
 -

 Key: HBASE-5693
 URL: https://issues.apache.org/jira/browse/HBASE-5693
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5593.v2.patch, 5693.v1.patch


 I didn't do a complete analysis, but the attached patch saves more than 0.25s 
 for each region creation and locally all the unit tests work.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5693) When creating a region, the master initializes it and creates a memstore within the master server

2012-04-01 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243751#comment-13243751
 ] 

nkeywal commented on HBASE-5693:


I didn't look very far in the code. CreateTableHandler  is executed on the
master. It does not need to initialize the memstore  so on.
The underlying method is called from the region server as well; and here
the initialization code is called. May be there is some thing more complex
I didn't see, but at least all the unit tests went well.



On Sun, Apr 1, 2012 at 5:28 PM, Ted Yu (Commented) (JIRA)



 When creating a region, the master initializes it and creates a memstore 
 within the master server
 -

 Key: HBASE-5693
 URL: https://issues.apache.org/jira/browse/HBASE-5693
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5693.v1.patch


 I didn't do a complete analysis, but the attached patch saves more than 0.25s 
 for each region creation and locally all the unit tests work.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5693) When creating a region, the master initializes it and creates a memstore within the master server

2012-04-01 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13243775#comment-13243775
 ] 

nkeywal commented on HBASE-5693:


Ok, I will do that + a test on a real cluster.

On Sun, Apr 1, 2012 at 6:12 PM, Ted Yu (Commented) (JIRA)



 When creating a region, the master initializes it and creates a memstore 
 within the master server
 -

 Key: HBASE-5693
 URL: https://issues.apache.org/jira/browse/HBASE-5693
 Project: HBase
  Issue Type: Improvement
  Components: master, regionserver
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5693.v1.patch


 I didn't do a complete analysis, but the attached patch saves more than 0.25s 
 for each region creation and locally all the unit tests work.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads

2012-03-30 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242117#comment-13242117
 ] 

nkeywal commented on HBASE-5573:


It can be committed imho. 

 Replace client ZooKeeper watchers by simple ZooKeeper reads
 ---

 Key: HBASE-5573
 URL: https://issues.apache.org/jira/browse/HBASE-5573
 Project: HBase
  Issue Type: Improvement
  Components: client, zookeeper
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5573.v1.patch, 5573.v2.patch, 5573.v4.patch, 
 5573.v6.patch, 5573.v7.patch, 5573.v8.patch


 Some code in the package needs to read data in ZK. This could be done by a 
 simple read, but is actually implemented with a watcher. This holds ZK 
 resources.
 Fixing this could also be an opportunity to remove the need for the client to 
 provide the master address and port.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5666) RegionServer doesn't retry to check if base node is available

2012-03-30 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13242796#comment-13242796
 ] 

nkeywal commented on HBASE-5666:


I confirm I didn't modify this part in trunk. But who knows. I will have a look 
at it this week end.

 RegionServer doesn't retry to check if base node is available
 -

 Key: HBASE-5666
 URL: https://issues.apache.org/jira/browse/HBASE-5666
 Project: HBase
  Issue Type: Bug
  Components: regionserver, zookeeper
Reporter: Matteo Bertozzi
Assignee: Matteo Bertozzi
 Attachments: hbase-1-regionserver.log, hbase-2-regionserver.log, 
 hbase-3-regionserver.log, hbase-master.log, hbase-regionserver.log, 
 hbase-zookeeper.log


 I've a script that starts hbase and a couple of region servers in distributed 
 mode (hbase.cluster.distributed = true)
 {code}
 $HBASE_HOME/bin/start-hbase.sh
 $HBASE_HOME/bin/local-regionservers.sh start 1 2 3
 {code}
 but the region servers are not able to start...
 It seems that during the RS start the the znode is still not available, and 
 HRegionServer.initializeZooKeeper() check just once if the base not is 
 available.
 {code}
 2012-03-28 21:54:05,013 INFO 
 org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Check the value 
 configured in 'zookeeper.znode.parent'. There could be a mismatch with the 
 one configured in the master.
 2012-03-28 21:54:08,598 FATAL 
 org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server 
 localhost,60202,133296824: Initialization of RS failed.  Hence aborting 
 RS.
 java.io.IOException: Received the shutdown message while waiting.
   at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.blockAndCheckIfStopped(HRegionServer.java:626)
   at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:596)
   at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:558)
   at 
 org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:672)
   at java.lang.Thread.run(Thread.java:662)
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads

2012-03-23 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13236499#comment-13236499
 ] 

nkeywal commented on HBASE-5573:


bq. System.exit(1);
Actually is was already like that in hbasefsck, I replaced a tracker by a 
watcher that does not watch to read the data, that's all.

bq. Why not add a create method to ZooKeeperWatcher that takes a name, conf, 
and Abortable? Or is that a ZKW Constructor altogether?
Yes, the question is what to do when you're asked to abort. Here I reused the 
approach in hbasefsck, just exit.

bq. N, can you explain more about what is going on here. How is it that we are 
not taking a Watcher when we are creating a ZKW? Because we don't call start? 
(If so, that'd be 'elegant' solution)

A ZKW is a watcher. When you create a ZKW, you create a RecoverableZooKeeper 
with yourself as a parameter. Pseudo code is:
{noformat}
class RecoverableZooKeeper {
 ZooKeeper zk;
 RecoverableZooKeeper (Watcher w){ zk=new ZooKeeper(w) }
}

class ZooKeeperWatcher implements Watcher  {
 RecoverableZooKeeper rz;
 ZooKeeperWatcher (){ rz = new RecoverableZooKeeper(this); }
}
{noformat}

Using 'this' in a constructor is looking for problems but it works in this case 
(remember, that's the existing code, not mine :-) ). Basically all these 
classes are very strongly coupled. When I tried to partially decouple them it 
exploded in my hands because you anyway need a watcher to manage the session 
expiry stuff. I don't have a middle solution here: it's either a full rewriting 
with a lot of fun to keep the existing interfaces for backward compatibility or 
nothing.

So in the final patch I've just done some cleanup (removed the last usage of 
getZooKeeperWatcher) and the usage of any watcher. So there's no proof in the 
code, just that actually all the functions we use on the client don't use a 
watcher. Anyway, they have a session in the ZK servers so they are expensive. 
But thanks to #5399 the session on ZK will be closed after 5 minutes. So if you 
have an architecture with clients coming up and down, you will be able to 
increase the number of clients. 

Three last comments:
- one of the design issue is that there ate two API: you can use directly any 
of the ZKW, RZK, RK object or you can go through the static ZKUtils. May be the 
intermediate solutions lie around this area.
- even if the existing design should not be shown to innocent scholars it's not 
that terrible, because it's small. I didn't really like my first patches 
because I was adding more classes and complexity without fixing the design. 
- On the long term, I think that it actually make sense to have a watcher in 
the client. It's not about the previous code: The previous code was not really 
using watchers. The previous code was setting watchers without using them. The 
new code (after #5399 and #5573) does not use or set watchers. But when you 
have a fat client architecture like we have, it makes sense to share some 
global state information, and it scales better when the info is pushed vs. 
pulled. Having said that, there are many questions left: possible issues in how 
expensive it is with ZooKeeper today, may be ZooKeeper is not really designed 
for this (it's not really a global coordination work, as the client would be 
readers only) and so on. FWIW, it seems that the current limit is around 10K 
sessions in ZK:

{panel}
Patrick Hunt / Nov 18, 2010; 8:57pm
Re: number of clients/watchers

fyi: I haven't heard of anyone running over 10k sessions. I've tried 20k before 
and had issues. 
[...]
A session is represented by a ZooKeeper object. One session per object. So if 
you have 10 client hosts each creating it's own ZooKeeper instance you'll have 
10 sessions. This is regardless of the number of znodes, watches, etc... 
Watches were designed to be lightweight and you can maintain a large number of 
them. (25million spread across 500 sessions in my example)
{panel}

There were also a discussion on ZK mailing list about lightweith sessions. 
http://markmail.org/message/cyow2xkneh2t3juc



 Replace client ZooKeeper watchers by simple ZooKeeper reads
 ---

 Key: HBASE-5573
 URL: https://issues.apache.org/jira/browse/HBASE-5573
 Project: HBase
  Issue Type: Improvement
  Components: client, zookeeper
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5573.v1.patch, 5573.v2.patch, 5573.v4.patch, 
 5573.v6.patch


 Some code in the package needs to read data in ZK. This could be done by a 
 simple read, but is actually implemented with a watcher. This holds ZK 
 resources.
 Fixing this could also be an opportunity to remove the need for the client to 
 provide the master address and port.

--
This message 

[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads

2012-03-23 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237143#comment-13237143
 ] 

nkeywal commented on HBASE-5573:


bq. OK. Any recommendation you can make here having been down deep in this 
code? We should make everyone go via ZKUtils and via ZKAssign, etc., and clean 
up any other errant use of zkw directly? Would that be good to do?
It would do no harm as it's not good to have two APIs. I could be a first step 
to change the internal design. I haven't checked the impact.

bq. Do you want to be consistent? You call methods getZKW most times and then 
getZooKeeperWatcher in this test code (I prefer the latter).
Ok, I will change all this to getZooKeeperWatcher.

 Replace client ZooKeeper watchers by simple ZooKeeper reads
 ---

 Key: HBASE-5573
 URL: https://issues.apache.org/jira/browse/HBASE-5573
 Project: HBase
  Issue Type: Improvement
  Components: client, zookeeper
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5573.v1.patch, 5573.v2.patch, 5573.v4.patch, 
 5573.v6.patch


 Some code in the package needs to read data in ZK. This could be done by a 
 simple read, but is actually implemented with a watcher. This holds ZK 
 resources.
 Fixing this could also be an opportunity to remove the need for the client to 
 provide the master address and port.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4993) Performance regression in minicluster creation

2012-03-23 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237342#comment-13237342
 ] 

nkeywal commented on HBASE-4993:


Hi JD,

You're right, there is a bug. The code does not do what the documentation
says it does.
It seems that changing the code to {code} (count  minToStart ||
lastCountChange+interval  now){code} would make the code works as
documented.

If you have 0 region servers that checked in and you are under the
interval, you wait: (true or true) = true.
If you have 0 region servers but you are above the interval, you wait:
(true or false) = true.
If you have 1 or more region servers that checked in and you are under the
interval, you wait: (false or true) = true.

Would that be ok for you, or do you want to change the documented behavior
as well?
Note that we could also change the default values for the min number of
servers or the interval.

On Fri, Mar 23, 2012 at 11:07 PM, Jean-Daniel Cryans (Commented) (JIRA) 



 Performance regression in minicluster creation
 --

 Key: HBASE-4993
 URL: https://issues.apache.org/jira/browse/HBASE-4993
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.94.0

 Attachments: 4993.patch, 4993.v3.patch


 Side effect of 4610: the mini cluster needs 4,5 seconds to start

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4955) Use the official versions of surefire junit

2012-03-22 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13235605#comment-13235605
 ] 

nkeywal commented on HBASE-4955:


Monthly update:
 - The JUnit team merged my fix on the JUnit trunk, so the next JUnit version 
(should be 4.11) will be ok for us. I don't have the release date however.
 - The surefire regression (SUREFIRE-827) is not yet fixed, still planned for 
2.13. No release date.

 Use the official versions of surefire  junit
 -

 Key: HBASE-4955
 URL: https://issues.apache.org/jira/browse/HBASE-4955
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor

 We currently use private versions for Surefire  JUnit since HBASE-4763.
 This JIRA traks what we need to move to official versions.
 Surefire 2.11 is just out, but, after some tests, it does not contain all 
 what we need.
 JUnit. Could be for JUnit 4.11. Issue to monitor:
 https://github.com/KentBeck/junit/issues/359: fixed in our version, no 
 feedback for an integration on trunk
 Surefire: Could be for Surefire 2.12. Issues to monitor are:
 329 (category support): fixed, we use the official implementation from the 
 trunk
 786 (@Category with forkMode=always): fixed, we use the official 
 implementation from the trunk
 791 (incorrect elapsed time on test failure): fixed, we use the official 
 implementation from the trunk
 793 (incorrect time in the XML report): Not fixed (reopen) on trunk, fixed on 
 our version.
 760 (does not take into account the test method): fixed in trunk, not fixed 
 in our version
 798 (print immediately the test class name): not fixed in trunk, not fixed in 
 our version
 799 (Allow test parallelization when forkMode=always): not fixed in trunk, 
 not fixed in our version
 800 (redirectTestOutputToFile not taken into account): not yet fix on trunk, 
 fixed on our version
 800  793 are the more important to monitor, it's the only ones that are 
 fixed in our version but not on trunk.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads

2012-03-21 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234254#comment-13234254
 ] 

nkeywal commented on HBASE-5573:


For ZK and ZKW dependency, prefixing zk for the ZooKeeper api objects and hb 
for the HBase objects:
a zk.ZooKeeper can be created with a zk.Watcher as a parameter for the 
constructor. a hb.ZooKeeperWatcher implements the zk.Watcher interface. So we 
have:

hb.ZooKeeperWatcher implements zk.ZooKeeper.
hb.ZooKeeperWatcher contains hb.RecoverableZooKeeper
hb.RecoverableZooKeeper contains zk.ZooKeeper
zk.ZooKeeper contains (hb.ZooKeeperWatcher implements zk.ZooKeeper)

loop done.

It was like this before my patch. After my patch there are two cases:
1) As above
2) for ZooKeeperHBaseNodes, the watcher is actually null, so it becomes:
hb.ZooKeeperHBaseNodes contains hb.RecoverableZooKeeper
hb.RecoverableZooKeeper contains zk.ZooKeeper
zk.ZooKeeper contains null

It allows to share the code, but it makes it more complex.


bq. I see. ROZK has a RZK. Can RZK have a ZooKeeper only?
Yes, but it would lead to some code duplication and we would lose the 
recoverable feature (or we would need to duplicate it as well). 


bq. ZooKeeperHBaseNodes = ReadOnlyZooKeeper
It's not really a readonly zookeeper: you can write with it. 


bq. Looks good
Ok, I am gonna finish it with this approach then. 

 Replace client ZooKeeper watchers by simple ZooKeeper reads
 ---

 Key: HBASE-5573
 URL: https://issues.apache.org/jira/browse/HBASE-5573
 Project: HBase
  Issue Type: Improvement
  Components: client, zookeeper
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5573.v1.patch, 5573.v2.patch


 Some code in the package needs to read data in ZK. This could be done by a 
 simple read, but is actually implemented with a watcher. This holds ZK 
 resources.
 Fixing this could also be an opportunity to remove the need for the client to 
 provide the master address and port.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads

2012-03-21 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234413#comment-13234413
 ] 

nkeywal commented on HBASE-5573:


bq; This seems way broke Nicolas: zk.ZooKeeper contains (hb.ZooKeeperWatcher 
implements zk.ZooKeeper)
I wanted to say that in ZK API, you want give a Watcher as a parameter to the 
ZooKeeper object. In HBase, this watcher is the ZooKeeperWatcher. And this 
ZooKeeperWatcher  contains the RecoverableZK that contains the ZooKeeper 
object, so we have a loop.

bq. Is ZKW doing everything, not just Watching? If so, would fixing this help?
Yes it does everything. With the split there is now a new object when we just 
want to read/write.

For the name, let's go for NoWatchZK.

I'm currently testing the patch. There is still an issue with stuff like 
ZKAssign.getData: it sets a watcher, but is it really needed?


 Replace client ZooKeeper watchers by simple ZooKeeper reads
 ---

 Key: HBASE-5573
 URL: https://issues.apache.org/jira/browse/HBASE-5573
 Project: HBase
  Issue Type: Improvement
  Components: client, zookeeper
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5573.v1.patch, 5573.v2.patch


 Some code in the package needs to read data in ZK. This could be done by a 
 simple read, but is actually implemented with a watcher. This holds ZK 
 resources.
 Fixing this could also be an opportunity to remove the need for the client to 
 provide the master address and port.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads

2012-03-21 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13234462#comment-13234462
 ] 

nkeywal commented on HBASE-5573:


Oops; there is another -big- issue: you need a watcher if you want to get the 
info on session expiry... So isolating it is not possible. That breaks the 
patch it seems. 

 Replace client ZooKeeper watchers by simple ZooKeeper reads
 ---

 Key: HBASE-5573
 URL: https://issues.apache.org/jira/browse/HBASE-5573
 Project: HBase
  Issue Type: Improvement
  Components: client, zookeeper
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5573.v1.patch, 5573.v2.patch


 Some code in the package needs to read data in ZK. This could be done by a 
 simple read, but is actually implemented with a watcher. This holds ZK 
 resources.
 Fixing this could also be an opportunity to remove the need for the client to 
 provide the master address and port.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5551) Some functions should not be used by customer code and must be deprecated in 0.94

2012-03-20 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233540#comment-13233540
 ] 

nkeywal commented on HBASE-5551:


@Jonathan
The client code should not need to connect to the master directly, so all the 
functions should be available from HBaseAdmin or HConnection. 

 Some functions should not be used by customer code and must be deprecated in 
 0.94
 -

 Key: HBASE-5551
 URL: https://issues.apache.org/jira/browse/HBASE-5551
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.92.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.94.0

 Attachments: 5551.092.patch


 They are:
 HBaseAdmin#getMaster
 HConnection#getZooKeeperWatcher
 HConnection#getMaster

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5573) Replace client ZooKeeper watchers by simple ZooKeeper reads

2012-03-20 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13233721#comment-13233721
 ] 

nkeywal commented on HBASE-5573:


Patch to get a first feedback. 

Unfortunately, it's more a hack than anything else, because I'am trying to keep 
the existing code  interface and not rewriting everything.

Today HBase considers any ZK client as a client that will watch the values, and 
does not distinguish simple readers vs. watchers. To change this, I:
- Split ZooKeeperWatcher in two classes, one ZooKeeperWatcher with the same 
responsibilities as today, and another, ZooKeeperHBaseNodes, that contains the 
hbase znode definition. ZooKeeperWatcher extends ZooKeeperHBaseNodes.
- In ZKUtils, depending if a watch is involved or not, changed the expected 
type from ZooKeeperWatcher to ZooKeeperHBaseNodes. 

That's not a hack yet. The issues are:
- The client is supposed to wait if the root location znode is not yet created 
in ZK. I don't think that the trunk implementation actually works. But it's 
done with a watcher. As we don't want a watcher, I changed it to a loop.

- As HConnectionImplementation now uses a simple connection and not a Watcher, 
the deprecated interface (that returns a ZooKeeperWatcher) cannot reuse the 
internal connection to ZK, but must be duplicated.

- In trunk, the current dependencies are:
 - RecovableZooKeeper depends(contains) on ZooKeeper
 - ZooKeeper depends(contains) on ZooKeeperWatcher
 - ZooKeeperWatcher depends(contains) RecovableZooKeeper
 - ZKUtils depends(uses) RecovableZooKeeper and ZooKeeperWatcher

That makes it difficult to reuse any part of code without having a 
ZooKeeperWatcher. To be able to reuse it, what's happening when using a 
ZooKeeperHBaseNodes is that the underlying ZooKeeperWatcher is actually null.

I still have to do a lot of renaming if we go for this approach.

I had some failure that could be unrelated, but I haven't looked at them yet:
org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster
org.apache.hadoop.hbase.io.encoding.TestLoadAndSwitchEncodeOnDisk

 Replace client ZooKeeper watchers by simple ZooKeeper reads
 ---

 Key: HBASE-5573
 URL: https://issues.apache.org/jira/browse/HBASE-5573
 Project: HBase
  Issue Type: Improvement
  Components: client, zookeeper
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor

 Some code in the package needs to read data in ZK. This could be done by a 
 simple read, but is actually implemented with a watcher. This holds ZK 
 resources.
 Fixing this could also be an opportunity to remove the need for the client to 
 provide the master address and port.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5569) Do not collect deleted KVs when they are still in use by a scanner.

2012-03-19 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232513#comment-13232513
 ] 

nkeywal commented on HBASE-5569:


I stopped it after 2700 iterations (10 hours), no error = patch seems to be 
fix the issue...

 Do not collect deleted KVs when they are still in use by a scanner.
 ---

 Key: HBASE-5569
 URL: https://issues.apache.org/jira/browse/HBASE-5569
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.94.0, 0.96.0

 Attachments: 5569-v2.txt, 5569-v3.txt, 5569-v4.txt, 5569.txt, 
 TestAtomicOperation-output.trunk_120313.rar


 I noticed this because TestAtomicOperation.testMultiRowMutationMultiThreads 
 fails rarely.
 The solution is similar to HBASE-2856, where expired KVs are not collected 
 when in use by a scanner.
 ---
 What I pieced together so far is that it is the *scanning* side that has 
 problems sometimes.
 Every time I see a assertion failure in the log I see this before:
 {quote}
 2012-03-12 21:48:49,523 DEBUG [Thread-211] regionserver.StoreScanner(499): 
 Storescanner.peek() is changed where before = 
 rowB/colfamily11:qual1/75366/Put/vlen=6,and after = 
 rowB/colfamily11:qual1/75203/DeleteColumn/vlen=0
 {quote}
 The order of if the Put and Delete is sometimes reversed.
 The test threads should always see exactly one KV, if the before was the 
 Put the thread see 0 KVs, if the before was the Delete the threads see 2 
 KVs.
 This debug message comes from StoreScanner to checkReseek. It seems we still 
 some consistency issue with scanning sometimes :(

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5569) Do not collect deleted KVs when they are still in use by a scanner.

2012-03-18 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232382#comment-13232382
 ] 

nkeywal commented on HBASE-5569:


I've got the testRowMutationMultiThreads running currently on the patch v3. No 
issue so far. I will make it run 5000 times, previously it always failed before 
1000 iterations.

 Do not collect deleted KVs when they are still in use by a scanner.
 ---

 Key: HBASE-5569
 URL: https://issues.apache.org/jira/browse/HBASE-5569
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.94.0, 0.96.0

 Attachments: 5569-v2.txt, 5569-v3.txt, 5569.txt, 
 TestAtomicOperation-output.trunk_120313.rar


 I noticed this because TestAtomicOperation.testMultiRowMutationMultiThreads 
 fails rarely.
 The solution is similar to HBASE-2856, where expired KVs are not collected 
 when in use by a scanner.
 ---
 What I pieced together so far is that it is the *scanning* side that has 
 problems sometimes.
 Every time I see a assertion failure in the log I see this before:
 {quote}
 2012-03-12 21:48:49,523 DEBUG [Thread-211] regionserver.StoreScanner(499): 
 Storescanner.peek() is changed where before = 
 rowB/colfamily11:qual1/75366/Put/vlen=6,and after = 
 rowB/colfamily11:qual1/75203/DeleteColumn/vlen=0
 {quote}
 The order of if the Put and Delete is sometimes reversed.
 The test threads should always see exactly one KV, if the before was the 
 Put the thread see 0 KVs, if the before was the Delete the threads see 2 
 KVs.
 This debug message comes from StoreScanner to checkReseek. It seems we still 
 some consistency issue with scanning sometimes :(

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5569) Do not collect deleted KVs when they are still in use by a scanner.

2012-03-18 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232390#comment-13232390
 ] 

nkeywal commented on HBASE-5569:


Right now it's still running well. I'm doing the test on a small server, with a 
4 core Intel Xeon E3-1220.

 Do not collect deleted KVs when they are still in use by a scanner.
 ---

 Key: HBASE-5569
 URL: https://issues.apache.org/jira/browse/HBASE-5569
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Assignee: Lars Hofhansl
 Fix For: 0.94.0, 0.96.0

 Attachments: 5569-v2.txt, 5569-v3.txt, 5569-v4.txt, 5569.txt, 
 TestAtomicOperation-output.trunk_120313.rar


 I noticed this because TestAtomicOperation.testMultiRowMutationMultiThreads 
 fails rarely.
 The solution is similar to HBASE-2856, where expired KVs are not collected 
 when in use by a scanner.
 ---
 What I pieced together so far is that it is the *scanning* side that has 
 problems sometimes.
 Every time I see a assertion failure in the log I see this before:
 {quote}
 2012-03-12 21:48:49,523 DEBUG [Thread-211] regionserver.StoreScanner(499): 
 Storescanner.peek() is changed where before = 
 rowB/colfamily11:qual1/75366/Put/vlen=6,and after = 
 rowB/colfamily11:qual1/75203/DeleteColumn/vlen=0
 {quote}
 The order of if the Put and Delete is sometimes reversed.
 The test threads should always see exactly one KV, if the before was the 
 Put the thread see 0 KVs, if the before was the Delete the threads see 2 
 KVs.
 This debug message comes from StoreScanner to checkReseek. It seems we still 
 some consistency issue with scanning sometimes :(

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5549) Master can fail if ZooKeeper session expires

2012-03-17 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13232005#comment-13232005
 ] 

nkeywal commented on HBASE-5549:


@lars: there is no theoretical dependency. In practice, I think that both 
impacted some same tests so it may require some minor changes when applying the 
patch.

 Master can fail if ZooKeeper session expires
 

 Key: HBASE-5549
 URL: https://issues.apache.org/jira/browse/HBASE-5549
 Project: HBase
  Issue Type: Bug
  Components: master, zookeeper
Affects Versions: 0.96.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5549.v10.patch, 5549.v11.patch, 5549.v6.patch, 
 5549.v7.patch, 5549.v8.patch, 5549.v9.patch, nochange.patch


 There is a retry mechanism in RecoverableZooKeeper, but when the session 
 expires, the whole ZooKeeperWatcher is recreated, hence the retry mechanism 
 does not work in this case. This is why a sleep is needed in 
 TestZooKeeper#testMasterSessionExpired: we need to wait for ZooKeeperWatcher 
 to be recreated before using the connection.
 This can happen in real life, it can happen when:
 - master  zookeeper starts
 - zookeeper connection is cut
 - master enters the retry loop
 - in the meantime the session expires
 - the network comes back, the session is recreated
 - the retries continues, but on the wrong object, hence fails.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5549) Master can fail if ZooKeeper session expires

2012-03-16 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231305#comment-13231305
 ] 

nkeywal commented on HBASE-5549:


Can be committed. Hopefully this is the end of the ZooKeeper expiry flakiness.


 Master can fail if ZooKeeper session expires
 

 Key: HBASE-5549
 URL: https://issues.apache.org/jira/browse/HBASE-5549
 Project: HBase
  Issue Type: Bug
  Components: master, zookeeper
Affects Versions: 0.96.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5549.v10.patch, 5549.v6.patch, 5549.v7.patch, 
 5549.v8.patch, 5549.v9.patch, nochange.patch


 There is a retry mechanism in RecoverableZooKeeper, but when the session 
 expires, the whole ZooKeeperWatcher is recreated, hence the retry mechanism 
 does not work in this case. This is why a sleep is needed in 
 TestZooKeeper#testMasterSessionExpired: we need to wait for ZooKeeperWatcher 
 to be recreated before using the connection.
 This can happen in real life, it can happen when:
 - master  zookeeper starts
 - zookeeper connection is cut
 - master enters the retry loop
 - in the meantime the session expires
 - the network comes back, the session is recreated
 - the retries continues, but on the wrong object, hence fails.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5549) Master can fail if ZooKeeper session expires

2012-03-16 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231441#comment-13231441
 ] 

nkeywal commented on HBASE-5549:


Yes, see HBASE-5572 for the reasons...

On Fri, Mar 16, 2012 at 6:41 PM, Zhihong Yu (Commented) (JIRA) 



 Master can fail if ZooKeeper session expires
 

 Key: HBASE-5549
 URL: https://issues.apache.org/jira/browse/HBASE-5549
 Project: HBase
  Issue Type: Bug
  Components: master, zookeeper
Affects Versions: 0.96.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5549.v10.patch, 5549.v6.patch, 5549.v7.patch, 
 5549.v8.patch, 5549.v9.patch, nochange.patch


 There is a retry mechanism in RecoverableZooKeeper, but when the session 
 expires, the whole ZooKeeperWatcher is recreated, hence the retry mechanism 
 does not work in this case. This is why a sleep is needed in 
 TestZooKeeper#testMasterSessionExpired: we need to wait for ZooKeeperWatcher 
 to be recreated before using the connection.
 This can happen in real life, it can happen when:
 - master  zookeeper starts
 - zookeeper connection is cut
 - master enters the retry loop
 - in the meantime the session expires
 - the network comes back, the session is recreated
 - the retries continues, but on the wrong object, hence fails.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5549) Master can fail if ZooKeeper session expires

2012-03-16 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231444#comment-13231444
 ] 

nkeywal commented on HBASE-5549:


Can't create a review, I got error 500 as well...

 Master can fail if ZooKeeper session expires
 

 Key: HBASE-5549
 URL: https://issues.apache.org/jira/browse/HBASE-5549
 Project: HBase
  Issue Type: Bug
  Components: master, zookeeper
Affects Versions: 0.96.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5549.v10.patch, 5549.v6.patch, 5549.v7.patch, 
 5549.v8.patch, 5549.v9.patch, nochange.patch


 There is a retry mechanism in RecoverableZooKeeper, but when the session 
 expires, the whole ZooKeeperWatcher is recreated, hence the retry mechanism 
 does not work in this case. This is why a sleep is needed in 
 TestZooKeeper#testMasterSessionExpired: we need to wait for ZooKeeperWatcher 
 to be recreated before using the connection.
 This can happen in real life, it can happen when:
 - master  zookeeper starts
 - zookeeper connection is cut
 - master enters the retry loop
 - in the meantime the session expires
 - the network comes back, the session is recreated
 - the retries continues, but on the wrong object, hence fails.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5549) Master can fail if ZooKeeper session expires

2012-03-16 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231453#comment-13231453
 ] 

nkeywal commented on HBASE-5549:


Now it works: https://reviews.apache.org/r/4391/

 Master can fail if ZooKeeper session expires
 

 Key: HBASE-5549
 URL: https://issues.apache.org/jira/browse/HBASE-5549
 Project: HBase
  Issue Type: Bug
  Components: master, zookeeper
Affects Versions: 0.96.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5549.v10.patch, 5549.v6.patch, 5549.v7.patch, 
 5549.v8.patch, 5549.v9.patch, nochange.patch


 There is a retry mechanism in RecoverableZooKeeper, but when the session 
 expires, the whole ZooKeeperWatcher is recreated, hence the retry mechanism 
 does not work in this case. This is why a sleep is needed in 
 TestZooKeeper#testMasterSessionExpired: we need to wait for ZooKeeperWatcher 
 to be recreated before using the connection.
 This can happen in real life, it can happen when:
 - master  zookeeper starts
 - zookeeper connection is cut
 - master enters the retry loop
 - in the meantime the session expires
 - the network comes back, the session is recreated
 - the retries continues, but on the wrong object, hence fails.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5549) Master can fail if ZooKeeper session expires

2012-03-16 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13231598#comment-13231598
 ] 

nkeywal commented on HBASE-5549:


v11 with the comments taken into account... Thank you for the review.

 Master can fail if ZooKeeper session expires
 

 Key: HBASE-5549
 URL: https://issues.apache.org/jira/browse/HBASE-5549
 Project: HBase
  Issue Type: Bug
  Components: master, zookeeper
Affects Versions: 0.96.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5549.v10.patch, 5549.v11.patch, 5549.v6.patch, 
 5549.v7.patch, 5549.v8.patch, 5549.v9.patch, nochange.patch


 There is a retry mechanism in RecoverableZooKeeper, but when the session 
 expires, the whole ZooKeeperWatcher is recreated, hence the retry mechanism 
 does not work in this case. This is why a sleep is needed in 
 TestZooKeeper#testMasterSessionExpired: we need to wait for ZooKeeperWatcher 
 to be recreated before using the connection.
 This can happen in real life, it can happen when:
 - master  zookeeper starts
 - zookeeper connection is cut
 - master enters the retry loop
 - in the meantime the session expires
 - the network comes back, the session is recreated
 - the retries continues, but on the wrong object, hence fails.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

2012-03-15 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13230035#comment-13230035
 ] 

nkeywal commented on HBASE-5572:


@stack: it seems the patch could be committed?

 KeeperException.SessionExpiredException management could be improved in Master
 --

 Key: HBASE-5572
 URL: https://issues.apache.org/jira/browse/HBASE-5572
 Project: HBase
  Issue Type: Improvement
  Components: master
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 
 5572.v2.patch


 Synthesis:
  1) TestMasterZKSessionRecovery distinguish two cases on 
 SessionExpiredException. One is explicitly not managed. However, is seems 
 that there is no reason for this.
  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a 
 quite complex function, with a useless recursive call.
  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is 
 equivalent to TestZooKeeper#testMasterSessionExpired
  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be 
 removed if we merge the two cases mentioned above.
 Changes are:
  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a 
 single case and remove recursion
  1) Removing TestMasterZKSessionRecovery
 Detailed justification:
 testMasterZKSessionRecoveryFailure says:
 {noformat}
   /**
* Negative test of master recovery from zk session expiry.
*
* Starts with one master. Fakes the master zk session expired.
* Ensures the master cannot recover the expired zk session since
* the master zk node is still there.
*/
   public void testMasterZKSessionRecoveryFailure() throws Exception {
 MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
 HMaster m = cluster.getMaster();
 m.abort(Test recovery from zk session expired,
   new KeeperException.SessionExpiredException());
 assertTrue(m.isStopped());
   }
 {noformat}
 This tests works, i.e. the assertion is always verified.
 But do we really want this behavior?
 When looking at the code, we see that this what's happening is strange:
 - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false 
 HMaster#abort stops the master.
 - HMaster#abortNow checks the exception type. As it's a 
 SessionExpiredException it will try to recover, calling 
 HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false 
 (and that will make HMaster#abort stopping the master)
 - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and 
 then try to become the active master. If it cannot, it will return false (and 
 that will make HMaster#abort stopping the master).
 - HMaster#becomeActiveMaster returns the result of 
 ActiveMasterManager#blockUntilBecomingActiveMaster. 
 blockUntilBecomingActiveMaster says it will return false if there is any 
 error preventing it to become the active master.
 - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master 
 address. If it's the same port  host, it deletes the nodes, that will start 
 a recursive call to blockUntilBecomingActiveMaster. This second call succeeds 
 (we became the active master) and return true. This result is ignored by the 
 first blockUntilBecomingActiveMaster: it return false (even if we actually 
 became the active master), hence the whole suite call returns false and 
 HMaster#abort stops the master.
 In other words, the comment says Ensures the master cannot recover the 
 expired zk session since the master zk node is still there. but we're 
 actually doing a check just for this and deleting the node. If we were not 
 ignoring the result, we would return true, so we would not stop the master, 
 so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-15 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13230040#comment-13230040
 ] 

nkeywal commented on HBASE-5399:


I've actually been able to reproduce the issue on a previous trunk version, so 
it's not linked to my changes, and that's a great step forward in term of 
rationality.

Sometimes when we trigger a zk session expiry the zk connection remains in the 
connected state. I don't have any explication for this, but at least it's not 
a new behavior. I will manage this case explicitly in the version I'm writing 
for #5549.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 
 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 
 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 
 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 
 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 
 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 
 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 
 5399_inprogress.v9.patch, nochange.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5569) TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally

2012-03-14 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13229035#comment-13229035
 ] 

nkeywal commented on HBASE-5569:


There's no message on Storescanner.peek, nor error or warning. Here's the log 
when it fails:

{noformat}
2012-03-14 03:14:02,146 DEBUG [Thread-51] 
regionserver.TestAtomicOperation$1(305): keyvalues=NONE
Exception in thread Thread-51 
junit.framework.AssertionFailedError
at junit.framework.Assert.fail(Assert.java:48)
at junit.framework.Assert.fail(Assert.java:56)
at 
org.apache.hadoop.hbase.regionserver.TestAtomicOperation$1.run(TestAtomicOperation.java:307)
2012-03-14 03:14:02,228 DEBUG [Thread-92] 
regionserver.TestAtomicOperation$1(279): flushing
{noformat}

Reproduced on Feb' 24th trunk as well, after ~700 iterations, same logs.

 TestAtomicOperation.testMultiRowMutationMultiThreads fails occasionally
 ---

 Key: HBASE-5569
 URL: https://issues.apache.org/jira/browse/HBASE-5569
 Project: HBase
  Issue Type: Bug
Reporter: Lars Hofhansl
Priority: Minor
 Attachments: TestAtomicOperation-output.trunk_120313.rar


 What I pieced together so far is that it is the *scanning* side that has 
 problems sometimes.
 Every time I see a assertion failure in the log I see this before:
 {quote}
 2012-03-12 21:48:49,523 DEBUG [Thread-211] regionserver.StoreScanner(499): 
 Storescanner.peek() is changed where before = 
 rowB/colfamily11:qual1/75366/Put/vlen=6,and after = 
 rowB/colfamily11:qual1/75203/DeleteColumn/vlen=0
 {quote}
 The order of if the Put and Delete is sometimes reversed.
 The test threads should always see exactly one KV, if the before was the 
 Put the thread see 0 KVs, if the before was the Delete the threads see 2 
 KVs.
 This debug message comes from StoreScanner to checkReseek. It seems we still 
 some consistency issue with scanning sometimes :(

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-14 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13229476#comment-13229476
 ] 

nkeywal commented on HBASE-5399:


You're right. It's a regression. I will fix this with 5549. I'm having a hard 
time trying to understand what's going on exactly here, and the sleep coding 
pattern does not help to say the least.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 
 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 
 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 
 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 
 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 
 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 
 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 
 5399_inprogress.v9.patch, nochange.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-13 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228300#comment-13228300
 ] 

nkeywal commented on HBASE-5399:


@stack
Yes, this test is flaky... I reproduce the error on the trunk as of March 10th 
as well. I've seen it failing previously, I think it's flaky for at the very 
least a month (and may be much more)

git log:
{noformat}
commit 0f3e025a62f89763fffbf8298d565a6c4e5b7d06
Date:   Sat Mar 10 02:27:05 2012 +
{noformat}


With the same stack as in trunk #2676:
{noformat}
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.444 sec  
FAILURE!
testMultiRowMutationMultiThreads(org.apache.hadoop.hbase.regionserver.TestAtomicOperation)
  Time elapsed: 7.083 sec   FAILURE!
junit.framework.AssertionFailedError: expected:0 but was:1
at junit.framework.Assert.fail(Assert.java:50)
at junit.framework.Assert.failNotEquals(Assert.java:287)
at junit.framework.Assert.assertEquals(Assert.java:67)
at junit.framework.Assert.assertEquals(Assert.java:199)
at junit.framework.Assert.assertEquals(Assert.java:205)
at 
org.apache.hadoop.hbase.regionserver.TestAtomicOperation.testMultiRowMutationMultiThreads(TestAtomicOperation.java:416)
{noformat}

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 
 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 
 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 
 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 
 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 
 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 
 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 
 5399_inprogress.v9.patch, nochange.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 

[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

2012-03-13 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228476#comment-13228476
 ] 

nkeywal commented on HBASE-5572:


Yes. I've done 3 modifications in the code, two like for like (hopefully!) and 
one with a different behavior. I:
- removed the variable named cleanSetOfActiveMaster, replaced by return true 
or return false. 
- replaced the recursive call by a while(true) loop. 
- implicitly (it's hidden because there is no recursive call anymore) changed 
the function behavior: we now return the final result. For this reason the 
function behaves differently (we return true instead of false), but it's more 
on line with the method contract. This change breaks the 
testMasterZKSessionRecoveryFailure, because it does not fail anymore. 

TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure was testing 
explicitly the behavior with both SessionExpired AND master with same host  
port. I removed it, but I can move it to TestZooKeeper (to save a cluster 
start/stop) and reverse the assertion in the test (now it does not fail).





 KeeperException.SessionExpiredException management could be improved in Master
 --

 Key: HBASE-5572
 URL: https://issues.apache.org/jira/browse/HBASE-5572
 Project: HBase
  Issue Type: Improvement
  Components: master
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5572.v1.patch


 Synthesis:
  1) TestMasterZKSessionRecovery distinguish two cases on 
 SessionExpiredException. One is explicitly not managed. However, is seems 
 that there is no reason for this.
  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a 
 quite complex function, with a useless recursive call.
  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is 
 equivalent to TestZooKeeper#testMasterSessionExpired
  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be 
 removed if we merge the two cases mentioned above.
 Changes are:
  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a 
 single case and remove recursion
  1) Removing TestMasterZKSessionRecovery
 Detailed justification:
 testMasterZKSessionRecoveryFailure says:
 {noformat}
   /**
* Negative test of master recovery from zk session expiry.
*
* Starts with one master. Fakes the master zk session expired.
* Ensures the master cannot recover the expired zk session since
* the master zk node is still there.
*/
   public void testMasterZKSessionRecoveryFailure() throws Exception {
 MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
 HMaster m = cluster.getMaster();
 m.abort(Test recovery from zk session expired,
   new KeeperException.SessionExpiredException());
 assertTrue(m.isStopped());
   }
 {noformat}
 This tests works, i.e. the assertion is always verified.
 But do we really want this behavior?
 When looking at the code, we see that this what's happening is strange:
 - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false 
 HMaster#abort stops the master.
 - HMaster#abortNow checks the exception type. As it's a 
 SessionExpiredException it will try to recover, calling 
 HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false 
 (and that will make HMaster#abort stopping the master)
 - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and 
 then try to become the active master. If it cannot, it will return false (and 
 that will make HMaster#abort stopping the master).
 - HMaster#becomeActiveMaster returns the result of 
 ActiveMasterManager#blockUntilBecomingActiveMaster. 
 blockUntilBecomingActiveMaster says it will return false if there is any 
 error preventing it to become the active master.
 - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master 
 address. If it's the same port  host, it deletes the nodes, that will start 
 a recursive call to blockUntilBecomingActiveMaster. This second call succeeds 
 (we became the active master) and return true. This result is ignored by the 
 first blockUntilBecomingActiveMaster: it return false (even if we actually 
 became the active master), hence the whole suite call returns false and 
 HMaster#abort stops the master.
 In other words, the comment says Ensures the master cannot recover the 
 expired zk session since the master zk node is still there. but we're 
 actually doing a check just for this and deleting the node. If we were not 
 ignoring the result, we would return true, so we would not stop the master, 
 so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please 

[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-13 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228587#comment-13228587
 ] 

nkeywal commented on HBASE-5399:


@Stack; I think it fails 20% of the time. I run it alone, i.e. with 
-Dtest=TestAtomicOperation#testMultiRowMutationMultiThreads with nothing else 
running on the machine, and a mvn clean. No clue on when it started to happen.

@Lars: I'm not sure I haven't seen failures on testRowMutationMultiThreads as 
well, I will launch a few tests to see if it happens.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 
 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 
 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 
 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 
 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 
 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 
 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 
 5399_inprogress.v9.patch, nochange.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-13 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228662#comment-13228662
 ] 

nkeywal commented on HBASE-5399:


@Lars, Stack:
After 50 tries, on trunk (fbd4bebd5cca129f49e91ec9936f604998a7025a) + 5572 I 
got it:

testRowMutationMultiThreads(org.apache.hadoop.hbase.regionserver.TestAtomicOperation):
 expected:0 but was:5
at 
org.apache.hadoop.hbase.regionserver.TestAtomicOperation.testRowMutationMultiThreads(TestAtomicOperation.java:331)

So the probability for testRowMutationMultiThreads is 10 times inferior than 
for testMultiRowMutationMultiThreads but it can occur as well..

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 
 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 
 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 
 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 
 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 
 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 
 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 
 5399_inprogress.v9.patch, nochange.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5572) KeeperException.SessionExpiredException management could be improved in Master

2012-03-13 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228689#comment-13228689
 ] 

nkeywal commented on HBASE-5572:


for an unknown reason the first two patches didn't make it to hadoop-qa. 
Rewriting once again.

 KeeperException.SessionExpiredException management could be improved in Master
 --

 Key: HBASE-5572
 URL: https://issues.apache.org/jira/browse/HBASE-5572
 Project: HBase
  Issue Type: Improvement
  Components: master
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5572.v1.patch, 5572.v2.patch, 5572.v2.patch, 
 5572.v2.patch


 Synthesis:
  1) TestMasterZKSessionRecovery distinguish two cases on 
 SessionExpiredException. One is explicitly not managed. However, is seems 
 that there is no reason for this.
  2) The issue lies in ActiveMasterManager#blockUntilBecomingActiveMaster, a 
 quite complex function, with a useless recursive call.
  3) TestMasterZKSessionRecovery#testMasterZKSessionRecoverySuccess is 
 equivalent to TestZooKeeper#testMasterSessionExpired
  4) TestMasterZKSessionRecovery#testMasterZKSessionRecoveryFailure can be 
 removed if we merge the two cases mentioned above.
 Changes are:
  2) Changing ActiveMasterManager#blockUntilBecomingActiveMaster to have a 
 single case and remove recursion
  1) Removing TestMasterZKSessionRecovery
 Detailed justification:
 testMasterZKSessionRecoveryFailure says:
 {noformat}
   /**
* Negative test of master recovery from zk session expiry.
*
* Starts with one master. Fakes the master zk session expired.
* Ensures the master cannot recover the expired zk session since
* the master zk node is still there.
*/
   public void testMasterZKSessionRecoveryFailure() throws Exception {
 MiniHBaseCluster cluster = TEST_UTIL.getHBaseCluster();
 HMaster m = cluster.getMaster();
 m.abort(Test recovery from zk session expired,
   new KeeperException.SessionExpiredException());
 assertTrue(m.isStopped());
   }
 {noformat}
 This tests works, i.e. the assertion is always verified.
 But do we really want this behavior?
 When looking at the code, we see that this what's happening is strange:
 - HMaster#abort calls Master#abortNow. If HMaster#abortNow returns false 
 HMaster#abort stops the master.
 - HMaster#abortNow checks the exception type. As it's a 
 SessionExpiredException it will try to recover, calling 
 HMaster#tryRecoveringExpiredZKSession. If it cannot, it will return false 
 (and that will make HMaster#abort stopping the master)
 - HMaster#tryRecoveringExpiredZKSession recreates a ZooKeeperConnection and 
 then try to become the active master. If it cannot, it will return false (and 
 that will make HMaster#abort stopping the master).
 - HMaster#becomeActiveMaster returns the result of 
 ActiveMasterManager#blockUntilBecomingActiveMaster. 
 blockUntilBecomingActiveMaster says it will return false if there is any 
 error preventing it to become the active master.
 - ActiveMasterManager#blockUntilBecomingActiveMaster reads ZK for the master 
 address. If it's the same port  host, it deletes the nodes, that will start 
 a recursive call to blockUntilBecomingActiveMaster. This second call succeeds 
 (we became the active master) and return true. This result is ignored by the 
 first blockUntilBecomingActiveMaster: it return false (even if we actually 
 became the active master), hence the whole suite call returns false and 
 HMaster#abort stops the master.
 In other words, the comment says Ensures the master cannot recover the 
 expired zk session since the master zk node is still there. but we're 
 actually doing a check just for this and deleting the node. If we were not 
 ignoring the result, we would return true, so we would not stop the master, 
 so the test would fail.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-13 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13228816#comment-13228816
 ] 

nkeywal commented on HBASE-5399:


It's ok, we're all in the same boat :-)
I've got the test running on a 2 weeks old version of the trunk, I will
have the result tomorrow.

On Wed, Mar 14, 2012 at 12:12 AM, Lars Hofhansl (Commented) (JIRA) 



 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 
 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 
 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 
 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 
 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 
 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 
 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 
 5399_inprogress.v9.patch, nochange.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-12 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13227455#comment-13227455
 ] 

nkeywal commented on HBASE-5399:


Ok, even an empty patch gets these errors from hadoop-qa, so the v42 can be 
committed imho.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 
 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 
 5399.v42.patch, 5399.v42.patch, 5399_inprogress.patch, 
 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 
 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 
 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 
 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 
 5399_inprogress.v9.patch, nochange.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-10 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13226903#comment-13226903
 ] 

nkeywal commented on HBASE-5399:


I don't know. The errors are the same, but I don't reproduce them locally.
Let's retry, if we have them a third time I will look at them tomorrow or 
Monday.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 
 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 
 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 
 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 
 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 
 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-09 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13226195#comment-13226195
 ] 

nkeywal commented on HBASE-5399:


They're all different from the ones I got locally. It could be pure test 
flakiness. Let's retry.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 
 5399.v40.patch, 5399.v41.patch, 5399.v42.patch, 5399.v42.patch, 
 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 
 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 
 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 
 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5551) Some functions should not be used by customer code and must be deprecated in 0.94

2012-03-09 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13226204#comment-13226204
 ] 

nkeywal commented on HBASE-5551:


Patch contains only comments  meta info modifications. Done on 0.94 branch. 
Can be committed.

 Some functions should not be used by customer code and must be deprecated in 
 0.94
 -

 Key: HBASE-5551
 URL: https://issues.apache.org/jira/browse/HBASE-5551
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.92.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.94.0

 Attachments: 5551.092.patch


 They are:
 HBaseAdmin#getMaster
 HConnection#getZooKeeperWatcher
 HConnection#getMaster

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5551) Some functions should not be used by customer code and must be deprecated in 0.94

2012-03-09 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13226235#comment-13226235
 ] 

nkeywal commented on HBASE-5551:


We expect that all the functions needed are already in HTable or HBaseAdmin. We 
can detail this in the javadoc of course (but then first in 5399 to ensure that 
all the info is in the trunk for 0.96).

 Some functions should not be used by customer code and must be deprecated in 
 0.94
 -

 Key: HBASE-5551
 URL: https://issues.apache.org/jira/browse/HBASE-5551
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.92.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.94.0

 Attachments: 5551.092.patch


 They are:
 HBaseAdmin#getMaster
 HConnection#getZooKeeperWatcher
 HConnection#getMaster

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-06 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13223089#comment-13223089
 ] 

nkeywal commented on HBASE-5399:


TestRegionRebalancing: seems to be a flaky test. Will retry on Hadoop-QA, but I 
don't reproduce it here.
TestRegionRebalancing: With the 7s sleep (i.e. same sleep as before), I don't 
reproduce it. I will try to understand why this sleep changes the result, but 
anyway it's not a regression. 

So this patch is a good candidate for a commit I think. Further enhancement 
(clusterId, ZK watcher replacement by simple calls) could be put in another 
JIRA.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 
 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 
 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 
 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 
 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-06 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13223657#comment-13223657
 ] 

nkeywal commented on HBASE-5399:


@all; Ok, I will provide an updated patch with the comments taken into account. 
I also understood the issue behind TestZooKeeper. My fix currently breaks other 
unit tests, but if I manage to make it work I will include it.

For 0.94 vs. 0.96 ok as well, however, it would makes sense to deprecate some 
methods in the 0.94, especially getMaster  getZooKeeper in HConnection: people 
will get more time to react this way.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 
 5399.v40.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 
 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 
 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 
 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 
 5399_inprogress.v32.patch, 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-06 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13223664#comment-13223664
 ] 

nkeywal commented on HBASE-5399:


I can do a patch with only the deprecated stuff, or it can be be added in a
raw commit, as you like.

On Tue, Mar 6, 2012 at 10:03 PM, stack (Commented) (JIRA)



 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Fix For: 0.96.0

 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 
 5399.v40.patch, 5399_inprogress.patch, 5399_inprogress.v14.patch, 
 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 
 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 
 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 
 5399_inprogress.v32.patch, 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-05 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13222436#comment-13222436
 ] 

nkeywal commented on HBASE-5399:


org.apache.hadoop.hbase.TestZooKeeper is surprising, because:
- if we add a 7s sleep before in  testMasterSessionExpired(), then it's much 
more difficult to reproduce.
- in RecovableZooKeeper, there is no tests for SESSIONEXPIRED: if it happens 
there is no retry.

So I will tend to think it's an existing issue, even if I need to understand 
how it's supposed to work when there is a session timeout. I tried to add it 
but it does not work.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399.v27.patch, 5399.v38.patch, 5399.v39.patch, 
 5399_inprogress.patch, 5399_inprogress.v14.patch, 5399_inprogress.v16.patch, 
 5399_inprogress.v18.patch, 5399_inprogress.v20.patch, 
 5399_inprogress.v21.patch, 5399_inprogress.v23.patch, 
 5399_inprogress.v3.patch, 5399_inprogress.v32.patch, 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-02 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13220964#comment-13220964
 ] 

nkeywal commented on HBASE-5399:


@stack. A lot of variance, but not that much. So I know I broke something 
somewhere. I fixed a synchronization issue in v23 (plus the points mentioned in 
you review). 

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch, 5399_inprogress.v14.patch, 
 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 
 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 
 5399_inprogress.v23.patch, 5399_inprogress.v3.patch, 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-01 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13220156#comment-13220156
 ] 

nkeywal commented on HBASE-5399:


The patch is still in progress. Submitted to the build to see how it goes in 
hadoop-qa.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch, 5399_inprogress.v14.patch, 
 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 
 5399_inprogress.v3.patch, 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-01 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13220171#comment-13220171
 ] 

nkeywal commented on HBASE-5399:


@stack: done on https://reviews.apache.org/r/3967/

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch, 5399_inprogress.v14.patch, 
 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 
 5399_inprogress.v20.patch, 5399_inprogress.v3.patch, 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-03-01 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13220317#comment-13220317
 ] 

nkeywal commented on HBASE-5399:


fyi, here are the results on 4 runs.

HADOOP-QA 1
org.apache.hadoop.hbase.client.TestAdmin
org.apache.hadoop.hbase.client.TestScannerTimeout
org.apache.hadoop.hbase.replication.TestReplicationPeer
org.apache.hadoop.hbase.util.TestMergeTool
org.apache.hadoop.hbase.mapred.TestTableMapReduce
org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
org.apache.hadoop.hbase.master.TestSplitLogManager

Local 1
org.apache.hadoop.hbase.regionserver.TestAtomicOperation
org.apache.hadoop.hbase.regionserver.wal.TestHLog
org.apache.hadoop.hbase.master.TestDistributedLogSplitting
org.apache.hadoop.hbase.replication.TestReplication
org.apache.hadoop.hbase.util.TestMergeTool
org.apache.hadoop.hbase.client.TestAdmin
org.apache.hadoop.hbase.client.TestScannerTimeout

Local 2
org.apache.hadoop.hbase.regionserver.TestSplitLogWorker
org.apache.hadoop.hbase.client.TestScannerTimeout
org.apache.hadoop.hbase.util.TestMergeTool

Local 3
org.apache.hadoop.hbase.coprocessor.TestClassLoading
org.apache.hadoop.hbase.master.TestSplitLogManager
org.apache.hadoop.hbase.util.TestMergeTool
org.apache.hadoop.hbase.coprocessor.TestAggregateProtocol
org.apache.hadoop.hbase.io.encoding.TestChangingEncoding
org.apache.hadoop.hbase.TestZooKeeper
org.apache.hadoop.hbase.client.TestScannerTimeout
org.apache.hadoop.hbase.replication.TestReplication

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch, 5399_inprogress.v14.patch, 
 5399_inprogress.v16.patch, 5399_inprogress.v18.patch, 
 5399_inprogress.v20.patch, 5399_inprogress.v21.patch, 
 5399_inprogress.v3.patch, 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 

[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-02-29 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13219681#comment-13219681
 ] 

nkeywal commented on HBASE-5399:


v16 (still in progress, some unit tests fail, indent  comments to redo  so 
on), after a discussion with Stack.

HConnection is a connection to the cluster.
However, the fact that the cluster is composed of master, zookeeper and region 
servers should be hidden from the client: some functions need a master some 
others not. This is not the client problem. Especially, these functions can 
move (from example, getting a table descriptor, today a master function, could 
become a region server function). So the getMaster  getZookeeperWatcher, 
shared or not, should not be in HConnection interface.

Client functions are today split in two classes HBaseAdmin  HConnection (with 
stuff like processBatch, listTables, getHTableDescriptor). It could make sense 
to split HConnection further, but anyway we already have two classes using 
master, and the master connection should remain shared between these two 
classes. This should be handled by the HConnection as it is its core 
responsibility and it's as well much simpler technically. So we need to have 
package visible function to get them for HBaseAdmin. I prefer to have them in 
HConnectionImplementation only, even it it implies a cast in HBaseAdmin, as 
this makes HConnection clean from a client point of view.

We stick to the keep alive mechanism with the Closeable, and hence a dynamic 
proxy for master and a subclass for ZooKeeperMaster.

Note that if master  zookeeper are implementation details, the same should 
apply to HRegionInterface (HConnection#getHRegionConnection). But there are 
dependencies from the master package so it can not be included in this jira. 
The keep alive mechanism could be applied to HRegionInterface as well.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch, 5399_inprogress.v14.patch, 
 5399_inprogress.v16.patch, 5399_inprogress.v3.patch, 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin 

[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-02-28 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13218257#comment-13218257
 ] 

nkeywal commented on HBASE-5399:


If we really want it, I found 3 options, and tried 2.

1) Adding 'close' to the HMasterInterface
After looking at it, I don't think it's a good option: HMasterInterface is an 
interface shared between the client  the server. So adding a close function to 
it would mean the server must implement it, while it's a client side function 
only. I believe that's the reason why there is already a function 'stopProxy' 
in the RPCENgine instead of a close function. 

2) Adding the possibility to have a delayed close in RPCENgine
Instead of doing it for HMasterInterface in Connection only, we could do it all 
proxies and code this in RPCENgine.
There is already a reference counting in the hbase RPCENgine. So we could add 
here some code to allow a delayed close. I don't see why it would not be 
possible, all the code seems to be in the hbase package (and not hadoop). This 
would require smart convention to make it configurable on a per proxy basis, 
but it should work.

3) Add an class with a delegation 
So I've got this
{noformat}
public interface SharedMaster extends HMasterInterface, Closeable {}
{noformat}

With this in HConnection

{noformat}
public interface HConnection extends Abortable, Closeable {
  public SharedMaster  getSharedMaster()
}
{noformat}

Then the client writes
{noformat}
SharedMaster  master = connection.getSharedMaster();
try {
   master.move(encodedRegionName, destServerName);
} finally {
   master.close();
}
{noformat}

With a java proxy to manage the delegation for us:
{noformat}
private static class SharedMasterHandler implements InvocationHandler {
   private HConnectionImplementation connection;
   private HMasterInterface master;
  SharedMasterHandler(HConnectionImplementation connection, 
HMasterInterface master){
this.connection = connection;
this.master = master;
  }

  @Override
  public Object invoke(Object proxy, Method method, Object[] args) throws 
Throwable {
if (method.getName().equals(close)){
  connection.releaseSharedMaster((HMasterInterface)master);
  return null;
} else {
  return method.invoke(master, args);
}
  }
}
{noformat}
releaseSharedMaster is private in this solution.

It was not really my first idea, but it's a reasonable way to get to the 
objective. The reflective delegation is not fast, obviously, but it doesn't 
matter here as there is much more expensive remote call just after...


I'am currently testing it, it seems to work.  

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch, 5399_inprogress.v3.patch, 
 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a 

[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-02-27 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217177#comment-13217177
 ] 

nkeywal commented on HBASE-5399:


bq. Do we have to have the getSharedZookeeperWatcher and 
releaseSharedZookeeperWatcher and getSharedMaster, etc., in the HConnection 
API? Are these not implementation details? (Or would it be too hard to undo 
them – you'd have not way of counting zk and master connections?)

Yes, we need the user to explicitly release the connection, as we can't hide 
that we managing the object life cycle. I would prefer to use close for this, 
but I didn't find an easy way to extend the master proxy to make it closeable. 
So I preferred to use release for both objects to make it consistent.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch, 5399_inprogress.v3.patch, 
 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5075) regionserver crashed and failover

2012-02-27 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217183#comment-13217183
 ] 

nkeywal commented on HBASE-5075:


In the case that you want to manage (region server crash without any hardware 
issue, i.e. a pure application bug), a possible solution as well is loop in the 
launch script. This typically allows 20s failover (stop time + start time), and 
is even compatible with hotfailover. It's faster and less error prone than 
monitoring pid. But for HBase it would be a new start mode. It could make sense 
if we observe much more application bugs than hw issues (this solution is quite 
common with C/C++ stuff as it's easy to crash a process with these languages...)

 regionserver crashed and failover
 -

 Key: HBASE-5075
 URL: https://issues.apache.org/jira/browse/HBASE-5075
 Project: HBase
  Issue Type: Improvement
  Components: monitoring, regionserver, replication, zookeeper
Affects Versions: 0.92.1
Reporter: zhiyuan.dai
 Fix For: 0.90.5

 Attachments: Degion of Failure Detection.pdf, HBase-5075-shell.patch, 
 HBase-5075-src.patch


 regionserver crashed,it is too long time to notify hmaster.when hmaster know 
 regionserver's shutdown,it is long time to fetch the hlog's lease.
 hbase is a online db, availability is very important.
 i have a idea to improve availability, monitor node to check regionserver's 
 pid.if this pid not exsits,i think the rs down,i will delete the znode,and 
 force close the hlog file.
 so the period maybe 100ms.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4955) Use the official versions of surefire junit

2012-02-27 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217200#comment-13217200
 ] 

nkeywal commented on HBASE-4955:


Small update:
- No news from JUnit team. Issues was raised in november and they classified it 
as a bug, but my one line fix is not yet integrated. The main committers don't 
have a lot of time to spend on the project...
- Surefire 2.12 is available, but there is a regression (SUREFIRE-827) that 
they are currently fixing, so it seems better to wait for the 2.13.

 Use the official versions of surefire  junit
 -

 Key: HBASE-4955
 URL: https://issues.apache.org/jira/browse/HBASE-4955
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor

 We currently use private versions for Surefire  JUnit since HBASE-4763.
 This JIRA traks what we need to move to official versions.
 Surefire 2.11 is just out, but, after some tests, it does not contain all 
 what we need.
 JUnit. Could be for JUnit 4.11. Issue to monitor:
 https://github.com/KentBeck/junit/issues/359: fixed in our version, no 
 feedback for an integration on trunk
 Surefire: Could be for Surefire 2.12. Issues to monitor are:
 329 (category support): fixed, we use the official implementation from the 
 trunk
 786 (@Category with forkMode=always): fixed, we use the official 
 implementation from the trunk
 791 (incorrect elapsed time on test failure): fixed, we use the official 
 implementation from the trunk
 793 (incorrect time in the XML report): Not fixed (reopen) on trunk, fixed on 
 our version.
 760 (does not take into account the test method): fixed in trunk, not fixed 
 in our version
 798 (print immediately the test class name): not fixed in trunk, not fixed in 
 our version
 799 (Allow test parallelization when forkMode=always): not fixed in trunk, 
 not fixed in our version
 800 (redirectTestOutputToFile not taken into account): not yet fix on trunk, 
 fixed on our version
 800  793 are the more important to monitor, it's the only ones that are 
 fixed in our version but not on trunk.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-02-27 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217305#comment-13217305
 ] 

nkeywal commented on HBASE-5399:


yes :-)
The user would code something like:
{noformat}
HMasterInterface master = connection.getSharedMaster();
try {
  master.move(encodedRegionName, destServerName);
} finally {
  master.close();
}
{noformat}

HMasterInterface is an interface, with a proxy sending the calls to the master 
server.
I would need to add a close that would not be a remote call but would 
decrement a reference counter.

I could add another proxy object, but it's not very clean (it should work 
however, may be it's an option).
If I don't want to do that, I need to add the method in the object returned by 
getProxy. You think it makes sense?






 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch, 5399_inprogress.v3.patch, 
 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-02-19 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13211422#comment-13211422
 ] 

nkeywal commented on HBASE-5399:


In v9 I added some comments and fixed some issues. 

bq. I'd say lets not check master is there till we need it. Seems like a PITA 
going ahead and checking master on construction. This changes the behavior but 
I think its one thats ok to change.

Agreed, removed in v9.

bq. You are doing your own Callables. You've seen the Callables that go on in 
HTable. Any reason you avoid them? I suppose this is different in that you want 
to let go of the shared master. Looks fine.

I could also add the master management in the HTable callables. I would still 
need the one I wrote for HBaseAdmin, but it could be useful to be able to use 
master from the HTable callables?

I still need to work on the unit tests, some large tests are failing today:
{noformat}
Failed tests:   testBackgroundEvictionThread[1](org.apache.hadoop.hbase.io.hfile
.TestLruBlockCache)
  testShutdownFixupWhenDaughterHasSplit(org.apache.hadoop.hbase.regionserver.Tes
tSplitTransactionOnCluster): expected:1 but was:0
  queueFailover(org.apache.hadoop.hbase.replication.TestReplication): Waited too
 much time for queueFailover replication
  testMergeTool(org.apache.hadoop.hbase.util.TestMergeTool): 'merging regions 0
and 1' failed with errCode -1

Tests in error:
  test3686a(org.apache.hadoop.hbase.client.TestScannerTimeout): 64142ms passed s
ince the last invocation, timeout is currently set to 1
{noformat}

For some of them it could be random and unrelated, but I am sure there are some 
real issues for of them.
I am on vacations next week; so I will come back to it the week after...



 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch, 5399_inprogress.v3.patch, 
 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.

[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-02-19 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13211435#comment-13211435
 ] 

nkeywal commented on HBASE-5399:


Moreover, v9 patch work on trunk as of today.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch, 5399_inprogress.v3.patch, 
 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-02-19 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13211703#comment-13211703
 ] 

nkeywal commented on HBASE-5399:


@Lars and all: done see https://reviews.apache.org/r/3967/
I sent it to the whole hbase group, I hope it's the right thing to do.

I'm on vacation this week so I will see the comments after the 27th... 


 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch, 5399_inprogress.v3.patch, 
 5399_inprogress.v9.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-02-17 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13210561#comment-13210561
 ] 

nkeywal commented on HBASE-5399:


It's not the last version (it needs more comments, unit tests and likely bug 
fixes), but there is already a lot. Master  ZooKeeper connection are now 
created only when necessary, and are closed if not used for 5 minutes.
I added the keep alive stuff. It's not a nice to have; without it the unit 
tests take twice more time.
There is an issue with the masterCheck part, the previous behavior was strange. 
I need to review it in details.
The patch is on monday trunk. I will make it compatible on current trunk this 
week-end.
I will move isTableEnabled  so on in an other patch, this one is already too 
big...

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch, 5399_inprogress.v3.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-02-16 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13209565#comment-13209565
 ] 

nkeywal commented on HBASE-5399:


After some thinking and different tries; I believe it makes sense to have a 
shared master connection and a shared zookeeper connection with a reference 
counter. The creation/release/close function would be in HConnection. We could 
add the refCounter in ZooKeeperWatcher, but it's more difficult to add in the 
proxy (I could do it however).

Ideally, there would be a keep alive, i.e. we would not close the physical 
connection after when the counter becomes zero, but wait a minute or so. I'am 
just not sure on how to distinguish cleanly a simple release to a jvm shutdown.

Patch is underway, I think the complete version for review will be ready 
tomorrow.

However, I have not found a new place for the table descriptor part, today it's 
still in the connection.

 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5399) Cut the link between the client and the zookeeper ensemble

2012-02-14 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13208051#comment-13208051
 ] 

nkeywal commented on HBASE-5399:


bq. Why are these deprecated: getHRegionConnection(HServerAddress regionServer)
It's because HServerAddress is deprecated. I will check that all the 
redirection are there.

bq. Do we get the clusterid on connection setup? Do we have to? Can we just get 
that when someone asks for it?
Today yes, but we also manage the case when there is no id.

bq. Should we be using straight ZooKeeper instead of ZooKeeperWatcher? We don't 
need watch facility?
We can do that, it requires some work to do it well (ZooKeeperWatcher has 
spread too much, and has a lot of responsibilities (for example, it's the owner 
of the znode names, created from the config parameters)). It would be much 
cleaner, and a little bit faster. We would still pay for the tcp connection 
however.

bq. isTableEnabled, etc., should they be deprecated, moved out of HConnnection
I was thinking putting them in HBaseAdmin, does it makes sense?



 Cut the link between the client and the zookeeper ensemble
 --

 Key: HBASE-5399
 URL: https://issues.apache.org/jira/browse/HBASE-5399
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5399_inprogress.patch


 The link is often considered as an issue, for various reasons. One of them 
 being that there is a limit on the number of connection that ZK can manage. 
 Stack was suggesting as well to remove the link to master from HConnection.
 There are choices to be made considering the existing API (that we don't want 
 to break).
 The first patches I will submit on hadoop-qa should not be committed: they 
 are here to show the progress on the direction taken.
 ZooKeeper is used for:
 - public getter, to let the client do whatever he wants, and close ZooKeeper 
 when closing the connection = we have to deprecate this but keep it.
 - read get master address to create a master = now done with a temporary 
 zookeeper connection
 - read root location = now done with a temporary zookeeper connection, but 
 questionable. Used in public function locateRegion. To be reworked.
 - read cluster id = now done once with a temporary zookeeper connection.
 - check if base done is available = now done once with a zookeeper 
 connection given as a parameter
 - isTableDisabled/isTableAvailable = public functions, now done with a 
 temporary zookeeper connection.
  - Called internally from HBaseAdmin and HTable
 - getCurrentNrHRS(): public function to get the number of region servers and 
 create a pool of thread = now done with a temporary zookeeper connection
 -
 Master is used for:
 - getMaster public getter, as for ZooKeeper = we have to deprecate this but 
 keep it.
 - isMasterRunning(): public function, used internally by HMerge  HBaseAdmin
 - getHTableDescriptor*: public functions offering access to the master.  = 
 we could make them using a temporary master connection as well.
 Main points are:
 - hbase class for ZooKeeper; ZooKeeperWatcher is really designed for a 
 strongly coupled architecture ;-). This can be changed, but requires a lot of 
 modifications in these classes (likely adding a class in the middle of the 
 hierarchy, something like that). Anyway, non connected client will always be 
 really slower, because it's a tcp connection, and establishing a tcp 
 connection is slow.
 - having a link between ZK and all the client seems to make sense for some 
 Use Cases. However, it won't scale if a TCP connection is required for every 
 client
 - if we move the table descriptor part away from the client, we need to find 
 a new place for it.
 - we will have the same issue if HBaseAdmin (for both ZK  Master), may be we 
 can put a timeout on the connection. That would make the whole system less 
 deterministic however.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk (The directory is already locked.)

2012-01-10 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13183655#comment-13183655
 ] 

nkeywal commented on HBASE-5163:


I've got a fix using a variation of #2. Tested 100 times without any failure.
The advantage of #1 for me is that it eliminates a quite tricky behavior, but 
the fix would then be outside HBase...


 TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or 
 hadoop QA on trunk (The directory is already locked.)
 -

 Key: HBASE-5163
 URL: https://issues.apache.org/jira/browse/HBASE-5163
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor

 The stack is typically:
 {noformat}
 error message=Cannot lock storage 
 /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
  The directory is already locked. 
 type=java.io.IOExceptionjava.io.IOException: Cannot lock storage 
 /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
  The directory is already locked.
   at 
 org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
   at 
 org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.lt;initgt;(DataNode.java:290)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
   at 
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
   at 
 org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
   at 
 org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
 // ...
 {noformat}
 It can be reproduced without parallelization or without executing the other 
 tests in the class. It seems to fail about 5% of the time.
 This comes from the naming policy for the directories in 
 MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* 
 in the cluster, and does not take into account previous starts/stops:
 {noformat}
for (int i = curDatanodesNum; i  curDatanodesNum+numDataNodes; i++) {
   if (manageDfsDirs) {
 File dir1 = new File(data_dir, data+(2*i+1));
 File dir2 = new File(data_dir, data+(2*i+2));
 dir1.mkdirs();
 dir2.mkdirs();
   // [...]
 {noformat}
 This means that it if we want to stop/start a datanode, we should always stop 
 the last one, if not the names will conflict. This test exhibits the behavior:
 {noformat}
   @Test
   public void testMiniDFSCluster_startDataNode() throws Exception {
 assertTrue( dfsCluster.getDataNodes().size() == 2 );
 // Works, as we kill the last datanode, we can now start a datanode
 dfsCluster.stopDataNode(1);
 dfsCluster
   .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
 // Fails, as it's not the last datanode, the directory will conflict on
 //  creation
 dfsCluster.stopDataNode(0);
 try {
   dfsCluster
 .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
   fail(There should be an exception because the directory already 
 exists);
 } catch (IOException e) {
   assertTrue( e.getMessage().contains(The directory is already 
 locked.));
   LOG.info(Expected (!) exception caught  + e.getMessage());
 }
 // Works, as we kill the last datanode, we can now restart 2 datanodes
 // This makes us back with 2 nodes
 dfsCluster.stopDataNode(0);
 dfsCluster
   .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
   }
 {noformat}
 And then this behavior is randomly triggered in testLogRollOnDatanodeDeath 
 because when we do
 {noformat}
 DatanodeInfo[] pipeline = getPipeline(log);
 assertTrue(pipeline.length == fs.getDefaultReplication());
 {noformat}
 and then kill the datanodes in the pipeline, we will have:
  - most of the time: pipeline = 1  2, so after killing 12 we can start a 
 new datanode that will reuse the available 2's directory.
  - sometimes: pipeline = 1  3. In this case,when we try to launch the 

[jira] [Commented] (HBASE-5064) utilize surefire tests parallelization

2012-01-01 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13178161#comment-13178161
 ] 

nkeywal commented on HBASE-5064:


bq. Anyway we need to find out why TestMetaReaderEditor wasn't executed.

Likely because it belongs to the too long list of flaky tests. A test should 
not fail randomly. Here is the list, from 2974 to 2598, with a timeouted 
TestMetaReaderEditor 
https://builds.apache.org/job/HBase-TRUNK/2597/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/HBase-TRUNK/2588/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/HBase-TRUNK/2585/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/HBase-TRUNK/2584/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/HBase-TRUNK/2582/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/HBase-TRUNK/2577/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/HBase-TRUNK/2574/testReport/org.apache.hadoop.hbase.catalog/

It happens on hadoop-qa as well:
https://builds.apache.org/job/PreCommit-HBASE-Build/640/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/PreCommit-HBASE-Build/638/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/PreCommit-HBASE-Build/630/testReport/org.apache.hadoop.hbase.catalog/
https://builds.apache.org/job/PreCommit-HBASE-Build/622/testReport/org.apache.hadoop.hbase.catalog/

That's around 25% of the time for trunk. I can't get back before 2574. I will 
spend some time on it to see if there is anything simple  to do(you never 
know). If I find something I will create a separate jira.




 utilize surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 
 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 
 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 
 5064.v18.patch, 5064.v19.patch, 5064.v19.patch, 5064.v19.patch, 
 5064.v2.patch, 5064.v20.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 
 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 
 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 
 5064.v8.patch, 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) utilize surefire tests parallelization

2012-01-01 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13178183#comment-13178183
 ] 

nkeywal commented on HBASE-5064:


I found some issues in TestMetaReaderEditor, explaining at least why it 
timeouts instead of failing (I don't know what it fails sometimes, still 
looking).

For 2588, when the process timeouts (i.e. forkedProcessTimeoutInSeconds is 
reached), there is no entry in the jenkins report nor in the console. The only 
way to find out is that the test is missing, and you can see that 
TestMetaReaderEditor is actually missing in the list...

 utilize surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 
 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 
 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 
 5064.v18.patch, 5064.v19.patch, 5064.v19.patch, 5064.v19.patch, 
 5064.v2.patch, 5064.v20.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 
 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 
 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 
 5064.v8.patch, 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) utilize surefire tests parallelization

2011-12-31 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13178042#comment-13178042
 ] 

nkeywal commented on HBASE-5064:


The exact same test already timeouted in trunk in
https://builds.apache.org/job/HBase-TRUNK/2588/testReport/
I don't think it's related to this patch.

On Sat, Dec 31, 2011 at 4:28 PM, Zhihong Yu (Commented) (JIRA) 



 utilize surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 
 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 
 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 
 5064.v18.patch, 5064.v19.patch, 5064.v19.patch, 5064.v19.patch, 
 5064.v2.patch, 5064.v20.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 
 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 
 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 
 5064.v8.patch, 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-30 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177637#comment-13177637
 ] 

nkeywal commented on HBASE-5064:


the v20 is ok for commit imho.

There are two processes by default, and 4 on hadoop-qa. It possible to change 
the number of processes used by specifying 
-Dsurefire.secondPartThreadCount=WhatYouWant on mvn command line.  Using 
-Dsurefire.secondPartThreadCount=1 means no parallelization.


 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 
 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 
 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 
 5064.v18.patch, 5064.v19.patch, 5064.v19.patch, 5064.v19.patch, 
 5064.v2.patch, 5064.v20.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 
 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 
 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 
 5064.v8.patch, 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4955) Use the official versions of surefire junit

2011-12-30 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177644#comment-13177644
 ] 

nkeywal commented on HBASE-4955:


We're now using 2.12-TRUNK-HBASE-2.

It's a private version, built on the 2.12 trunk (i.e.: it does not contain 
eveyrthing that will be in 2.12 final).

Surefire: Could be for Surefire 2.12. Issues to monitor are:
329 (category support): fixed, we use the official implementation from the trunk
773 (forked processes not killed after timeout): not fixed in trunk, not fixed 
in our version
786 (@Category with forkMode=always): fixed, we use the official implementation 
from the trunk
791 (incorrect elapsed time on test failure): fixed, we use the official 
implementation from the trunk
793 (incorrect time in the XML report): Not fixed (reopen) in trunk, partial 
fixed in our version.
760 (does not take into account the test method): fixed, we use the official 
implementation from the trunk
798 (print immediately the test class name): not fixed in trunk, not fixed in 
our version
799 (Allow test parallelization when forkMode=always): fixed in trunk, fixed in 
our version with some minimal differences.
800 (redirectTestOutputToFile not taken into account): not yet fix on trunk, 
fixed in our version
806 (Ignore selection criteria when -Dtest= is specified):  not fixed in trunk, 
not fixed in our version
813 (Randomly wrong tests count and empty summary files): fixed in trunk, fixed 
in our version 

800  793 are the more important to monitor, it's the only ones that are fixed 
in our version but not on trunk.

 Use the official versions of surefire  junit
 -

 Key: HBASE-4955
 URL: https://issues.apache.org/jira/browse/HBASE-4955
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
 Environment: all
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor

 We currently use private versions for Surefire  JUnit since HBASE-4763.
 This JIRA traks what we need to move to official versions.
 Surefire 2.11 is just out, but, after some tests, it does not contain all 
 what we need.
 JUnit. Could be for JUnit 4.11. Issue to monitor:
 https://github.com/KentBeck/junit/issues/359: fixed in our version, no 
 feedback for an integration on trunk
 Surefire: Could be for Surefire 2.12. Issues to monitor are:
 329 (category support): fixed, we use the official implementation from the 
 trunk
 786 (@Category with forkMode=always): fixed, we use the official 
 implementation from the trunk
 791 (incorrect elapsed time on test failure): fixed, we use the official 
 implementation from the trunk
 793 (incorrect time in the XML report): Not fixed (reopen) on trunk, fixed on 
 our version.
 760 (does not take into account the test method): fixed in trunk, not fixed 
 in our version
 798 (print immediately the test class name): not fixed in trunk, not fixed in 
 our version
 799 (Allow test parallelization when forkMode=always): not fixed in trunk, 
 not fixed in our version
 800 (redirectTestOutputToFile not taken into account): not yet fix on trunk, 
 fixed on our version
 800  793 are the more important to monitor, it's the only ones that are 
 fixed in our version but not on trunk.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-29 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177274#comment-13177274
 ] 

nkeywal commented on HBASE-5064:


@ted: The 3 last tests seems ok on hadoop-qa, even if I'am sure that we will 
discover some issues. A 4th test is in progress. I will do a 5th also. Anything 
meaningful in the logs for TestMiniClusterLoadParallel?

Note that the v18 patch is with 4 tests in //, I believe we will need to commit 
it with 2 or 3 to be sure that a 1 year old laptop can run the tests smoothly. 
I use 4 here to maximize // and minimize test time. We can set a parameter to 
the hadoop-qa build to use 4 or more if we want to...

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 
 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 
 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 
 5064.v18.patch, 5064.v19.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 
 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 
 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 
 5064.v8.patch, 5064.v8.patch, 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-29 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177285#comment-13177285
 ] 

nkeywal commented on HBASE-5064:


You want 2 process in // only for hadoop-qa? It seems that the machine never 
runs two builds in //, so we should maximize the number of process. That makes 
the build last ~35 minutes instead of 2 hours.


 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 
 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 
 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 
 5064.v18.patch, 5064.v19.patch, 5064.v19.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 
 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 
 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-29 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13177388#comment-13177388
 ] 

nkeywal commented on HBASE-5064:


From the last 5 execution, it seems we have NumberFormatException + some 
random failure that I would tend to attribute to the usual flakiness (I may be 
wrong :-)). The FileNotFound exception was not reproduced.

@ted: from the time taken by TestAdmin (twice the time on hadoop-qa, and very 
close to the test timeout of 900 seconds), I think that the machine is heavily 
loaded, or lacks some memory and swaps. That makes me think that we should 
default the // degree to 2 (and keep 4 on hadoop-qa).

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 
 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 
 5064.v15.patch, 5064.v16.patch, 5064.v17.patch, 5064.v18.patch, 
 5064.v18.patch, 5064.v19.patch, 5064.v19.patch, 5064.v19.patch, 
 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 
 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 
 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 
 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-28 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176658#comment-13176658
 ] 

nkeywal commented on HBASE-5064:


once again a java.io.FileNotFoundException: 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/../logs/userlogs/job_20111228130003142_0001/attempt_20111228130003142_0001_m_01_2/log.index
 (No such file or directory)

Will have to understand this. 

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 
 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 
 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 
 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 
 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v8.patch, 
 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-28 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176830#comment-13176830
 ] 

nkeywal commented on HBASE-5064:


Yep, may be the NumberFormatException was hiding the FileNotFoundException 
before (it's possible as well to have both error simultaneously). I haven't 
tried v16 on the whole suite test locally, but it seems to work on my subset. 
Let's see.

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 
 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 
 5064.v15.patch, 5064.v16.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 
 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 
 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 
 5064.v8.patch, 5064.v8.patch, 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-28 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13176860#comment-13176860
 ] 

nkeywal commented on HBASE-5064:


fail: java.io.FileNotFoundException: 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/target/test-data/238434d5-7c45-413f-ba99-1440a560e1f1/hadoop-log-dir/userlogs/job_20111228220208254_0001/attempt_20111228220208254_0001_m_02_0/log.index
 (No such file or directory)

enough for today

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 
 5064.v12.patch, 5064.v13.patch, 5064.v14.patch, 5064.v14.patch, 
 5064.v15.patch, 5064.v16.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 
 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 
 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 
 5064.v8.patch, 5064.v8.patch, 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-24 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175718#comment-13175718
 ] 

nkeywal commented on HBASE-5064:


Below a review of the tests on trunk  prebuild.

We have often, nearly always, hanging tests in hadoop-qa, while it's occurs 
only sometimes on main build.
It seems that it hangs more often when // is activated
It seems that we have this kind of behavior when we cannot start a thread.

I will look if we can enhance the reporting in surefire (i.e. it seems some 
errors are not reported today)



#2572
Tests run: 789, Failures: 0, Errors: 0, Skipped: 9

#2570
Tests run: 789, Failures: 0, Errors: 0, Skipped: 9

#2569
Tests run: 789, Failures: 0, Errors: 0, Skipped: 9

#2568
Tests run: 789, Failures: 0, Errors: 0, Skipped: 9

#2566
Tests run: 789, Failures: 0, Errors: 0, Skipped: 9

#2565 (marked as success, but there is a hanging test)
wal.TestLogRollAbort

#596 5064 1 processesn min number of threads
Tests run: 786, Failures: 3, Errors: 1, Skipped: 9

#595 5064.v7.patch - 3 processesn min number of threads
Tests run: 785, Failures: 3, Errors: 1, Skipped: 9
hbck.TestOfflineMetaRebuildOverlap
master.TestMasterRestartAfterDisablingTable
replication.TestMasterReplication

#594 5064.v7.patch - 3 processesn min number of threads
Tests run: 786, Failures: 3, Errors: 1, Skipped: 9
util.TestMiniClusterLoadParallel
coprocessor.TestMasterCoprocessorExceptionWithRemove

#593
Tests run: 779, Failures: 4, Errors: 2, Skipped: 9
util.TestMiniClusterLoadSequential

#592 - Data-block-encoding-2011-12-23.patch - fails, but with a clear message
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 32756 bytes for 
ChunkPool::allocate
# An error report file with more information is saved as:
# 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/hs_err_pid20773.log
Tests run: 739, Failures: 3, Errors: 1, Skipped: 9

#591 - scannerMVCC1.txt 
Tests run: 780, Failures: 3, Errors: 1, Skipped: 9
master.TestDistributedLogSplitting

#589 - 5064.v12.patch, 2 threads
Tests run: 783, Failures: 3, Errors: 1, Skipped: 9
coprocessor.TestMasterCoprocessorExceptionWithRemove
hbase.TestRegionRebalancing
master.TestDistributedLogSplitting

#588 - 5064 4 threads
Tests run: 768, Failures: 3, Errors: 1, Skipped: 9
util.hbck.TestOfflineMetaRebuildBase
catalog.TestMetaReaderEditor
coprocessor.TestMasterCoprocessorExceptionWithRemove
replication.regionserver.TestReplicationSink
replication.TestMasterReplication

#587 - 5064 (7 processes, standard config)
replication.TestMasterReplication
master.TestMasterFailover
master.TestRollingRestart

#586  - 5064  (6 processes, min number of threads)
Tests run: 775, Failures: 3, Errors: 1, Skipped: 9
master.TestRestartCluster
regionserver.wal.TestLogRollAbort
client.TestMetaScanner
TestGlobalMemStoreSize
client.TestMultiParallel

#585   - 5064 (3 processes, min number of threads)
Tests run: 788, Failures: 4, Errors: 1, Skipped: 9

#583 - HBASE-4720.trunk.v2.patch
Tests run: 788, Failures: 3, Errors: 1, Skipped: 9
replication.TestMasterReplication

#582 - Delta-encoding.patch-2011-12-22_11_52_07.patch
unable to create new native thread
replication.TestReplication: 
Tests run: 2, Failures: 0, Errors: 1, Skipped: 0

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 
 5064.v12.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 
 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 
 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 
 5064.v8.patch, 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5085) fix test-patch script from setting the ulimit

2011-12-23 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175374#comment-13175374
 ] 

nkeywal commented on HBASE-5085:


On prebuild #584, we have:
open files  (-n) 6
max user processes  (-u) 10240
This should be ok, trying immediately.

 fix test-patch script from setting the ulimit
 -

 Key: HBASE-5085
 URL: https://issues.apache.org/jira/browse/HBASE-5085
 Project: HBase
  Issue Type: Bug
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Fix For: 0.94.0

 Attachments: 5085-v2-experiment.txt, 5085-v2-experiment.txt, 
 5085-v3-experiment.txt, 5085-v3-experiment.txt, 5085-v4-experiment.txt, 
 5085-v5.txt, hbase-5085.patch


 test-patch.sh script sets the ulimit -n 1024 just after triggering the patch 
 setting this overrides the underlying systems ulimit and hence failing the 
 hbase tests.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-23 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175463#comment-13175463
 ] 

nkeywal commented on HBASE-5064:


#585 (3 processes, min number of threads)
Total time: 40:46.855s
Tests run: 788, Failures: 4, Errors: 1, Skipped: 9

Invalid result expected:134 but was:190
TestCoprocessorEndpoint.testAggregation

NumberFormatException
TestTableMapReduce.testMultiRegionTable
TestHFileOutputFormat.testMRIncrementalLoad
TestHFileOutputFormat.testMRIncrementalLoadWithSplit
TestHFileOutputFormat.testExcludeMinorCompaction

Hung
None

Parallelization went well, nearly 100% linearility, as we're 3 times faster 
with 3 processes.
TestCoprocessorEndpoint.testAggregation failed as well in prebuild #584, so it 
could be unrelated to the //.

Let's try with 6 processes.


 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 
 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-23 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175486#comment-13175486
 ] 

nkeywal commented on HBASE-5064:


#586 (6 processes, min number of threads)
Total time: 22:54.145s
Tests run: 775, Failures: 3, Errors: 1, Skipped: 9

NumberFormatException
TestTableMapReduce.testMultiRegionTable
TestHFileOutputFormat.testMRIncrementalLoad
TestHFileOutputFormat.testMRIncrementalLoadWithSplit
TestHFileOutputFormat.testExcludeMinorCompaction

Hung
master.TestRestartCluster
regionserver.wal.TestLogRollAbort
client.TestMetaScanner
TestGlobalMemStoreSize
client.TestMultiParallel

Still nearly 100% linearility (5 times faster with 6 processes)
A lot of tests hung.
TestCoprocessorEndpoint.testAggregation didn't fail this time.

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v2.patch, 
 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 
 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 5064.v8.patch, 
 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-23 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175496#comment-13175496
 ] 

nkeywal commented on HBASE-5064:


#587 (7 processes, standard config)
Total time: 31:58.502s
Tests run: 782, Failures: 3, Errors: 1, Skipped: 9

NumberFormatException
TestTableMapReduce.testMultiRegionTable
TestHFileOutputFormat.testMRIncrementalLoad
TestHFileOutputFormat.testMRIncrementalLoadWithSplit
TestHFileOutputFormat.testExcludeMinorCompaction

Hung
replication.TestMasterReplication
master.TestMasterFailover
master.TestRollingRestart

As usual nearly 100% linearility
3 tests hung.
TestCoprocessorEndpoint.testAggregation didn't fail this time.



 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 
 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 
 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 5064.v7.patch, 
 5064.v8.patch, 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-23 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13175517#comment-13175517
 ] 

nkeywal commented on HBASE-5064:


#588
Total time: 39:50.795s
Tests run: 768, Failures: 3, Errors: 1, Skipped: 9

Hanging test: Running 
org.apache.hadoop.hbase.util.hbck.TestOfflineMetaRebuildBase
Hanging test: Running org.apache.hadoop.hbase.catalog.TestMetaReaderEditor
Hanging test: Running 
org.apache.hadoop.hbase.coprocessor.TestMasterCoprocessorExceptionWithRemove
Hanging test: Running 
org.apache.hadoop.hbase.replication.regionserver.TestReplicationSink
Hanging test: Running org.apache.hadoop.hbase.replication.TestMasterReplication

Gonna try with only 2 threads.

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v10.patch, 5064.v11.patch, 
 5064.v12.patch, 5064.v2.patch, 5064.v3.patch, 5064.v4.patch, 5064.v5.patch, 
 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 5064.v7.patch, 
 5064.v7.patch, 5064.v8.patch, 5064.v9.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-21 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13173984#comment-13173984
 ] 

nkeywal commented on HBASE-5064:


#554 (No //, min number of threads)
Total time: 1:55:37.351s
Tests run: 781, Failures: 3, Errors: 2, Skipped: 9

Too many open files
TestInstantSchemaChange.testInstantSchemaJanitor

NumberFormatException
TestTableMapReduce.testMultiRegionTable
TestHFileOutputFormat.testMRIncrementalLoad
TestHFileOutputFormat.testMRIncrementalLoadWithSplit
TestHFileOutputFormat.testExcludeMinorCompaction

Hung
TestReplication

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 
 5064.v7.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-21 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174089#comment-13174089
 ] 

nkeywal commented on HBASE-5064:


#565 (No //, min number of threads)
Total time: 1:59:24.242s
Tests run: 788, Failures: 3, Errors: 3, Skipped: 9

Too many open files
TestInstantSchemaChange.testInstantSchemaJanitor

NumberFormatException
TestTableMapReduce.testMultiRegionTable
TestHFileOutputFormat.testMRIncrementalLoad
TestHFileOutputFormat.testMRIncrementalLoadWithSplit
TestHFileOutputFormat.testExcludeMinorCompaction

Hung
None it seems, and at it least not TestReplication

The directory is already locked.
TestLogRolling.testLogRollOnDatanodeDeath


 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 
 5064.v6.patch, 5064.v7.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-21 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174207#comment-13174207
 ] 

nkeywal commented on HBASE-5064:


#566 (No //, min number of threads)
Total time: 1:56:55.198s
Tests run: 789, Failures: 3, Errors: 2, Skipped: 9

Too many open files
TestInstantSchemaChange.testInstantSchemaJanitor

NumberFormatException
TestTableMapReduce.testMultiRegionTable
TestHFileOutputFormat.testMRIncrementalLoad
TestHFileOutputFormat.testMRIncrementalLoadWithSplit
TestHFileOutputFormat.testExcludeMinorCompaction

Hung
None

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 
 5064.v6.patch, 5064.v7.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-21 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13174277#comment-13174277
 ] 

nkeywal commented on HBASE-5064:


No, we need the hadoopqa setting increased before activating //.
Here, I am launching multiple tests to identify most of the random issues
we can have, this will help to distinguish what is random vs. what is due
to //.
However, I can split this patch in smaller ones:
1) one to activate the lastest version of surefire
2) one for the minimal thread settings
3) one for the // itself.

From all the tests done so far, 1  2 seem ok today.

On Wed, Dec 21, 2011 at 6:49 PM, Zhihong Yu (Commented) (JIRA) 



 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v6.patch, 5064.v6.patch, 
 5064.v6.patch, 5064.v7.patch, 5064.v8.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-19 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172086#comment-13172086
 ] 

nkeywal commented on HBASE-5064:


Tests run: 780, Failures: 4, Errors: 2, Skipped: 9
Total time: 1:05:24.474s

To be compared with prebuild #535:
Tests run: 785, Failures: 0, Errors: 3, Skipped: 9
Total time: 1:49:04.397s

Still faster (~45minutes)
Still some tests missing in the final report: (795 vs. 797). Don't know why yet.

TestInstantSchemaChange.testInstantSchemaJanitor: Too many open files
TestFromClientSide.testPoolBehavior: Fixed in v4.
TestTableMapReduce.testMultiRegionTable: NumberFormatException
TestHFileOutputFormat.testMRIncrementalLoad: NumberFormatException
TestHFileOutputFormat.testMRIncrementalLoadWithSplit: NumberFormatException
TestHFileOutputFormat.testExcludeMinorCompaction: NumberFormatException

NumberFormatException: seems to be unrelated, I've got it on other patches as 
well.

@stack: it's a parallelisation per process, so it should not be necessary: 
there is no added load on maven itself. However, the results of the first run 
seem to say that the hadoop-qa machine was overloaded. But it seems that the 
other runs went well.


 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-19 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172259#comment-13172259
 ] 

nkeywal commented on HBASE-5064:


Error while running command to get file permissions : java.io.IOException: 
Cannot run program /bin/ls: 
java.lang.OutOfMemoryError: unable to create new native thread

= undersized hadoop-qa

= reducing the number of threads used didn't help much
= that two failures for undersizing out of 5 tries. 

Let's do another one with only 1 thread to see if everything goes right when 
there is no //

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch, 5064.v5.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5051) HBaseTestingUtility#getHBaseAdmin() creates a new HBaseAdmin instance at each call

2011-12-19 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172263#comment-13172263
 ] 

nkeywal commented on HBASE-5051:


TestMasterReplication hang; but it hang on trunk as well (see: 
https://builds.apache.org/job/HBase-TRUNK/2558/console)
TestReplication: usually flaky.

I think the patch is ok.


 HBaseTestingUtility#getHBaseAdmin() creates a new HBaseAdmin instance at each 
 call
 --

 Key: HBASE-5051
 URL: https://issues.apache.org/jira/browse/HBASE-5051
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5051.patch, 5051.v2.patch, 5051.v2.patch, 5051.v2.patch, 
 5051.v2.patch


 As it's a new instance, it should be closed. As the function name seems to 
 imply that it's an instance managed by HBaseTestingUtility, most of the users 
 don't close it = leak

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-19 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172375#comment-13172375
 ] 

nkeywal commented on HBASE-5064:


#535 (with previous version of surefire)
Total time: 1:49:04.397s
Tests run: 785, Failures: 0, Errors: 3, Skipped: 9

#537 (2 threads)
Total time: 1:13:18.269s
Tests run: 771, Failures: 3, Errors: 1, Skipped: 9

#539 (2 threads, thread min configuration)
Total time: 1:05:24.474s
Tests run: 780, Failures: 4, Errors: 2, Skipped: 9

#546 (1 threads, thread min configuration)
Tests run: 771, Failures: 3, Errors: 2, Skipped: 9
Total time: 2:07:01.697s

Seems that the missing tests are more or less random as well. Let's do another 
try with 3 threads.



 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch, 5064.v5.patch, 5064.v6.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-19 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172383#comment-13172383
 ] 

nkeywal commented on HBASE-5064:


@stack: yes,it should be that there are two many threads
- jvm limitation that we should manage with -Xmx or -Xss = but then we should 
reproduce it on our machine as well (asd far as I know no one does reproduce it 
locally), and we already set it to -Xmx1900m
- a limitation on the total number of thread on the machine = but it's quite 
strange
- the ulimit per process, but then running the tests on multiple processes 
should not change the probability to have an issue

The patch v2+ changed HBase configuration, On TestAdmin, we need 40 less 
threads, so the benefit is not high. And I saw no real impact.

There are some stuff to look at, as sysctl kernel.threads-max, i will check if 
it is printed in the default script.



 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v7.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-19 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172393#comment-13172393
 ] 

nkeywal commented on HBASE-5064:


I've seen something. If it's not the problem, it is a problem.

On hadoop QA, we have
max user processes  (-u) 2048

Try this on you computer:
ulimit -u 256
Then run a test such as TestAdmin
You will have a nice java.lang.OutOfMemoryError: unable to create new native 
thread


This because this limit controls not only the number of processes but as well 
the number of native threads.

We should have this set to a much bigger value.





 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v7.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-19 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172420#comment-13172420
 ] 

nkeywal commented on HBASE-5064:


There are two issues:

1) open files (-n) 1024
This limit is per process
This is the cause for too many open file
This occurs when a test, for a reason or another, goes beyond this limit (some 
of the tests are very close)

2) max user processes (-u) 2048
This limit is per user/machine, not per process!
This is the cause for unable to create new native thread
This occurs when the machine runs multiple tests in parallel, and  when all 
tests happen to use a lot of threads.

Both must be changed. As we should limit the number of file descriptor used, it 
makes sense to set open files to something reasonable (like 2048 or 4096), but 
the number of max user processes could be set to unlimited without any issue, 
as when it fails it's because there are other processes using threads as well, 
and we don't control them.




 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v7.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-19 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172435#comment-13172435
 ] 

nkeywal commented on HBASE-5064:


For #547
NumberFormatException:
mapreduce.TestHFileOutputFormat.testExcludeMinorCompaction
mapreduce.TestHFileOutputFormat.testMRIncrementalLoadWithSplit
mapreduce.TestHFileOutputFormat.testMRIncrementalLoad
mapred.TestTableMapReduce.testMultiRegionTable

Too many open files
TestInstantSchemaChange.testInstantSchemaJanitor

??
TestSplitLogManager.testMultipleResubmits


Total time: 41:21.481s
Tests run: 733, Failures: 4, Errors: 2, Skipped: 6

Many tests hung or are not in the report, so the final result is not 
meaningfull.
But it will make more sense to try when the ulimit settings are updated.

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v7.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-19 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172438#comment-13172438
 ] 

nkeywal commented on HBASE-5064:


@stack; the synthesis is at 19/Dec/11 17:26, and yes, we really need these 
values to be changed :-)

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch, 5064.v2.patch, 5064.v3.patch, 
 5064.v4.patch, 5064.v5.patch, 5064.v6.patch, 5064.v7.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-18 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171811#comment-13171811
 ] 

nkeywal commented on HBASE-5064:


java.lang.OutOfMemoryError: unable to create new native thread: hadoop-qa 
can't execute 3 tests in //
Let's try with 2.

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-18 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171815#comment-13171815
 ] 

nkeywal commented on HBASE-5064:


hum, it was already set to 2. Let's retry, but hadoop-qa is not really 
oversized...

 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5064) use surefire tests parallelization

2011-12-18 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171847#comment-13171847
 ] 

nkeywal commented on HBASE-5064:


This time is worked: I have the errors above on another patch as well.

Logs says: Total time: 1:13:18.269s
To be compared with prebuild #535: Total time: 1:49:04.397s
=50 minutes faster.

However, we have some tests without results: Tests run: 771, Failures: 3, 
Errors: 1, Skipped: 9
To be compared with prebuild #535: Tests run: 785, Failures: 0, Errors: 3, 
Skipped: 9


 use surefire tests parallelization
 --

 Key: HBASE-5064
 URL: https://issues.apache.org/jira/browse/HBASE-5064
 Project: HBase
  Issue Type: Improvement
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5064.patch, 5064.patch


 To be tried multiple times on hadoop-qa before committing.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5051) HBaseTestingUtility#getHBaseAdmin() creates a new HBaseAdmin instance at each call

2011-12-17 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171522#comment-13171522
 ] 

nkeywal commented on HBASE-5051:


mapred.TestTableMapReduce.testMultiRegionTable: NumberFormatException: For 
input string: 18446743988037478956
mapreduce.TestHFileOutputFormat.testMRIncrementalLoad: NumberFormatException: 
For input string: 18446743988037169444
mapreduce.TestHFileOutputFormat.testMRIncrementalLoadWithSplit 
NumberFormatException: For input string: 18446743988037182259
mapreduce.TestHFileOutputFormat.testExcludeMinorCompaction: 
NumberFormatException: For input string: 18446743988037186552
TestReplication.queueFailover: ??
TestInstantSchemaChange.testInstantSchemaJanitor: Too many open files
TestMasterReplication: hung

TestMasterReplication works locally. The patch should not have impacted it. It 
hungs on trunk as well (see: 
https://builds.apache.org/job/HBase-TRUNK/2553/testReport/org.apache.hadoop.hbase.replication/
 : there is no entry for TestMasterReplication)
TestReplication works locally. This test is usually flaky.
mapred.TestTableMapReduce worked in the previous prebuild.
mapreduce.TestHFileOutputFormat  worked in the previous prebuild.

Let's retry.

 HBaseTestingUtility#getHBaseAdmin() creates a new HBaseAdmin instance at each 
 call
 --

 Key: HBASE-5051
 URL: https://issues.apache.org/jira/browse/HBASE-5051
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5051.patch, 5051.v2.patch, 5051.v2.patch, 5051.v2.patch


 As it's a new instance, it should be closed. As the function name seems to 
 imply that it's an instance managed by HBaseTestingUtility, most of the users 
 don't close it = leak

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5053) HCM Tests leaks connections

2011-12-16 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171011#comment-13171011
 ] 

nkeywal commented on HBASE-5053:


org.apache.hadoop.hbase.client.TestInstantSchemaChange = Too many open files

 HCM Tests leaks connections
 ---

 Key: HBASE-5053
 URL: https://issues.apache.org/jira/browse/HBASE-5053
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.94.0
Reporter: nkeywal
Assignee: nkeywal
Priority: Minor
 Attachments: 5053.patch


 There are simple leaks and one more complex.
 The complex one comes from the fact fact 
 HConnectionManager.HConnectionImplementation keeps a *reference* to the 
 configuration used for the creation. So if this configuration is updated 
 later, the HConnectionKey created initially will differ from the current one. 
 As a consequence, the close() will not find the connection anymore in the 
 list, and the connection won't be deleted.
 I added a warning when a close does not find the connection in the list; but 
 I wonder if we should not copy the HConnectionKey instead of keeping a 
 reference.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




  1   2   3   >