[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2013-05-20 Thread Liang Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13661824#comment-13661824
 ] 

Liang Lee commented on HBASE-5844:
--

Hi,stack ,Could you please provide some document about how to use this pach 
like HBASE-7404?
Where does the core configution HBASE_ZNODE_FILE should be configured?
Thanks
 

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.95.2
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
 Fix For: 0.95.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2013-05-20 Thread Liang Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13662615#comment-13662615
 ] 

Liang Lee commented on HBASE-5844:
--

And I still have another question:in the hbase-daemon.sh file,in the 
cleanZNode() funtion ,which znode is deleted ?
Under the normal scenario,when RegionServer starts ,it will create temporary 
znode on Zookeeper in the path /habse/rs,and what we delete in the cleanZNode() 
funtion ?
Thank you!

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.95.2
Reporter: Nicolas Liochon
Assignee: Nicolas Liochon
 Fix For: 0.95.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-11-29 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13506832#comment-13506832
 ] 

Jean-Daniel Cryans commented on HBASE-5844:
---

Encountered another problem that I think I can link to this jira, I was trying 
to run HBase from trunk without internet access and like in my Sept 25th 
comment, I get an empty line after start-hbase.sh but now nothing is running. 
The .log file doesn't show anything after logging ulimit and nothing's in the 
.out file. After running some bash -x, I was able to figure out that the nohup 
output was being suppressed. See:

{noformat}
jdcryans-MBPr:hbase-github jdcryans$ ./bin/start-hbase.sh 
jdcryans-MBPr:hbase-github jdcryans$
jdcryans-MBPr:hbase-github jdcryans$ bash -x ./bin/start-hbase.sh 
... some stuff then
+ /Users/jdcryans/git/hbase-github/bin/hbase-daemon.sh start master
jdcryans-MBPr:hbase-github jdcryans$ bash -x 
/Users/jdcryans/git/hbase-github/bin/hbase-daemon.sh start master
... more stuff
+ nohup /Users/jdcryans/git/hbase-github/bin/hbase-daemon.sh --config 
/Users/jdcryans/git/hbase-github/bin/../conf internal_start master
jdcryans-MBPr:hbase-github jdcryans$ nohup 
/Users/jdcryans/git/hbase-github/bin/hbase-daemon.sh --config 
/Users/jdcryans/git/hbase-github/bin/../conf internal_start master
appending output to nohup.out
{noformat}

So now I see that it's writing to nohup.out, which in turn tells me what really 
happened:

{noformat}
Caused by: java.lang.ClassNotFoundException: 
org.apache.zookeeper.KeeperException
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
{noformat}

Reproing can be done by physically deleting any jar listed in 
target/cached_classpath.txt. In my case I think the jar wasn't available 
because I had no internet connection.

I wonder what other errors it could hide like this.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-09-26 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13463591#comment-13463591
 ] 

nkeywal commented on HBASE-5844:


Humm.
The feature is important imho: waiting 30s (at best) before starting a recovery 
is really nice.
In a ideal world, ZooKeeper would make this less useful by detecting the dead 
process sooner, but still it can't be faster than this. 

Note that znode remover should occur when the process finishes, not before 
starting a new one. What JD describes seems a bug to me.







 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-09-26 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13463899#comment-13463899
 ] 

nkeywal commented on HBASE-5844:


btw: I'm having a look at this to understand what's happening.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-09-26 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13463951#comment-13463951
 ] 

Jean-Daniel Cryans commented on HBASE-5844:
---

To repro: start a local distributed cluster (without hadoop), remove the RS's 
pid file in /tmp, try starting a region server again. My guess is that since 
the port is occupied then the second won't start but the znode deletion still 
runs after.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-09-26 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13464093#comment-13464093
 ] 

nkeywal commented on HBASE-5844:


It's strange, I didn't reproduce it. I should, because it seems logical. Will 
look into it and create jiras.
Anyway, there are bugs around this scenario. For example, when it fails we now 
have a new pid file, but this pid does not match the process. This is true in 
0.90 as well. If there is no process, the error for the stop (in 0.96) will be 
??no regionserver to stop because kill -0 of pid 49938 failed with status 1??. 
If another process took this id (yes it should not happen often), the kill will 
succeed.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-09-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13463401#comment-13463401
 ] 

Jean-Daniel Cryans commented on HBASE-5844:
---

One thing that worries about this patch is the situation where the pid file is 
gone and someone tries to start the region server. It happened to me a bunch of 
times. I tried it with you patch and since it removes ephemeral znode it 
_kills_ the region server that's already running and doesn't start a new one 
because the ports are already occupied.

I'm not sure if this is related to this patch, but we're now missing info when 
using the scripts. We used to have:

{noformat}
su-jdcryans-2:0.94 jdcryans$ ./bin/start-hbase.sh 
localhost: starting zookeeper, logging to 
/Users/jdcryans/Work/HBase/0.94/bin/../logs/hbase-jdcryans-zookeeper-h-25-185.sfo.stumble.net.out
starting master, logging to 
/Users/jdcryans/Work/HBase/0.94/bin/../logs/hbase-jdcryans-master-h-25-185.sfo.stumble.net.out
localhost: starting regionserver, logging to 
/Users/jdcryans/Work/HBase/0.94/bin/../logs/hbase-jdcryans-regionserver-h-25-185.sfo.stumble.net.out
{noformat}

Now we have:

{noformat}
su-jdcryans-2:trunk-commit jdcryans$ ./bin/start-hbase.sh 

su-jdcryans-2:trunk-commit jdcryans$ 
{noformat}

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-09-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13463409#comment-13463409
 ] 

stack commented on HBASE-5844:
--

Looking at this w/ j-d, now we no longer do nohup so the parent process can 
stick around to watch out for the server crash. This make it so now there are 
two  hbase processes listed per launched daemon.  This is kinda ugly.

When we have this bash script watching the running java process we verge into 
the territory normally occupied by babysitters like supervise.   Our parent 
bash script will always be less than a real babysitter -- supervise, god, etc. 
-- so maybe we should just have this kill znode as an optional script w/ 
prescription for how to set it up -- e.g. run znode remover on daemon crash 
before starting new one (if we want supervise to start a new one).

I'm thinking we should back this out since there are open questions still.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-05-05 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13268906#comment-13268906
 ] 

Hudson commented on HBASE-5844:
---

Integrated in HBase-TRUNK-security #192 (See 
[https://builds.apache.org/job/HBase-TRUNK-security/192/])
HBASE-5844 Delete the region servers znode after a regions server crash 
(Revision 1334028)

 Result = SUCCESS
stack : 
Files : 
* /hbase/trunk/bin/hbase-daemon.sh
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java


 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-05-04 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13268224#comment-13268224
 ] 

nkeywal commented on HBASE-5844:


Ready to be committed imho.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-05-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13268530#comment-13268530
 ] 

Hudson commented on HBASE-5844:
---

Integrated in HBase-TRUNK #2848 (See 
[https://builds.apache.org/job/HBase-TRUNK/2848/])
HBASE-5844 Delete the region servers znode after a regions server crash 
(Revision 1334028)

 Result = FAILURE
stack : 
Files : 
* /hbase/trunk/bin/hbase-daemon.sh
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java


 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-05-03 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13267332#comment-13267332
 ] 

nkeywal commented on HBASE-5844:


v4 should be ok.
I will do another jira for the master.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-05-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13267352#comment-13267352
 ] 

Hadoop QA commented on HBASE-5844:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12525420/5844.v4.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 hadoop23.  The patch compiles against the hadoop 0.23.x profile.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1743//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1743//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/1743//console

This message is automatically generated.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch, 
 5844.v3.patch, 5844.v4.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-27 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13263778#comment-13263778
 ] 

nkeywal commented on HBASE-5844:


I found the issue, and (hopefully) a fix. I will have a new patch middle of 
next week, I will include the master znode in this one...

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-26 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13262773#comment-13262773
 ] 

stack commented on HBASE-5844:
--

I reverted the patch from trunk.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13263369#comment-13263369
 ] 

Hudson commented on HBASE-5844:
---

Integrated in HBase-TRUNK-security #186 (See 
[https://builds.apache.org/job/HBase-TRUNK-security/186/])
HBASE-5844 Delete the region servers znode after a regions server crash; 
REVERT (Revision 1330983)

 Result = SUCCESS
stack : 
Files : 
* /hbase/trunk/bin/hbase-daemon.sh
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java


 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13260258#comment-13260258
 ] 

Hudson commented on HBASE-5844:
---

Integrated in HBase-TRUNK-security #182 (See 
[https://builds.apache.org/job/HBase-TRUNK-security/182/])
HBASE-5844 Delete the region servers znode after a regions server crash 
(Revision 1329430)

 Result = SUCCESS
stack : 
Files : 
* /hbase/trunk/bin/hbase-daemon.sh
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java


 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-23 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13259692#comment-13259692
 ] 

nkeywal commented on HBASE-5844:


v2 should be ok. It does not include anymore the fix for HBASE-5666, so it 
cannot be tested locally but I tried it before removing the workaround. 

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch, 5844.v2.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13259811#comment-13259811
 ] 

stack commented on HBASE-5844:
--

Should master write out its znode name too?  If it crashes this code could 
bring on the second master faster?

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch, 5844.v2.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-23 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13259904#comment-13259904
 ] 

nkeywal commented on HBASE-5844:


You're right. I propose to commit this patch, I will then generalize the 
solution to master in another jira.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch, 5844.v2.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-23 Thread Zhihong Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13259914#comment-13259914
 ] 

Zhihong Yu commented on HBASE-5844:
---

{code}
+  (Environment variable HBASE_ZNODE_FILE is no set).);
{code}
'is no set' - 'is not set'

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch, 5844.v2.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-23 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13259921#comment-13259921
 ] 

stack commented on HBASE-5844:
--

@N Ok. I'll commit this then in a new JIRA move out the common code to util or 
some place.  I'll handle Ted's comment on commit.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch, 5844.v2.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13260046#comment-13260046
 ] 

Hudson commented on HBASE-5844:
---

Integrated in HBase-TRUNK #2801 (See 
[https://builds.apache.org/job/HBase-TRUNK/2801/])
HBASE-5844 Delete the region servers znode after a regions server crash 
(Revision 1329430)

 Result = FAILURE
stack : 
Files : 
* /hbase/trunk/bin/hbase-daemon.sh
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java


 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Fix For: 0.96.0

 Attachments: 5844.v1.patch, 5844.v2.patch, 5844.v3.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13258788#comment-13258788
 ] 

nkeywal commented on HBASE-5844:


I didn't know this parameter. It's interesting, because with ZK the default 
timeout is 30 seconds, but with HBase it's now 180s (from hbase-default.xml). 
It was increased to 60s a first time in HBASE-1772. It seems it was increased 
because of the GC.

But it means that deleting immediately the ZK represents a huge mttr 
improvement for the regions server crash case.




 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13258796#comment-13258796
 ] 

stack commented on HBASE-5844:
--

If we go to a count  100 we just continue the startup?  Is that what you want?

{code}
+while (!tracker.checkIfBaseNodeAvailable()  ++count100) {
+  Thread.sleep(100);
+}
{code}

Be like the rest of the code regards spaces; i.e. spaces around operators...


+
+if (fileName==null){


Maybe you don't need deleteMyEphemeralNodeOnDisk if you instead use 
http://docs.oracle.com/javase/6/docs/api/java/io/File.html#deleteOnExit() 
inside in writeMyEphemeralNodeOnDisk?

Patch looks good N.

We upped the timeout because noobs would install hbase then run big mapreduce 
jobs w/o turning jvm and so big GCs.  We figured they'd rather have their 
regionserver ride over the big pauses than have them be 'sensitive' out of the 
box.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-21 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13258803#comment-13258803
 ] 

nkeywal commented on HBASE-5844:


For the tracker, it's my private workaround for HBASE-5666, it should not have 
been included in this patch. Sorry about this.

I think it's better to delete the file explicitly, just after the znode 
deletion. HRegionServer#deleteMyEphemeralNode is called only once, and I added 
deleteMyEphemeralNodeOnDisk just after this call. If we rely on #deleteOnExit, 
I fear we could have the file deleted with a still alive znode. I'm not sure 
and I have not tried it, but I think it's too easy to enter into the 
jvm-specific-behavior space here.

I will fix the java code and try the whole fix on a real cluster for the v2.

Thanks you for the review.





 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13258988#comment-13258988
 ] 

stack commented on HBASE-5844:
--

Ok on your reasoning for not using deleteOnExit.  Try and have the two methods 
share more code like getting the name of the file w/ the znode name in it.  
Otherwise, sounds good.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-20 Thread nkeywal (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13258137#comment-13258137
 ] 

nkeywal commented on HBASE-5844:


Patch v1. Tested on a local cluster  pseudo distributed, by stopping the 
server by kill -9. I will do some minor improvements on the java code and then 
test on a real cluster, but I'm interested by a feedback on the script.


 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-5844) Delete the region servers znode after a regions server crash

2012-04-20 Thread Jean-Daniel Cryans (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-5844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13258422#comment-13258422
 ] 

Jean-Daniel Cryans commented on HBASE-5844:
---

It is deleted automatically, it's an ephemeral znode so it takes 
zk.session.timeout time to see it disappear.

 Delete the region servers znode after a regions server crash
 

 Key: HBASE-5844
 URL: https://issues.apache.org/jira/browse/HBASE-5844
 Project: HBase
  Issue Type: Improvement
  Components: regionserver, scripts
Affects Versions: 0.96.0
Reporter: nkeywal
Assignee: nkeywal
 Attachments: 5844.v1.patch


 today, if the regions server crashes, its znode is not deleted in ZooKeeper. 
 So the recovery process will stop only after a timeout, usually 30s.
 By deleting the znode in start script, we remove this delay and the recovery 
 starts immediately.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira