[jira] [Commented] (HBASE-8105) RegionServer Doesn't Rejoin Cluster after Netsplit

2013-03-15 Thread philo vivero (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603957#comment-13603957
 ] 

philo vivero commented on HBASE-8105:
-------------------------------------

It occurs to me that I might not have understood the first question: "is the RS 
still running?" The RS process is still running on the node, but the logs seem 
to indicate that it's not doing anything until the restart.

Perhaps a better wording for the title would be "RegionServer Process Doesn't 
Die on Abnormal Loss of Network Connectivity to the Cluster"? But then maybe 
this is considered normal (though I'd expect something in the logs along the 
lines of "Ceasing normal activity, but keeping process alive [for whatever 
reason]").

It seems the advice to move the discussion to the mailing list would be apropos 
if the RegionServer process staying alive under this circumstance is normal.

> RegionServer Doesn't Rejoin Cluster after Netsplit
> --------------------------------------------------
>
> Key: HBASE-8105
> URL: https://issues.apache.org/jira/browse/HBASE-8105
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 0.92.1
> Environment: Linux Ubuntu 10.04 LTS
> Reporter: philo vivero
>
> Running a 15-node HBase cluster. Testing various failure scenarios. Segregate 
> one RegionServer from the cluster by firewalling off every port except SSH 
> (because we need to be able to re-enable the node later).
> After the RS is automatically removed from the cluster, we re-enable all 
> ports again, but RS never rejoins the cluster.
> I suspect this may be the desired behaviour, but haven't found proof 
> so far. The code doesn't have any comment indicating that this behaviour is 
> desired:
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.92.2/org/apache/hadoop/hbase/regionserver/HRegionServer.java/
> See lines starting at 624, public void run(). It makes it through the first 
> try/catch block, but then loops inside the second try/catch block. Our 
> hypothesis is that it never gets out naturally.
> If we bounce the RegionServer process, then it rejoins the cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HBASE-8105) RegionServer Doesn't Rejoin Cluster after Netsplit

2013-03-15 Thread philo vivero (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603949#comment-13603949
 ] 

philo vivero commented on HBASE-8105:
-------------------------------------

I do have copious logs for the time of the outage, on the order of megabytes. 
Do you want all of them, or should I grep for particular regexps that would be 
more useful to you?

In general, it starts with this:

2013-03-12 17:11:43,241 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server rs9,60020,1363131519016: IOE in log roller

Then a bunch of things like this:

2013-03-12 17:11:43,241 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
2013-03-12 17:11:43,242 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
2013-03-12 17:11:43,242 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
2013-03-12 17:11:45,245 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
2013-03-12 17:11:45,246 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Error while syncing, requesting close of hlog
2013-03-12 17:11:47,254 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
2013-03-12 17:11:47,254 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Error while syncing, requesting close of hlog

Those last two errors repeat alternately for a long time, then eventually:

2013-03-12 17:26:13,888 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Close and delete failed
2013-03-12 17:27:25,892 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 3 retries

Once we allow RS back into the network and bounce the RS, these ERRORs appear:

2013-03-12 17:51:15,908 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/replication/peers already exists and this is not a retry
2013-03-12 17:51:15,914 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/replication/rs already exists and this is not a retry
2013-03-12 17:51:15,925 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/replication/state already exists and this is not a retry

The full logs are 23MB, and the output of "egrep 'ERROR|FATAL'" for the 
affected time period is 112KB. Still want it attached?
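
For anyone who wants to narrow a log of this size before attaching it, here is a minimal Python sketch of the same filtering. It assumes each entry sits on a single line in the actual log file and starts with the "date time,millis LEVEL logger: message" prefix seen in the excerpts above. The function names filter_log and summarize are made up for this illustration. Collapsing identical messages into counts may help here because several of the errors repeat for a long time.

```python
import re
from datetime import datetime

# Prefix seen in the excerpts: "2013-03-12 17:11:43,241 FATAL some.Logger: message"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} (\w+) (\S+): (.*)$"
)


def filter_log(lines, start, end, levels=("ERROR", "FATAL")):
    """Yield (timestamp, level, logger, message) for entries in the window."""
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m is None:
            # Lines without the prefix (e.g. stack trace continuations) are skipped.
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if m.group(2) in levels and start <= ts <= end:
            yield ts, m.group(2), m.group(3), m.group(4)


def summarize(records):
    """Count how often each (level, logger, message) triple occurs."""
    counts = {}
    for _, level, logger, message in records:
        key = (level, logger, message)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Typical use would be to open the RegionServer log file, pass the file object to filter_log together with two datetime bounds for the outage window, and print the result of summarize sorted by count.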




[jira] [Commented] (HBASE-8105) RegionServer Doesn't Rejoin Cluster after Netsplit

2013-03-14 Thread Anoop Sam John (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603082#comment-13603082
 ] 

Anoop Sam John commented on HBASE-8105:
---------------------------------------

Can you attach some logs?



[jira] [Commented] (HBASE-8105) RegionServer Doesn't Rejoin Cluster after Netsplit

2013-03-14 Thread nkeywal (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602355#comment-13602355
 ] 

nkeywal commented on HBASE-8105:
--------------------------------

I suppose you have a YouAreDeadException in the logs?
This would be expected. The logic is that the region server cannot be trusted 
anymore as it was ejected from the cluster. Then yes, it could abort. On the 
other hand, you may want to look at it in detail. Personally, I would prefer it 
to abort, to be sure I don't have clients trying to use this dead server.

Note that for questions or discussions, it's better to use the user mailing 
list.
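
The ejection logic described above can be sketched as a toy model. This is not HBase code: ToyMaster, ToyRegionServer and YouAreDeadError are hypothetical stand-ins written for this illustration. It also assumes that the trailing number in a server name such as rs9,60020,1363131519016 (taken from the reporter's logs) is a per-process start code, so that a restarted process presents itself under a new name.

```python
class YouAreDeadError(Exception):
    """Toy stand-in for the YouAreDeadException mentioned above."""


class ToyMaster:
    """Hypothetical master that tracks live and ejected server names."""

    def __init__(self):
        self.live = set()
        self.dead = set()

    def expire(self, server_name):
        # Called when the cluster decides the server is gone,
        # for example after its ZooKeeper session is lost.
        self.live.discard(server_name)
        self.dead.add(server_name)

    def report(self, server_name):
        # A name that was already ejected is rejected, not re-admitted.
        if server_name in self.dead:
            raise YouAreDeadError(server_name)
        self.live.add(server_name)


class ToyRegionServer:
    """Hypothetical region server that aborts once it is told it is dead."""

    def __init__(self, name, master):
        self.name = name
        self.master = master
        self.aborted = False

    def heartbeat(self):
        if self.aborted:
            return False
        try:
            self.master.report(self.name)
            return True
        except YouAreDeadError:
            # The server can no longer be trusted, so it stops instead of rejoining.
            self.aborted = True
            return False
```

In this model, a server that heartbeats after expire() has been called for its name ends up aborted, while a fresh ToyRegionServer created with a different start code in its name is accepted. Under the stated assumptions, that is consistent with the observation that bouncing the process lets it rejoin.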



[jira] [Commented] (HBASE-8105) RegionServer Doesn't Rejoin Cluster after Netsplit

2013-03-14 Thread Time Less (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602334#comment-13602334
 ] 

Time Less commented on HBASE-8105:
----------------------------------

The RS runs the whole time, so yes, still running when ports are re-opened.

It definitely would lose its ZK connection. But then, I would expect that when
it begins communicating with ZK again, it would note its "I've been ejected
from the cluster" status and rejoin, or the RS process would die, or something.
An RS process that keeps running normally but is not part of the cluster seems
an erroneous state.


On Thu, Mar 14, 2013 at 8:06 AM, Jean-Marc Spaggiari (JIRA) wrote:

> RegionServer Doesn't Rejoin Cluster after Netsplit
> --------------------------------------------------
>
> Key: HBASE-8105
> URL: https://issues.apache.org/jira/browse/HBASE-8105
> Project: HBase
>  Issue Type: Bug
>  Components: regionserver
>Affects Versions: 0.92.1
> Environment: Linux Ubuntu 10.04 LTS
>Reporter: philo vivero
>
> Running a 15-node HBase cluster. Testing various failure scenarios. Segregate 
> one RegionServer from the cluster by firewalling off every port except SSH 
> (because we need to be able to re-enable the node later).
> After the RS is automatically removed from the cluster, we re-enable all 
> ports again, but RS never rejoins the cluster.
> I suspect the possibility this is desired behaviour, but haven't found proof 
> so far. The code doesn't have any comment indicating this is the behaviour 
> desired:
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.92.2/org/apache/hadoop/hbase/regionserver/HRegionServer.java/
> See lines starting at 624, public void run(). It makes it through the first 
> try/catch block, but then loops inside the second try/catch block. Our 
> hypothesis is that it never gets out naturally.
> If we bounce the RegionServer process, then it rejoins the cluster.



[jira] [Commented] (HBASE-8105) RegionServer Doesn't Rejoin Cluster after Netsplit

2013-03-14 Thread Jean-Marc Spaggiari (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602324#comment-13602324
 ] 

Jean-Marc Spaggiari commented on HBASE-8105:
--------------------------------------------

Is the RS still running when you re-open the ports? Since it lost the 
connection with ZooKeeper, it might have gone down already, no?
