[jira] [Commented] (HBASE-8105) RegionServer Doesn't Rejoin Cluster after Netsplit
[ https://issues.apache.org/jira/browse/HBASE-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603957#comment-13603957 ]

philo vivero commented on HBASE-8105:
-------------------------------------

It occurs to me that I might not have understood the first question: "is the RS still running?" The RS process is still running on the node, but the logs seem to indicate that it's not doing anything until the restart.

Perhaps a good re-wording of this would be "RegionServer Process Doesn't Die on Abnormal Loss of Network Connectivity to the Cluster"? But then maybe this is considered normal (though I'd expect something in the logs along the lines of "Ceasing normal activity, but keeping process alive [for whatever reason]"). It seems the advice to move the discussion to the mailing list would be apropos if the RegionServer process staying alive under this circumstance is normal.

> RegionServer Doesn't Rejoin Cluster after Netsplit
> --------------------------------------------------
>
> Key: HBASE-8105
> URL: https://issues.apache.org/jira/browse/HBASE-8105
> Project: HBase
> Issue Type: Bug
> Components: regionserver
> Affects Versions: 0.92.1
> Environment: Linux Ubuntu 10.04 LTS
> Reporter: philo vivero
>
> Running a 15-node HBase cluster. Testing various failure scenarios. Segregate
> one RegionServer from the cluster by firewalling off every port except SSH
> (because we need to be able to re-enable the node later).
> After the RS is automatically removed from the cluster, we re-enable all
> ports again, but RS never rejoins the cluster.
> I suspect this may be desired behaviour, but haven't found proof
> so far. The code doesn't have any comment indicating this is the behaviour
> desired:
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.92.2/org/apache/hadoop/hbase/regionserver/HRegionServer.java/
> See lines starting at 624, public void run(). It makes it through the first
> try/catch block, but then loops inside the second try/catch block. Our
> hypothesis is that it never gets out naturally.
> If we bounce the RegionServer process, then it rejoins the cluster.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-8105) RegionServer Doesn't Rejoin Cluster after Netsplit
[ https://issues.apache.org/jira/browse/HBASE-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603949#comment-13603949 ]

philo vivero commented on HBASE-8105:
-------------------------------------

I do have copious logs for the time of the outage, on the order of megabytes. Do you want all of them, or should I grep for particular regexps that would be more useful to you? In general, it starts with this:

2013-03-12 17:11:43,241 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server rs9,60020,1363131519016: IOE in log roller

Then a bunch of things like this:

2013-03-12 17:11:43,241 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
2013-03-12 17:11:43,242 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
2013-03-12 17:11:43,242 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
2013-03-12 17:11:45,245 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
2013-03-12 17:11:45,246 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Error while syncing, requesting close of hlog
2013-03-12 17:11:47,254 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
2013-03-12 17:11:47,254 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Error while syncing, requesting close of hlog

Those last two errors repeat alternately for a long time, then eventually:

2013-03-12 17:26:13,888 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Close and delete failed
2013-03-12 17:27:25,892 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 3 retries

Once we allow RS back into the network and bounce the RS, these ERRORs appear:

2013-03-12 17:51:15,908 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/replication/peers already exists and this is not a retry
2013-03-12 17:51:15,914 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/replication/rs already exists and this is not a retry
2013-03-12 17:51:15,925 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/replication/state already exists and this is not a retry

The full logs are 23MB, and "egrep 'ERROR|FATAL'" for the affected time period is 112KB. Still want it attached?
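For anyone wanting to trim a large RegionServer log the same way, here is a minimal sketch of the filter described in this comment: the equivalent of egrep 'ERROR|FATAL' restricted to a time window. It assumes the line layout seen in the excerpts (timestamp, level, logger, colon, message). The names filter_log and LINE_RE and the window bounds are illustrative only and are not part of HBase.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, taken from the excerpts in this thread:
#   2013-03-12 17:11:43,241 FATAL org...HRegionServer: message
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+(\w+)\s+(\S+):\s(.*)$"
)


def filter_log(lines, start, end, levels=("ERROR", "FATAL")):
    """Yield (timestamp, level, logger, message) for every line whose level
    is in `levels` and whose timestamp lies within [start, end]."""
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m is None:
            continue  # e.g. a stack-trace continuation line
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if m.group(2) in levels and start <= ts <= end:
            yield ts, m.group(2), m.group(3), m.group(4)


# Sample lines copied from the log excerpts in this thread.
sample = [
    "2013-03-12 17:11:43,241 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: "
    "ABORTING region server rs9,60020,1363131519016: IOE in log roller",
    "2013-03-12 17:11:45,245 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: "
    "Could not sync. Requesting close of hlog",
    "2013-03-12 17:11:45,246 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: "
    "Error while syncing, requesting close of hlog",
    "2013-03-12 17:51:15,908 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: "
    "Node /hbase/replication/peers already exists and this is not a retry",
]

# Keep only the outage window; the 17:51 line (after the bounce) falls outside it.
window = (datetime(2013, 3, 12, 17, 0), datetime(2013, 3, 12, 17, 30))
hits = list(filter_log(sample, *window))
counts = Counter(level for _, level, _, _ in hits)
print(counts)
```

To run it against a real file, pass the open file object as `lines`. Counting by logger or by message instead of by level is a quick way to see which errors dominate before deciding what to attach.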
[jira] [Commented] (HBASE-8105) RegionServer Doesn't Rejoin Cluster after Netsplit
[ https://issues.apache.org/jira/browse/HBASE-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603082#comment-13603082 ]

Anoop Sam John commented on HBASE-8105:
---------------------------------------

Can you attach some logs?
[jira] [Commented] (HBASE-8105) RegionServer Doesn't Rejoin Cluster after Netsplit
[ https://issues.apache.org/jira/browse/HBASE-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602355#comment-13602355 ]

nkeywal commented on HBASE-8105:
--------------------------------

I suppose you have a YouAreDeadException in the logs? This would be expected. The logic is that the region server cannot be trusted anymore, as it was ejected from the cluster. Then yes, it could abort. On the other hand, you may want to look at it in detail. Personally, I would prefer that it abort, to be sure I don't have clients trying to use this dead server.

Note that for questions or discussions, it's better to use the user mailing list.
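The logic described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not HBase code: ToyMaster, ToyRegionServer and YouAreDeadError are hypothetical stand-ins, and the model assumes that a server identity such as rs9,60020,1363131519016 ends in a per-process start code, so that a restarted process presents a new identity.

```python
class YouAreDeadError(Exception):
    """Hypothetical stand-in named after the exception mentioned above."""


class ToyMaster:
    """Toy cluster membership: tracks which server identities are live or dead."""

    def __init__(self):
        self.live = set()
        self.dead = set()

    def register(self, server):
        if server in self.dead:
            raise YouAreDeadError(server)
        self.live.add(server)

    def expire(self, server):
        # Called when the server's coordination session is lost (the netsplit).
        self.live.discard(server)
        self.dead.add(server)

    def report(self, server):
        # An ejected identity is never trusted again.
        if server not in self.live:
            raise YouAreDeadError(server)


class ToyRegionServer:
    def __init__(self, name, master):
        self.name = name
        self.master = master
        self.running = True
        self.abort_reason = None
        master.register(name)

    def heartbeat(self):
        try:
            self.master.report(self.name)
        except YouAreDeadError:
            # The "abort" choice: stop rather than linger outside the cluster.
            self.abort("rejected by master as dead")

    def abort(self, reason):
        self.running = False
        self.abort_reason = reason


master = ToyMaster()
rs = ToyRegionServer("rs9,60020,1", master)  # illustrative start code 1
rs.heartbeat()
alive_before_split = rs.running

master.expire(rs.name)  # server was firewalled off and ejected
rs.heartbeat()          # ports re-opened: first contact is rejected
alive_after_split = rs.running

# A bounced process registers under a new identity (new start code).
rs2 = ToyRegionServer("rs9,60020,2", master)
rs2.heartbeat()
```

In this model the old identity can never rejoin, which matches the observation in the issue that only a bounced process comes back. Whether the old process should abort immediately or stay up is the design choice this thread is debating.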
[jira] [Commented] (HBASE-8105) RegionServer Doesn't Rejoin Cluster after Netsplit
[ https://issues.apache.org/jira/browse/HBASE-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602334#comment-13602334 ]

Time Less commented on HBASE-8105:
----------------------------------

The RS runs the whole time, so yes, it is still running when the ports are re-opened. It definitely would lose its ZK connection. But then, I would expect that when it begins communicating with ZK again, it would note its "I've been ejected from the cluster" status and rejoin, or the RS process would die, or something. An RS process that keeps running normally but is not part of the cluster seems an erroneous state.

On Thu, Mar 14, 2013 at 8:06 AM, Jean-Marc Spaggiari (JIRA) wrote:
[jira] [Commented] (HBASE-8105) RegionServer Doesn't Rejoin Cluster after Netsplit
[ https://issues.apache.org/jira/browse/HBASE-8105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13602324#comment-13602324 ]

Jean-Marc Spaggiari commented on HBASE-8105:
--------------------------------------------

Is the RS still running when you re-open the ports? Since it lost the connection with ZooKeeper, it might have gone down already, no?