[jira] Commented: (HBASE-3445) Master crashes on data that was moved from different host
[ https://issues.apache.org/jira/browse/HBASE-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982420#action_12982420 ]

stack commented on HBASE-3445:
------------------------------

James: In the AssignmentManager, where we go to RPC to a remote regionserver, we do the following:

{code}
} catch (ConnectException e) {
  LOG.info("Failed connect to " + server + ", message=" + e.getMessage() +
    ", region=" + region.getEncodedName());
  // Presume that regionserver just failed and we haven't got expired
  // server from zk yet. Let expired server deal with clean up.
} catch (java.net.SocketTimeoutException e) {
  LOG.info("Server " + server + " returned " + e.getMessage() + " for " +
    region.getEncodedName());
  // Presume retry or server will expire.
} catch (EOFException e) {
  LOG.info("Server " + server + " returned " + e.getMessage() + " for " +
    region.getEncodedName());
  // Presume retry or server will expire.
} catch (RemoteException re) {
  IOException ioe = re.unwrapRemoteException();
  if (ioe instanceof NotServingRegionException) {
    // Failed to close, so pass through and reassign
    LOG.debug("Server " + server + " returned " + ioe + " for " +
      region.getEncodedName());
  } else if (ioe instanceof EOFException) {
    // Failed to close, so pass through and reassign
    LOG.debug("Server " + server + " returned " + ioe + " for " +
      region.getEncodedName());
  } else {
    this.master.abort("Remote unexpected exception", ioe);
  }
} catch (Throwable t) {
{code}

I think you are right to add the timeout to the try/catch in getCachedConnection. Maybe we should catch ConnectException there too? Unless you object, I'll add it when I commit your patch.
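A minimal sketch of the handling discussed in the comment above: treat a connect timeout or a refused connection as "no usable connection" instead of letting it propagate and abort the caller. This is not the attached patch or the real CatalogTracker code. `CachedConnectionSketch`, `Connector` and `connectOrNull` are hypothetical names, and returning null on failure is an assumption made for illustration.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;

public class CachedConnectionSketch {

    /** Hypothetical stand-in for the RPC that opens a connection to a region server. */
    interface Connector<T> {
        T connect() throws IOException;
    }

    /**
     * Returns the connection, or null when the server cannot be reached.
     * Any other IOException still propagates to the caller.
     */
    static <T> T connectOrNull(Connector<T> connector) throws IOException {
        try {
            return connector.connect();
        } catch (SocketTimeoutException e) {
            // Stale or slow address: report "no connection" so the caller can wait or reassign.
            return null;
        } catch (ConnectException e) {
            // Nothing is listening at the recorded address.
            return null;
        }
    }

    public static void main(String[] args) throws IOException {
        String live = connectOrNull(() -> "connection");
        String stale = connectOrNull(() -> {
            throw new SocketTimeoutException("timeout while waiting for channel");
        });
        System.out.println(live + " / " + stale);
    }
}
```

The design choice here is that the caller must already cope with a null connection, so the unreachable-server case takes the same path as "location not known yet" rather than surfacing as an unhandled exception.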
Master crashes on data that was moved from different host
---------------------------------------------------------

Key: HBASE-3445
URL: https://issues.apache.org/jira/browse/HBASE-3445
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.0
Reporter: James Kennedy
Priority: Critical
Fix For: 0.90.0
Attachments: 3445_0.90.0.patch

While testing an upgrade to 0.90.0 RC3 I noticed that if I seeded our test data on one machine and transferred it to another machine, the HMaster on the new machine dies on startup.

Based on the following stack trace, it looks as though it is attempting to find the .meta region with the IP address of the original machine. Instead of waiting around for RegionServers to register with new location data, HMaster throws its hands up with a FATAL exception.

Note that deleting the zookeeper dir makes no difference.

Also note that so far I have only reproduced this in my own environment, using the hbase-trx extension of HBase and an ApplicationStarter that starts the Master and RegionServer together in the same JVM. While the issue is likely unrelated to those factors, this is far from a vanilla HBase environment.

I will spend some time trying to reproduce the issue in a proper hbase test. But perhaps someone can beat me to it? How do I simulate the IP switch? May require a data.tar upload.

[14/01/11 10:45:20] 6396 [ Thread-298] ERROR server.quorum.QuorumPeerConfig - Invalid configuration, only one server specified (ignoring)
[14/01/11 10:45:21] 7178 [ main] INFO ion.service.HBaseRegionService - troove region port: 60010
[14/01/11 10:45:21] 7180 [ main] INFO ion.service.HBaseRegionService - troove region interface: org.apache.hadoop.hbase.ipc.IndexedRegionInterface
[14/01/11 10:45:21] 7180 [ main] INFO ion.service.HBaseRegionService - troove root dir: hdfs://localhost:8701/hbase
[14/01/11 10:45:21] 7180 [ main] INFO ion.service.HBaseRegionService - troove Initializing region server.
[14/01/11 10:45:21] 7631 [ main] INFO ion.service.HBaseRegionService - troove Starting region server thread.
[14/01/11 10:46:54] 100764 [HMaster] FATAL he.hadoop.hbase.master.HMaster - Unhandled exception. Starting shutdown.
java.net.SocketTimeoutException: 2 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=192.168.1.102/192.168.1.102:60020]
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:311)
    at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:865)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:732)
    at
[jira] Updated: (HBASE-3445) Master crashes on data that was moved from different host
[ https://issues.apache.org/jira/browse/HBASE-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3445:
-------------------------

Fix Version/s: (was: 0.90.0) 0.90.1

Moved to 0.90.1

Master crashes on data that was moved from different host
---------------------------------------------------------

Key: HBASE-3445
URL: https://issues.apache.org/jira/browse/HBASE-3445
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.0
Reporter: James Kennedy
Priority: Critical
Fix For: 0.90.1
Attachments: 3445_0.90.0.patch

While testing an upgrade to 0.90.0 RC3 I noticed that if I seeded our test data on one machine and transferred it to another machine, the HMaster on the new machine dies on startup.

Based on the following stack trace, it looks as though it is attempting to find the .meta region with the IP address of the original machine. Instead of waiting around for RegionServers to register with new location data, HMaster throws its hands up with a FATAL exception.

Note that deleting the zookeeper dir makes no difference.

Also note that so far I have only reproduced this in my own environment, using the hbase-trx extension of HBase and an ApplicationStarter that starts the Master and RegionServer together in the same JVM. While the issue is likely unrelated to those factors, this is far from a vanilla HBase environment.

I will spend some time trying to reproduce the issue in a proper hbase test. But perhaps someone can beat me to it? How do I simulate the IP switch? May require a data.tar upload.

[14/01/11 10:45:20] 6396 [ Thread-298] ERROR server.quorum.QuorumPeerConfig - Invalid configuration, only one server specified (ignoring)
[14/01/11 10:45:21] 7178 [ main] INFO ion.service.HBaseRegionService - troove region port: 60010
[14/01/11 10:45:21] 7180 [ main] INFO ion.service.HBaseRegionService - troove region interface: org.apache.hadoop.hbase.ipc.IndexedRegionInterface
[14/01/11 10:45:21] 7180 [ main] INFO ion.service.HBaseRegionService - troove root dir: hdfs://localhost:8701/hbase
[14/01/11 10:45:21] 7180 [ main] INFO ion.service.HBaseRegionService - troove Initializing region server.
[14/01/11 10:45:21] 7631 [ main] INFO ion.service.HBaseRegionService - troove Starting region server thread.
[14/01/11 10:46:54] 100764 [HMaster] FATAL he.hadoop.hbase.master.HMaster - Unhandled exception. Starting shutdown.
java.net.SocketTimeoutException: 2 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=192.168.1.102/192.168.1.102:60020]
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:311)
    at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:865)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:732)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:258)
    at $Proxy14.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:419)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:393)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:444)
    at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:349)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:954)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:384)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:283)
    at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:478)
    at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:435)
    at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:382)
    at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:277)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HBASE-3445) Master crashes on data that was moved from different host
[ https://issues.apache.org/jira/browse/HBASE-3445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack reassigned HBASE-3445:
----------------------------

Assignee: James Kennedy

Made James a contributor and assigned him this issue
[jira] Created: (HBASE-3446) ProcessServerShutdown fails if META moves, orphaning lots of regions
ProcessServerShutdown fails if META moves, orphaning lots of regions
--------------------------------------------------------------------

Key: HBASE-3446
URL: https://issues.apache.org/jira/browse/HBASE-3446
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Blocker

I ran a rolling restart on a 5 node cluster with lots of regions, and afterwards had LOTS of regions left orphaned. The issue appears to be that ProcessServerShutdown failed because the server hosting META was restarted around the same time as another server was being processed.
[jira] Commented: (HBASE-3446) ProcessServerShutdown fails if META moves, orphaning lots of regions
[ https://issues.apache.org/jira/browse/HBASE-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982453#action_12982453 ]

Todd Lipcon commented on HBASE-3446:
------------------------------------

After digging through the logs, I found the following:

2011-01-16 18:03:26,164 DEBUG org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Offlined and split region usertable,user136857679,1295149082811.9f2822a04028c86813fe71264da5c167.; checking daughter presence
2011-01-16 18:03:26,169 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
org.apache.hadoop.ipc.RemoteException: java.io.IOException: Server not running
    at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2360)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1754)
    ...
    at $Proxy6.openScanner(Unknown Source)
    at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:260)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.isDaughterMissing(ServerShutdownHandler.java:256)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.fixupDaughter(ServerShutdownHandler.java:214)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.fixupDaughters(ServerShutdownHandler.java:196)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.processDeadRegion(ServerShutdownHandler.java:181)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:151)

Neither the MetaReader code nor the ServerShutdownHandler has any kind of retry/blocking behavior built in here. As a result, many of the regions on the server were left unassigned.
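To make the missing retry behavior concrete, here is a minimal sketch of a bounded retry around a catalog call that can fail while META is moving. It is an illustration only, not the fix that went into HBase. `MetaRetrySketch`, `CatalogCall` and `callWithRetries` are hypothetical names, and the fixed pause between attempts is an assumption.

```java
import java.io.IOException;

public class MetaRetrySketch {

    /** Hypothetical stand-in for a catalog operation such as a full scan of META. */
    interface CatalogCall<T> {
        T call() throws IOException;
    }

    /**
     * Runs the call, retrying on IOException up to maxAttempts times and
     * sleeping pauseMillis between attempts. Rethrows the last failure.
     */
    static <T> T callWithRetries(CatalogCall<T> call, int maxAttempts, long pauseMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                // For example "Server not running" while META is being reassigned.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = callWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Server not running");
            }
            return "scanned";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A real handler would probably also want to block until META has a new location rather than only sleeping, but the bounded loop shows the shape of the behavior the comment says is absent.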
[jira] Commented: (HBASE-3387) Pair does not deep check arrays for equality.
[ https://issues.apache.org/jira/browse/HBASE-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982489#action_12982489 ]

Nicolas Spiegelberg commented on HBASE-3387:
--------------------------------------------

This JIRA patch is a big consistency problem and should be reverted!

http://www.ibm.com/developerworks/java/library/j-jtp05273.html

Basically, Java containers assume a.equals(b) == false if a.hashCode() != b.hashCode(). Furthermore,

byte[] a = {0,1,2}, b = {0,1,2};
false == a.equals(b)
true == Pair.newPair(a,a).equals(Pair.newPair(b,b))

Was this patch introduced to fix any bug in the existing subsystem?

Pair does not deep check arrays for equality.
---------------------------------------------

Key: HBASE-3387
URL: https://issues.apache.org/jira/browse/HBASE-3387
Project: HBase
Issue Type: Bug
Components: util
Affects Versions: 0.90.1
Environment: Any (discovered in Ubuntu 10.10 using TRUNK).
Reporter: Jesse Yates
Priority: Minor
Fix For: 0.90.1, 0.92.0
Attachments: HBASE-3387.patch
Original Estimate: 0h
Remaining Estimate: 0h

Pair does not deep check arrays for equality. It merely does x.equals(y) for the passed-in Object. However, with any type of array this only compares the array references, rather than the underlying data. It requires a rewriting of the private equals method in Pair to check for elements being an array, then checking the underlying elements.
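The equals/hashCode contract raised in the comment above can be illustrated with a small standalone sketch. This is not HBase's Pair class or the HBASE-3387 patch. `BytesPair` is a hypothetical class that shows what a deep `equals` needs alongside it: a `hashCode` derived from the array contents, so that equal pairs land in the same hash bucket.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PairEqualityDemo {

    /** Hypothetical pair of byte arrays; not HBase's Pair. */
    static final class BytesPair {
        private final byte[] first;
        private final byte[] second;

        BytesPair(byte[] first, byte[] second) {
            this.first = first;
            this.second = second;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof BytesPair)) {
                return false;
            }
            BytesPair that = (BytesPair) other;
            // Deep comparison of the array contents.
            return Arrays.equals(first, that.first) && Arrays.equals(second, that.second);
        }

        @Override
        public int hashCode() {
            // Content-based, so it agrees with the deep equals above.
            // Hashing the array references instead would break the contract.
            return 31 * Arrays.hashCode(first) + Arrays.hashCode(second);
        }
    }

    public static void main(String[] args) {
        byte[] a = {0, 1, 2};
        byte[] b = {0, 1, 2};
        System.out.println(a.equals(b));         // false: arrays compare by reference
        System.out.println(Arrays.equals(a, b)); // true: element-wise comparison

        Set<BytesPair> set = new HashSet<>();
        set.add(new BytesPair(a, a));
        System.out.println(set.contains(new BytesPair(b, b))); // true: equals and hashCode agree
    }
}
```

Even with a consistent `hashCode`, the pair still disagrees with its own elements (`a.equals(b)` is false while the pairs compare equal), which is the other half of the objection.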
[jira] Updated: (HBASE-3447) Split parents ending up deployed along with daughters
[ https://issues.apache.org/jira/browse/HBASE-3447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HBASE-3447:
-------------------------------

Attachment: unknown-region.log

Here's a log for one of the regions.

Split parents ending up deployed along with daughters
-----------------------------------------------------

Key: HBASE-3447
URL: https://issues.apache.org/jira/browse/HBASE-3447
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.0
Reporter: Todd Lipcon
Priority: Blocker
Attachments: unknown-region.log

Testing rc3, I got several regions in this state, as reported by hbck:

ERROR: Region UNKNOWN_REGION on haus02.sf.cloudera.com:57020, key=9f2822a04028c86813fe71264da5c167, not on HDFS or in META but deployed on haus02.sf.cloudera.com:57020

(This was without any injected failures or anything.)
[jira] Created: (HBASE-3448) RegionSplitter : Utility class for manual region splitting
RegionSplitter : Utility class for manual region splitting
----------------------------------------------------------

Key: HBASE-3448
URL: https://issues.apache.org/jira/browse/HBASE-3448
Project: HBase
Issue Type: New Feature
Components: client, scripts, util
Reporter: Nicolas Spiegelberg
Assignee: Nicolas Spiegelberg
Priority: Minor
Fix For: 0.90.1

For certain use cases, there are a number of advantages to manually splitting regions instead of having the HBase split code determine this for you automatically. There are currently some API additions to HBaseAdmin and HTable that allow you to manually split on a small scale. This JIRA is about importing a RegionSplitter utility program to help pre-split and perform rolling splits on a live table when needed. Will also add documentation to answer common questions about why you would pre-split.
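As a rough illustration of what pre-splitting involves, the sketch below computes evenly spaced split points for a table whose row keys are assumed to be uniformly distributed 8-digit lowercase hex strings. It is not the RegionSplitter code from the attached patch. `PreSplitSketch` and `hexSplitPoints` are hypothetical names, and the key format is an assumption chosen to keep the arithmetic simple.

```java
import java.util.ArrayList;
import java.util.List;

public class PreSplitSketch {

    /**
     * Returns numRegions - 1 evenly spaced split points over the key space
     * 00000000..ffffffff, formatted as 8-digit lowercase hex strings.
     */
    static List<String> hexSplitPoints(int numRegions) {
        if (numRegions < 1) {
            throw new IllegalArgumentException("numRegions must be at least 1");
        }
        long range = 1L << 32; // number of distinct 8-digit hex keys
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            long point = range * i / numRegions; // boundary between region i-1 and region i
            splits.add(String.format("%08x", point));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Four regions need three boundaries.
        System.out.println(hexSplitPoints(4)); // [40000000, 80000000, c0000000]
    }
}
```

The resulting strings would still have to be converted to byte arrays and handed to whatever table-creation or split call is in use. That step is left out here because it depends on the client API version.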
[jira] Updated: (HBASE-3448) RegionSplitter : Utility class for manual region splitting
[ https://issues.apache.org/jira/browse/HBASE-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Spiegelberg updated HBASE-3448:
---------------------------------------

Attachment: HBASE-3448.patch
[jira] Commented: (HBASE-3448) RegionSplitter : Utility class for manual region splitting
[ https://issues.apache.org/jira/browse/HBASE-3448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982507#action_12982507 ]

Nicolas Spiegelberg commented on HBASE-3448:
--------------------------------------------

https://review.cloudera.org/r/1469/