[jira] Updated: (HBASE-3263) Stack overflow in AssignmentManager
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HBASE-3263:
-------------------------------

    Attachment: stackoverflow-log.txt

Here's a log showing the beginning of the runaway recursion. It repeats like this until a StackOverflowError is thrown.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Created: (HBASE-3263) Stack overflow in AssignmentManager
Stack overflow in AssignmentManager
-----------------------------------

                 Key: HBASE-3263
                 URL: https://issues.apache.org/jira/browse/HBASE-3263
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 0.90.0
            Reporter: Todd Lipcon
            Priority: Blocker
         Attachments: stackoverflow-log.txt

My test cluster experienced a switch outage earlier this week which threw the master into a really bad state. In the catch clause of AssignmentManager.assign, we recurse, and if all of the region servers are inaccessible, we do so until we get a stack overflow.
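The failure mode has a simple shape: retrying by calling the method again from its own catch clause adds a stack frame per failed attempt. A minimal, self-contained sketch of the bug and the usual fix (a bounded iterative retry); the names here are illustrative, not the actual AssignmentManager code:

```java
// Hypothetical sketch of the failure mode -- NOT the real HBase code.
// Retrying by recursing from the catch clause adds a stack frame per failed
// attempt; with every region server unreachable, the attempts never stop and
// the JVM eventually throws StackOverflowError.
final class AssignRetrySketch {

    /** Buggy shape: retry via self-call, so the stack grows without bound. */
    static int recursiveAssign(int attempt, int serversUp) {
        if (serversUp > 0) {
            return attempt;  // some server accepted the region
        }
        // mirrors "in the catch clause ... we recurse":
        return recursiveAssign(attempt + 1, serversUp);
    }

    /** Usual fix: retry in a loop with a cap, keeping the stack flat. */
    static int iterativeAssign(int maxAttempts, int serversUp) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (serversUp > 0) {
                return attempt;
            }
            // real code would pick another server and back off here
        }
        return -1;  // give up cleanly instead of overflowing the stack
    }
}
```

With no servers up, `recursiveAssign` really does die with a StackOverflowError, while `iterativeAssign` simply returns -1 after exhausting its attempt budget.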
[jira] Created: (HBASE-3264) Remove unnecessary Guava Dependency
Remove unnecessary Guava Dependency
-----------------------------------

                 Key: HBASE-3264
                 URL: https://issues.apache.org/jira/browse/HBASE-3264
             Project: HBase
          Issue Type: Bug
          Components: mapreduce
            Reporter: Nicolas Spiegelberg
            Assignee: Nicolas Spiegelberg
            Priority: Minor
             Fix For: 0.90.1
         Attachments: HBASE-3264.patch

Currently, TableMapReduceUtil uses Guava for trivial functionality, and addDependencyJars() adds Guava by default. However, this jar is only necessary for the ImportTsv MR job. This is annoying when naively bundling the hbase jar with an MR job, because you now need a second dependency jar. By default we should bundle only critical dependencies, and jobs that need extra Guava functionality should include it explicitly.
[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934769#action_12934769 ]

Todd Lipcon commented on HBASE-3263:
------------------------------------

Shortly after the StackOverflowError, it also started spitting this exception:

2010-11-19 12:09:50,366 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of usertable,,1289960558114.03110b4c3c0b24fa1c920ec7669d03a6. to serverName=haus03.sf.cloudera.com,60020,1289890926773, load=(requests=0, regions=11, usedHeap=5403, maxHeap=8185), trying to assign elsewhere instead
java.lang.NullPointerException
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:485)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:733)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy8.openRegion(Unknown Source)
        at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:537)
        at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:830)
[jira] Updated: (HBASE-3264) Remove unnecessary Guava Dependency
[ https://issues.apache.org/jira/browse/HBASE-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Spiegelberg updated HBASE-3264:
---------------------------------------

    Attachment: HBASE-3264.patch
[jira] Commented: (HBASE-3263) Stack overflow in AssignmentManager
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934770#action_12934770 ]

Todd Lipcon commented on HBASE-3263:
------------------------------------

And also thereafter, lots of these:

java.lang.NullPointerException
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:485)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:733)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy8.getRegionInfo(Unknown Source)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:416)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:270)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:322)

So somehow we borked a null into one of our maps, it seems.
[jira] Created: (HBASE-3265) Regionservers waiting for ROOT while Master waiting for RegionServers
Regionservers waiting for ROOT while Master waiting for RegionServers
---------------------------------------------------------------------

                 Key: HBASE-3265
                 URL: https://issues.apache.org/jira/browse/HBASE-3265
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.0
            Reporter: Todd Lipcon
            Priority: Critical

After a cluster disastrophe due to a disconnected switch, I ended up in a state where the master was up with no region servers (see HBASE-3263). When I brought the RS back up, because of the aforementioned bug, the master didn't get itself into a happy state (an internal data structure had a null in it). So I killed the master and started it again. Now the master is in "Waiting for region servers to check in" mode, and the region servers are stuck in the following stack:

        - locked 0x2aaab1bda5d0 (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRoot(CatalogTracker.java:177)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:537)
        at java.lang.Thread.run(Thread.java:619)

I imagine what happened is that the RS got through tryReportForDuty with the old master, but the old master was unable to assign anything due to bad state. So when it crashed, all the RS were stuck in waitForRoot(), and when I brought the new one up, no one was reporting for duty.
[jira] Commented: (HBASE-3264) Remove unnecessary Guava Dependency
[ https://issues.apache.org/jira/browse/HBASE-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934776#action_12934776 ]

Nicolas Spiegelberg commented on HBASE-3264:
--------------------------------------------

@Todd: you're not precluded from adding Guava or whatever libraries to this, but I don't think the default action should be to add libraries that you're not using. Guava is currently the only dependency under addDependencyJars(Job) that is not essential for basic HBase table operations. Since addDependencyJars(conf, ...) allows concatenation, you can easily append jars that are necessary for your specific config. We need to use that ourselves to add in compression jars for HFileOutputFormat. Note that I used this API to change the ImportTsv job to append the Guava jar, since it is the job that requires it right now.
[jira] Created: (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup
Master does not seem to properly scan ZK for running RS during startup
----------------------------------------------------------------------

                 Key: HBASE-3266
                 URL: https://issues.apache.org/jira/browse/HBASE-3266
             Project: HBase
          Issue Type: Bug
          Components: master
    Affects Versions: 0.90.0
            Reporter: Todd Lipcon
            Priority: Critical

I was in the situation described by HBASE-3265: I had a number of RS waiting on ROOT, but the master hadn't seen any RS check-ins, so it was waiting for check-ins. To get past this, I restarted one of the region servers. The restarted server checked in, and the master began its startup. At this point the master started scanning /hbase/.logs for things to split. It correctly identified that the RS on haus01 was running (this is the one I restarted):

2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus01.sf.cloudera.com,60020,1290500443143 belongs to an existing region server

but then incorrectly decided that the RS on haus02 was down:

2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus02.sf.cloudera.com,60020,1290498411450 doesn't belong to a known region server, splitting

However, ZK shows that this RS is up:

[zk: haus01.sf.cloudera.com:(CONNECTED) 3] ls /hbase/rs
[haus04.sf.cloudera.com,60020,1290498411533, haus05.sf.cloudera.com,60020,1290498411520, haus03.sf.cloudera.com,60020,1290498411518, haus01.sf.cloudera.com,60020,1290500443143, haus02.sf.cloudera.com,60020,1290498411450]

splitLogsAfterStartup seems to check ServerManager.onlineServers, which as best I can tell is derived from heartbeats and not from ZK (sorry if I got some of this wrong; I'm still new to this codebase). Of course, the master went into an infinite splitting loop at this point, since haus02 is up and renewing its DFS lease on its logs.
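The reconciliation this report suggests can be sketched as a pure function: treat a server as live if it is known from either the heartbeat view or the ephemeral znodes under /hbase/rs, and only split log directories whose owner appears in neither. The names below are illustrative, not the real MasterFileSystem/ServerManager API:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the reconciliation suggested above. A log directory
// is split only if its owning server is live in NEITHER the heartbeat view
// NOR the /hbase/rs znode listing.
final class SplitDecisionSketch {

    /** Returns the owners of log dirs that should be split (no live owner). */
    static List<String> logsToSplit(Set<String> logDirOwners,
                                    Set<String> heartbeatServers,
                                    Set<String> zkServers) {
        Set<String> live = new HashSet<>(heartbeatServers);
        // a server holding an ephemeral znode is alive even if it has not yet
        // checked in with this (freshly restarted) master
        live.addAll(zkServers);
        List<String> toSplit = new ArrayList<>();
        for (String owner : logDirOwners) {
            if (!live.contains(owner)) {
                toSplit.add(owner);
            }
        }
        return toSplit;
    }
}
```

In the scenario above, haus02 has a znode but has not heartbeated to the new master; consulting the union leaves its logs alone instead of looping on a split that the live server keeps blocking with its DFS lease.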
[jira] Created: (HBASE-3267) close_region shell command breaks region
close_region shell command breaks region
----------------------------------------

                 Key: HBASE-3267
                 URL: https://issues.apache.org/jira/browse/HBASE-3267
             Project: HBase
          Issue Type: Bug
          Components: master, regionserver, shell
    Affects Versions: 0.90.0
            Reporter: Todd Lipcon
            Priority: Critical

It used to be that you could use the close_region command from the shell to close a region on one server and have the master reassign it elsewhere. Now if you close a region, you get the following errors in the master log:

2010-11-23 00:46:34,090 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSING for region ffaa7999e909dbd6544688cc8ab303bd from server haus01.sf.cloudera.com,12020,1290501789693 but region was in the state null and not in expected PENDI
2010-11-23 00:46:34,530 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:6-0x12c537d84e10062 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd
2010-11-23 00:46:34,531 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6-0x12c537d84e10062 Retrieved 128 byte(s) of data from znode /hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd and set watcher; region=usertable,user1951957302,1290501969
2010-11-23 00:46:34,531 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=haus01.sf.cloudera.com,12020,1290501789693, region=ffaa7999e909dbd6544688cc8ab303bd
2010-11-23 00:46:34,531 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region ffaa7999e909dbd6544688cc8ab303bd from server haus01.sf.cloudera.com,12020,1290501789693 but region was in the state null and not in expected PENDIN

...and the region just gets stuck closed.
[jira] Created: (HBASE-3268) Auto-tune balance frequency based on cluster size
Auto-tune balance frequency based on cluster size
-------------------------------------------------

                 Key: HBASE-3268
                 URL: https://issues.apache.org/jira/browse/HBASE-3268
             Project: HBase
          Issue Type: Improvement
          Components: master
            Reporter: Todd Lipcon

Right now we only balance the cluster once every 5 minutes by default. This is likely to confuse new users: when you start a new region server, you expect it to pick up some load very quickly, but in the worst case you currently have to wait 5 minutes before it starts doing anything. We could/should also add a button/shell command to trigger a balance now.
[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size
[ https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934795#action_12934795 ]

Andrew Purtell commented on HBASE-3268:
---------------------------------------

+1. Actually, I've been considering filing this as a bug. I have recently been testing some heavy write scenarios that, on current 0.90, pile regions onto a single RS and can cause it to OOME before balancing happens. Perhaps at least the default should be 1 minute instead of 5?
[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size
[ https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934915#action_12934915 ]

Jonathan Gray commented on HBASE-3268:
--------------------------------------

Rather than much more frequent load balancing (or in addition to it), I think we should add more intelligence to non-balancing region assignment, which is all random now. We could also lazily move splits off their original server. But for now, making it more aggressive at one minute or so should be fine.
[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size
[ https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934918#action_12934918 ]

Jonathan Gray commented on HBASE-3268:
--------------------------------------

Also, when the master gets a new regionserver on an already-running cluster, it should automatically trigger a balance.
[jira] Commented: (HBASE-3265) Regionservers waiting for ROOT while Master waiting for RegionServers
[ https://issues.apache.org/jira/browse/HBASE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934921#action_12934921 ]

Jonathan Gray commented on HBASE-3265:
--------------------------------------

The RSs should also be heartbeating in to the master. Can you post full stack dumps from one of the stuck RS and from the master?
[jira] Commented: (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup
[ https://issues.apache.org/jira/browse/HBASE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934924#action_12934924 ]

Jonathan Gray commented on HBASE-3266:
--------------------------------------

Yeah, I think as it currently stands the HMaster is using the startup/heartbeat messages to determine which RS are online. As I commented in the other jira, we should see why those were not coming in. We should do some reconciliation between what we find in ZK and what we think is online based on RPCs, but I'm not sure exactly what course we would take in a state like this.
[jira] Commented: (HBASE-3267) close_region shell command breaks region
[ https://issues.apache.org/jira/browse/HBASE-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934925#action_12934925 ]

Jonathan Gray commented on HBASE-3267:
--------------------------------------

The master doesn't expect a region to be properly closed out on an RS without being the one to tell it to do so. Let me dig into the code and see what the easiest solution would be. @Todd... I had plans to start working on new features for 0.92, stop finding bugs! ;)
[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size
[ https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934927#action_12934927 ]

Todd Lipcon commented on HBASE-3268:
------------------------------------

I think the idea of triggering a balance when we get a new server is a good one. One thing we want to be a little careful of is the situation where someone flips on 10 new servers at the same time. Rather than triggering a rebalance for each (and thus lots of churn), we want a little bit of lag before the rebalance. Maybe when a new server is added, we trigger the rebalance in 5-10 seconds?
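The "little bit of lag" described above is a debounce: each check-in pushes a single pending balance out by a few seconds, so a burst of joins collapses into one rebalance. A minimal deterministic sketch, driven by a logical clock rather than a real timer thread (the names are hypothetical, not HBase's API):

```java
// Deterministic sketch of the suggested lag. Every server join re-arms one
// pending balance `delay` ticks out; a periodic tick fires it once the
// window has been quiet. Ten servers joining in quick succession therefore
// produce a single rebalance, not ten.
final class BalanceDebouncer {
    private final long delay;
    private long dueAt = -1;   // when the coalesced balance should fire; -1 = none pending
    private int balanceRuns = 0;

    BalanceDebouncer(long delay) {
        this.delay = delay;
    }

    /** A new regionserver checked in at time {@code now}: (re)arm the timer. */
    void onServerJoined(long now) {
        dueAt = now + delay;
    }

    /** Called periodically, e.g. from the master's existing chore loop. */
    void tick(long now) {
        if (dueAt >= 0 && now >= dueAt) {
            balanceRuns++;     // one balance covers every join in the window
            dueAt = -1;
        }
    }

    int runs() {
        return balanceRuns;
    }
}
```

Five joins at times 0..4 with a delay of 10 leave the balance due at time 14: a tick at 9 does nothing, a tick at 14 runs exactly one balance.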
[jira] Commented: (HBASE-3268) Auto-tune balance frequency based on cluster size
[ https://issues.apache.org/jira/browse/HBASE-3268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934938#action_12934938 ]

Jonathan Gray commented on HBASE-3268:
--------------------------------------

Something like that sounds reasonable. I'm trying to figure out some of these other issues now, so this is up for grabs.
[jira] Updated: (HBASE-3262) TestHMasterRPCException uses non-ephemeral port for master
[ https://issues.apache.org/jira/browse/HBASE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-3262:
---------------------------------

    Fix Version/s: 0.92.0
           Status: Patch Available  (was: Open)

TestHMasterRPCException uses non-ephemeral port for master
----------------------------------------------------------

                 Key: HBASE-3262
                 URL: https://issues.apache.org/jira/browse/HBASE-3262
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.0
            Reporter: Jonathan Gray
            Assignee: Jonathan Gray
             Fix For: 0.90.0, 0.92.0
         Attachments: HBASE-3262-v1.patch

TestHMasterRPCException instantiates an HMaster but doesn't use an ephemeral port, which can cause the test to fail if the port is already in use.
[jira] Commented: (HBASE-3264) Remove unnecessary Guava Dependency
[ https://issues.apache.org/jira/browse/HBASE-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934954#action_12934954 ]

Jonathan Gray commented on HBASE-3264:
--------------------------------------

I'm fine with adding dependencies/libraries when we are using them, but in general I think we should aim to minimize our dependencies. So I'm +1 for removing an additional dependency from the job if it's trivial to remove. I'm also +1 on a complete client-dep jar if we could get maven to do that for us.
[jira] Updated: (HBASE-3256) Coprocessors: Coprocessor host and observer for HMaster
[ https://issues.apache.org/jira/browse/HBASE-3256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Helmling updated HBASE-3256:
---------------------------------

    Attachment: HBASE-3256_initial.patch

Here's a preview version of the patch adding the MasterObserver interface and related changes and refactorings. The final version is waiting on an implementation of the lifecycle hooks for HBASE-3260. Once I complete those changes, I will merge here, add unit tests, and put the final version of this patch up on review board.

Coprocessors: Coprocessor host and observer for HMaster
-------------------------------------------------------

                 Key: HBASE-3256
                 URL: https://issues.apache.org/jira/browse/HBASE-3256
             Project: HBase
          Issue Type: Sub-task
            Reporter: Andrew Purtell
            Assignee: Gary Helmling
             Fix For: 0.92.0
         Attachments: HBASE-3256_initial.patch

Implement a coprocessor host for HMaster. Hook observers into administrative operations performed on tables (create, alter, assignment, load balance) and allow observers to modify base master behavior. Support automatic loading of coprocessor implementations. Consider refactoring the master coprocessor host and the regionserver coprocessor host into a common base class.
[jira] Created: (HBASE-3269) HBase table truncate semantics seems broken as disable table is now async by default.
HBase table truncate semantics seems broken as disable table is now async by default.
-------------------------------------------------------------------------------------

                 Key: HBASE-3269
                 URL: https://issues.apache.org/jira/browse/HBASE-3269
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.0
         Environment: RHEL5 x86_64
            Reporter: Suraj Varma
            Priority: Critical

The new async design for disable table seems to have caused a side effect on the truncate command (IRC chat with jdcryans).

Apparent cause: disable is now async by default. When truncate is called, the disable operation returns immediately, and when the drop is called the disable operation has still not completed. This results in HMaster.checkTableModifiable() throwing a TableNotDisabledException. In earlier versions, disable returned only after the table was disabled.
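One fix shape for the race described above is for the truncate path to poll table state until the asynchronous disable actually completes before issuing the drop. A minimal sketch of that wait loop; the TableState enum and the Iterator-based state reader are illustrative stand-ins for however the client observes table state, not HBaseAdmin's real API:

```java
// Hypothetical sketch: block truncate's drop step until the async disable
// has finished. Successive observations of the table's state are modeled as
// an Iterator for testability; real code would re-read state from the
// master/ZK and sleep between polls.
final class TruncateSketch {
    enum TableState { ENABLED, DISABLING, DISABLED }

    /**
     * Polls successive state observations until DISABLED or the attempt
     * budget runs out. Returns true when it is safe to issue the drop.
     */
    static boolean waitUntilDisabled(java.util.Iterator<TableState> states, int maxPolls) {
        for (int i = 0; i < maxPolls && states.hasNext(); i++) {
            if (states.next() == TableState.DISABLED) {
                return true;   // disable completed; drop + recreate may proceed
            }
            // real code would sleep between polls
        }
        return false;          // caller surfaces TableNotDisabledException
    }
}
```

This restores the old synchronous truncate semantics from the caller's point of view while keeping disable itself asynchronous.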
[jira] Updated: (HBASE-3269) HBase table truncate semantics seems broken as disable table is now async by default.
[ https://issues.apache.org/jira/browse/HBASE-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-3269: -- Fix Version/s: 0.92.0 0.90.0 Assignee: stack Assigning to Stack and marking it against 0.90 HBase table truncate semantics seems broken as disable table is now async by default. --- Key: HBASE-3269 URL: https://issues.apache.org/jira/browse/HBASE-3269 Project: HBase Issue Type: Bug Affects Versions: 0.90.0 Environment: RHEL5 x86_64 Reporter: Suraj Varma Assignee: stack Priority: Critical Fix For: 0.90.0, 0.92.0 The new async design for disable table seems to have caused a side effect on the truncate command. (IRC chat with jdcryans) Apparent Cause: Disable is now async by default. When truncate is called, the disable operation returns immediately and when the drop is called, the disable operation is still not completed. This results in HMaster.checkTableModifiable() throwing a TableNotDisabledException. With earlier versions, disable returned only after Table was disabled. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3227) Edit of log messages before branching...
[ https://issues.apache.org/jira/browse/HBASE-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12934996#action_12934996 ] HBase Review Board commented on HBASE-3227: --- Message from: st...@duboce.net bq. On 2010-11-22 17:29:45, Nicolas wrote: bq. trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java, line 739 bq. http://review.cloudera.org/r/1212/diff/1/?file=17170#file17170line739 bq. bq. I'd suggest keeping the store name in this debug message since we're considering thread pools for compactions... Won't the store name be part of the path on the next line when we do sf.toString() where sf is the file we're compacting all into? - stack --- This is an automatically generated e-mail. To reply, visit: http://review.cloudera.org/r/1212/#review1971 --- Edit of log messages before branching... Key: HBASE-3227 URL: https://issues.apache.org/jira/browse/HBASE-3227 Project: HBase Issue Type: Improvement Reporter: stack Fix For: 0.90.0 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HBASE-3264) Remove unnecessary Guava Dependency
[ https://issues.apache.org/jira/browse/HBASE-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack resolved HBASE-3264. -- Resolution: Fixed Fix Version/s: (was: 0.90.1) 0.90.0 Hadoop Flags: [Reviewed] Committed. I agree we should use libs instead of writing the stuff ourselves but also that we move to minimize dependencies. In this case, I like the bit of Nicolas footwork that changes a little bit of code so we can cut our client dependencies by 25 (or 33?) percent. Remove unnecessary Guava Dependency --- Key: HBASE-3264 URL: https://issues.apache.org/jira/browse/HBASE-3264 Project: HBase Issue Type: Bug Components: mapreduce Reporter: Nicolas Spiegelberg Assignee: Nicolas Spiegelberg Priority: Minor Fix For: 0.90.0 Attachments: HBASE-3264.patch Currently, TableMapReduceUtil uses Guava for trivial functionality and addDependencyJars() currently adds Guava by default. However, this jar is only necessary for the ImportTsv MR job. This is annoying when naively bundling hbase jar with a MR job because you now need a second dependency jar. Should default bundle with only critical dependencies and have jobs that need fancy Guava functionality explicitly include them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HBASE-3270) When we create the .version file, we should create it in a tmp location and then move it into place
When we create the .version file, we should create it in a tmp location and then move it into place --- Key: HBASE-3270 URL: https://issues.apache.org/jira/browse/HBASE-3270 Project: HBase Issue Type: Improvement Components: master Reporter: stack Priority: Minor Todd suggests over in HBASE-3258 that writing hbase.version, we should write it off in a /tmp location and then move it into place after writing it to protect against case where file writer crashes between creation and write. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
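The tmp-then-rename pattern Todd suggests can be sketched with plain java.nio.file; HBase itself would do the equivalent against HDFS via FileSystem.create and FileSystem.rename. File and directory names below are illustrative, not HBase's actual layout:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class VersionFileWriter {
    // Sketch of the proposed fix: write the version under a temporary name,
    // then move it into place, so a crash between creation and write can
    // never leave a visible half-written hbase.version file.
    static void writeVersion(Path dir, String version) throws IOException {
        Path tmp = dir.resolve(".version.tmp");
        Path dst = dir.resolve("hbase.version");
        Files.write(tmp, version.getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, dst, StandardCopyOption.REPLACE_EXISTING,
                   StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Readers only ever observe either no file or a complete one, which is exactly the property the truncated-file failure in HBASE-3258 lacked.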
[jira] Commented: (HBASE-3262) TestHMasterRPCException uses non-ephemeral port for master
[ https://issues.apache.org/jira/browse/HBASE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935014#action_12935014 ] stack commented on HBASE-3262: -- This patch is fine -- especially if you want to apply this to 0.90 for second RC -- but yeah, best if HTU does this for all HMaster instantiations. TestHMasterRPCException uses non-ephemeral port for master -- Key: HBASE-3262 URL: https://issues.apache.org/jira/browse/HBASE-3262 Project: HBase Issue Type: Bug Affects Versions: 0.90.0 Reporter: Jonathan Gray Assignee: Jonathan Gray Fix For: 0.90.0, 0.92.0 Attachments: HBASE-3262-v1.patch TestHMasterRPCException instantiates an HMaster but doesn't use an ephemeral port which can cause the test to fail if port already in use. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
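The idea behind the fix, binding to port 0 so the OS hands out a free ephemeral port, can be shown with a plain ServerSocket; how the chosen port is wired into the HMaster test configuration is omitted here and would be HBase-specific:

```java
import java.io.IOException;
import java.net.ServerSocket;

class EphemeralPort {
    // Binding to port 0 asks the kernel for any free port, so two test
    // runs (or a test and a leftover daemon) can never collide on a
    // hard-coded port number.
    static int bindEphemeral() throws IOException {
        try (ServerSocket s = new ServerSocket(0)) {
            return s.getLocalPort(); // the concrete port the OS picked
        }
    }
}
```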
[jira] Updated: (HBASE-3267) close_region shell command breaks region
[ https://issues.apache.org/jira/browse/HBASE-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-3267: - Fix Version/s: 0.90.0 Bringing into 0.90.0. close_region shell command breaks region Key: HBASE-3267 URL: https://issues.apache.org/jira/browse/HBASE-3267 Project: HBase Issue Type: Bug Components: master, regionserver, shell Affects Versions: 0.90.0 Reporter: Todd Lipcon Priority: Critical Fix For: 0.90.0 It used to be that you could use the close_region command from the shell to close a region on one server and have the master reassign it elsewhere. Now if you close a region, you get the following errors in the master log:
2010-11-23 00:46:34,090 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSING for region ffaa7999e909dbd6544688cc8ab303bd from server haus01.sf.cloudera.com,12020,1290501789693 but region was in the state null and not in expected PENDI
2010-11-23 00:46:34,530 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: master:6-0x12c537d84e10062 Received ZooKeeper Event, type=NodeDataChanged, state=SyncConnected, path=/hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd
2010-11-23 00:46:34,531 DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil: master:6-0x12c537d84e10062 Retrieved 128 byte(s) of data from znode /hbase/unassigned/ffaa7999e909dbd6544688cc8ab303bd and set watcher; region=usertable,user1951957302,1290501969
2010-11-23 00:46:34,531 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=haus01.sf.cloudera.com,12020,1290501789693, region=ffaa7999e909dbd6544688cc8ab303bd
2010-11-23 00:46:34,531 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region ffaa7999e909dbd6544688cc8ab303bd from server haus01.sf.cloudera.com,12020,1290501789693 but region was in the state null and not in expected PENDIN
and the region just gets stuck closed. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Updated: (HBASE-3265) Regionservers waiting for ROOT while Master waiting for RegionServers
[ https://issues.apache.org/jira/browse/HBASE-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-3265: - Fix Version/s: 0.90.0 Bringing in for triage Regionservers waiting for ROOT while Master waiting for RegionServers - Key: HBASE-3265 URL: https://issues.apache.org/jira/browse/HBASE-3265 Project: HBase Issue Type: Bug Affects Versions: 0.90.0 Reporter: Todd Lipcon Priority: Critical Fix For: 0.90.0 After a cluster disastrophe due to a disconnected switch, I ended up in a state where the master was up with no region servers (see HBASE-3263). When I brought the RS back up, because of the aforementioned bug, the master didn't get itself into a happy state (internal datastructure had some null in it). So I killed the master and started it again. Now, the master is in "Waiting for region servers to check in" mode, and the region servers are in the following stack:
- locked 0x2aaab1bda5d0 (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
 at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForRoot(CatalogTracker.java:177)
 at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:537)
 at java.lang.Thread.run(Thread.java:619)
I imagine what happened is that the RS got through tryReportForDuty with the old master, but the old master was unable to assign anything due to bad state. So, when it crashed, all the RS were stuck in waitForRoot(), and when I brought the new one up, no one was reporting for duty. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HBASE-3266) Master does not seem to properly scan ZK for running RS during startup
[ https://issues.apache.org/jira/browse/HBASE-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-3266: - Fix Version/s: 0.90.0 Bringing into 0.90.0 while we triage. Master does not seem to properly scan ZK for running RS during startup -- Key: HBASE-3266 URL: https://issues.apache.org/jira/browse/HBASE-3266 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.0 Reporter: Todd Lipcon Priority: Critical Fix For: 0.90.0 I was in the situation described by HBASE-3265, where I had a number of RS waiting on ROOT, but the master hadn't seen any RS checkins, so was waiting on checkins. To get past this, I restarted one of the region servers. The restarted server checked in, and the master began its startup. At this point the master started scanning /hbase/.logs for things to split. It correctly identified that the RS on haus01 was running (this is the one I restarted):
2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus01.sf.cloudera.com,60020,1290500443143 belongs to an existing region server
but then incorrectly decided that the RS on haus02 was down:
2010-11-23 00:21:25,595 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://haus01.sf.cloudera.com:11020/hbase-normal/.logs/haus02.sf.cloudera.com,60020,1290498411450 doesn't belong to a known region server, splitting
However ZK shows that this RS is up:
[zk: haus01.sf.cloudera.com:(CONNECTED) 3] ls /hbase/rs
[haus04.sf.cloudera.com,60020,1290498411533, haus05.sf.cloudera.com,60020,1290498411520, haus03.sf.cloudera.com,60020,1290498411518, haus01.sf.cloudera.com,60020,1290500443143, haus02.sf.cloudera.com,60020,1290498411450]
splitLogsAfterStartup seems to check ServerManager.onlineServers, which best I can tell is derived from heartbeats and not from ZK (sorry if I got some of this wrong, still new to this new codebase). Of course, the master went into an infinite splitting loop at this point since haus02 is up and renewing its DFS lease on its logs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
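The fix Todd implies, deciding which logs to split by comparing the .logs directory names against the live servers registered under /hbase/rs rather than against heartbeat-derived onlineServers, reduces to a set difference. A self-contained sketch (illustrative only, not the actual MasterFileSystem code):

```java
import java.util.Set;
import java.util.TreeSet;

class SplitCheck {
    // Only logs whose server name does not appear among the live ZK
    // ephemeral nodes should be split; everything else belongs to a
    // running RS that is still renewing its DFS lease.
    static Set<String> logsToSplit(Set<String> logDirServers,
                                   Set<String> zkLiveServers) {
        Set<String> dead = new TreeSet<>(logDirServers);
        dead.removeAll(zkLiveServers);
        return dead;
    }
}
```

Against the ZK listing above, haus02's log directory would correctly be excluded from splitting because its ephemeral node is still present.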
[jira] Updated: (HBASE-3263) Stack overflow in AssignmentManager
[ https://issues.apache.org/jira/browse/HBASE-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-3263: - Fix Version/s: 0.90.0 Stack overflow in AssignmentManager --- Key: HBASE-3263 URL: https://issues.apache.org/jira/browse/HBASE-3263 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.0 Reporter: Todd Lipcon Priority: Blocker Fix For: 0.90.0 Attachments: stackoverflow-log.txt My test cluster experienced a switch outage earlier this week which threw the master into a really bad state. In the catch clause of AssignmentManager.assign, we recurse, and if all of the region servers are inaccessible, we do so until we get a stack overflow. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
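The usual cure for this class of bug is to turn the recursive retry in the catch clause into a bounded loop, so an unreachable cluster exhausts a retry budget instead of the stack. A schematic sketch with hypothetical names and no backoff (a real fix would also pause between attempts):

```java
class AssignRetry {
    interface Assigner {
        void tryAssign(String region) throws Exception;
    }

    // Loop-based replacement for the recursive retry in
    // AssignmentManager.assign; maxRetries and names are illustrative.
    static boolean assignWithRetries(Assigner a, String region, int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                a.tryAssign(region);
                return true; // assigned successfully
            } catch (Exception e) {
                // all region servers may be unreachable; fall through and retry
            }
        }
        return false; // give up cleanly instead of overflowing the stack
    }
}
```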
[jira] Commented: (HBASE-3261) NPE out of HRS.run at startup when clock is out of sync
[ https://issues.apache.org/jira/browse/HBASE-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935027#action_12935027 ] stack commented on HBASE-3261: -- +1 on adding null check. The sequence in which stuff is started was undergoing high velocity change up to the end. Not surprised there are holes if the start sequence aborted midway. NPE out of HRS.run at startup when clock is out of sync --- Key: HBASE-3261 URL: https://issues.apache.org/jira/browse/HBASE-3261 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Fix For: 0.90.0, 0.92.0 This is what I get when I start a region server that's not properly sync'ed:
{noformat}
Exception in thread "regionserver60020" java.lang.NullPointerException
 at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:603)
 at java.lang.Thread.run(Thread.java:637)
{noformat}
In this case the line was:
{noformat}
hlogRoller.interruptIfNecessary();
{noformat}
I guess we could add a bunch of other null checks. The end result is the same, the RS dies, but I think it's misleading. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3260) Coprocessors: Lifecycle management
[ https://issues.apache.org/jira/browse/HBASE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935031#action_12935031 ] stack commented on HBASE-3260: -- I'm good with borrowing the nomenclature. Can we borrow libraries that will manage the lifecycle for us? Would it make sense implementing CPs atop some lifecycle supporting framework? Would it make sense using, say, any of the DI containers wiring up CPs? The regionserver and master have similar need of a lifecycle as has been discussed elsewhere. Would be grand if same lifecycle nomenclature was used throughout -- for hbase daemons and CP. Coprocessors: Lifecycle management -- Key: HBASE-3260 URL: https://issues.apache.org/jira/browse/HBASE-3260 Project: HBase Issue Type: Sub-task Reporter: Andrew Purtell Fix For: 0.92.0 Attachments: statechart.png Considering extending CPs to the master, we have no equivalent to pre/postOpen and pre/postClose as on the regionserver. We also should consider how to resolve dependencies and initialization ordering if loading coprocessors that depend on others. OSGi (http://en.wikipedia.org/wiki/OSGi) has a lifecycle API and is familiar to many Java programmers, so we propose to borrow its terminology and state machine. A lifecycle layer manages coprocessors as they are dynamically installed, started, stopped, updated and uninstalled. Coprocessors rely on the framework for dependency resolution and class loading. In turn, the framework calls up to lifecycle management methods in the coprocessor as needed. A coprocessor transitions between the below states over its lifetime:
||State||Description||
|UNINSTALLED|The coprocessor implementation is not installed. This is the default implicit state.|
|INSTALLED|The coprocessor implementation has been successfully installed.|
|STARTING|A coprocessor instance is being started.|
|ACTIVE|The coprocessor instance has been successfully activated and is running.|
|STOPPING|A coprocessor instance is being stopped.|
See attached state diagram. Transitions to STOPPING will only happen as the region is being closed. If a coprocessor throws an unhandled exception, this will cause the RegionServer to close the region, stopping all coprocessor instances on it. Transitions from INSTALLED->STARTING and ACTIVE->STOPPING would go through upcall methods into the coprocessor via the CoprocessorLifecycle interface:
{code:java}
public interface CoprocessorLifecycle {
  void start(CoprocessorEnvironment env) throws IOException;
  void stop(CoprocessorEnvironment env) throws IOException;
}
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
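The state table above describes a small state machine. A sketch of it as a Java enum with an explicit transition map follows; the exact set of legal transitions beyond the two named in the description (INSTALLED to STARTING, ACTIVE to STOPPING) is an assumption on my part, and the attached statechart.png is the authoritative picture:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

class CoprocessorStates {
    enum State { UNINSTALLED, INSTALLED, STARTING, ACTIVE, STOPPING }

    // Legal transitions; entries other than INSTALLED->STARTING and
    // ACTIVE->STOPPING are guesses at the rest of the OSGi-style cycle.
    static final Map<State, Set<State>> NEXT = new EnumMap<>(State.class);
    static {
        NEXT.put(State.UNINSTALLED, EnumSet.of(State.INSTALLED));
        NEXT.put(State.INSTALLED, EnumSet.of(State.STARTING, State.UNINSTALLED));
        NEXT.put(State.STARTING, EnumSet.of(State.ACTIVE));
        NEXT.put(State.ACTIVE, EnumSet.of(State.STOPPING));
        NEXT.put(State.STOPPING, EnumSet.of(State.INSTALLED));
    }

    static boolean canTransition(State from, State to) {
        return NEXT.getOrDefault(from, EnumSet.noneOf(State.class)).contains(to);
    }
}
```

Encoding the transitions as data makes the framework's job mechanical: reject any lifecycle request whose transition is not in the map.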
[jira] Commented: (HBASE-3254) Ability to specify the host published in zookeeper
[ https://issues.apache.org/jira/browse/HBASE-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935040#action_12935040 ] stack commented on HBASE-3254: -- Hey Cheddar. You got a patch that would illustrate how you'd fix this (We should be using hostnames rather than IPs I'd say)? Ability to specify the host published in zookeeper Key: HBASE-3254 URL: https://issues.apache.org/jira/browse/HBASE-3254 Project: HBase Issue Type: Improvement Affects Versions: 0.89.20100924 Reporter: Eric Tschetter We are running HBase on EC2 and I'm trying to get a client external from EC2 to connect to the cluster. But, each of the nodes appears to be publishing its IP address into zookeeper. The problem is that the nodes on EC2 see a 10. IP address that is only resolvable inside of EC2. Specifically for EC2, there is a DNS name that will resolve properly both externally and internally, so it would be nice if I could tell each of the processes what host to publish into zookeeper via a property. As it stands, I have to do ssh tunnelling/muck with the hosts file in order to get my client to connect. This problem could occur anywhere that you have a different DNS entry for public vs. private access. That might only ever happen on EC2, but it might happen elsewhere. I don't really know :). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HBASE-3249) Typing 'help shutdown' in the shell shouldn't shutdown the cluster
[ https://issues.apache.org/jira/browse/HBASE-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-3249: - Attachment: shutdown.txt I took a quick look. 'help shutdown' is being interpreted as two commands, an 'help' followed by a 'shutdown'. The interpreter is running the 'shutdown' command first and then 'help'. 'help' is a native IRB that we did some hackery to override. The 'shutdown' is part of our command-set injection. I'd need to dig in and spend some time figuring how I hacked this up. For 0.90.0, I propose the following patch where we just remove the shutdown command. Shutdown from inside the shell seems 'odd' to me. Meantime I'll open an issue to look into how this help stuff is being interpreted by IRB. Its kinda ugly. If you do 'help get' you get first complaint that get has not been passed enough parameters and then the help output followed by the total help output. Ugly. We need to fix. But don't think it blocker on 0.90 RC. Typing 'help shutdown' in the shell shouldn't shutdown the cluster -- Key: HBASE-3249 URL: https://issues.apache.org/jira/browse/HBASE-3249 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Fix For: 0.90.0 Attachments: shutdown.txt _hp_ on IRC found out the bad way that typing 'help shutdown' actually gives you the full help... and shuts down the cluster. I don't really understand why we process both commands, putting against 0.90.0 if anyone has an idea. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (HBASE-3249) Typing 'help shutdown' in the shell shouldn't shutdown the cluster
[ https://issues.apache.org/jira/browse/HBASE-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack reassigned HBASE-3249: Assignee: stack Typing 'help shutdown' in the shell shouldn't shutdown the cluster -- Key: HBASE-3249 URL: https://issues.apache.org/jira/browse/HBASE-3249 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: stack Fix For: 0.90.0 Attachments: shutdown.txt _hp_ on IRC found out the bad way that typing 'help shutdown' actually gives you the full help... and shuts down the cluster. I don't really understand why we process both commands, putting against 0.90.0 if anyone has an idea. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3249) Typing 'help shutdown' in the shell shouldn't shutdown the cluster
[ https://issues.apache.org/jira/browse/HBASE-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935047#action_12935047 ] stack commented on HBASE-3249: -- Oh, I'm looking for a +1 Typing 'help shutdown' in the shell shouldn't shutdown the cluster -- Key: HBASE-3249 URL: https://issues.apache.org/jira/browse/HBASE-3249 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: stack Fix For: 0.90.0 Attachments: shutdown.txt _hp_ on IRC found out the bad way that typing 'help shutdown' actually gives you the full help... and shuts down the cluster. I don't really understand why we process both commands, putting against 0.90.0 if anyone has an idea. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3249) Typing 'help shutdown' in the shell shouldn't shutdown the cluster
[ https://issues.apache.org/jira/browse/HBASE-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935049#action_12935049 ] Jean-Daniel Cryans commented on HBASE-3249: --- +1 Typing 'help shutdown' in the shell shouldn't shutdown the cluster -- Key: HBASE-3249 URL: https://issues.apache.org/jira/browse/HBASE-3249 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: stack Fix For: 0.90.0 Attachments: shutdown.txt _hp_ on IRC found out the bad way that typing 'help shutdown' actually gives you the full help... and shuts down the cluster. I don't really understand why we process both commands, putting against 0.90.0 if anyone has an idea. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3247) Changes API: API for pulling edits from HBase
[ https://issues.apache.org/jira/browse/HBASE-3247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935055#action_12935055 ] stack commented on HBASE-3247: -- @Steven Yes, we should start with RowLog (http://www.lilyproject.org/maven-site/0.1/apidocs/org/lilycms/rowlog/api/RowLog.html). bq. I'm wondering, as I'm seeing a proliferation of alternative yet overlapping approaches to a certain number of issues (secondary indexes, change listening) which in the end could confuse new users. -1 to proliferation of alternate yet overlapping... things What you fellas suggest for bootstrapping system -- doing a fat bulk load into the search index -- and then cutting over to rowlog for incremental updates? Doesn't there have to exact transition so followers do not miss edits? You fellas have ideas for how to do that? Changes API: API for pulling edits from HBase - Key: HBASE-3247 URL: https://issues.apache.org/jira/browse/HBASE-3247 Project: HBase Issue Type: Task Reporter: stack Talking to Shay from Elastic Search, he was asking where the Changes API is in HBase. Talking more -- there was a bit of beer involved so apologize up front -- he wants to be able to bootstrap an index and thereafter ask HBase for changes since time t. We thought he could tie into the replication stream, but rather he wants to be able to pull rather than have it pushed to him (in case he crashes, etc. so on recovery he can start pulling again from last good edit received). He could do the bootstrap with a Scan. Thereafter, requests to pull from hbase would pass a marker of some sort. HBase would then give out edits that came in after this marker, in batches, along with an updated marker. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
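The pull model described here, where the caller hands back the last marker it durably processed and receives the next batch of edits plus an updated marker, can be sketched with an in-memory list standing in for the edit stream. All types and names are illustrative; no such HBase API exists yet:

```java
import java.util.ArrayList;
import java.util.List;

class ChangesPull {
    // A batch of edits plus the marker to pass on the next pull. Because
    // the client resends its last durable marker, a crash simply replays
    // edits from that point; nothing is missed.
    static class Batch {
        final List<String> edits;
        final long nextMarker;
        Batch(List<String> edits, long nextMarker) {
            this.edits = edits;
            this.nextMarker = nextMarker;
        }
    }

    // Return up to `limit` edits that arrived after `marker`.
    static Batch pullSince(List<String> log, long marker, int limit) {
        int from = (int) Math.min(marker, log.size());
        int to = Math.min(from + limit, log.size());
        return new Batch(new ArrayList<>(log.subList(from, to)), to);
    }
}
```

The bootstrap question stack raises maps onto this cleanly: take the marker at the moment the bulk Scan starts, index the scan results, then begin pulling from that marker so no edit falls between the two phases.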
[jira] Resolved: (HBASE-3249) Typing 'help shutdown' in the shell shouldn't shutdown the cluster
[ https://issues.apache.org/jira/browse/HBASE-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack resolved HBASE-3249. -- Resolution: Fixed Hadoop Flags: [Reviewed] Committed. Thanks for review j-d. Typing 'help shutdown' in the shell shouldn't shutdown the cluster -- Key: HBASE-3249 URL: https://issues.apache.org/jira/browse/HBASE-3249 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: stack Fix For: 0.90.0 Attachments: shutdown.txt _hp_ on IRC found out the bad way that typing 'help shutdown' actually gives you the full help... and shuts down the cluster. I don't really understand why we process both commands, putting against 0.90.0 if anyone has an idea. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HBASE-3258) EOF when version file is empty
[ https://issues.apache.org/jira/browse/HBASE-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans resolved HBASE-3258. --- Resolution: Fixed Hadoop Flags: [Reviewed] Committed to trunk and 0.90, thanks Stack. EOF when version file is empty -- Key: HBASE-3258 URL: https://issues.apache.org/jira/browse/HBASE-3258 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Blocker Fix For: 0.90.0, 0.92.0 Attachments: HBASE-3258.patch I somehow was able to get an empty hbase.version file on a test machine and when I start HBase I see:
{noformat}
starting master, logging to /data/jdcryans/git/hbase/bin/../logs/hbase-jdcryans-master-hbasedev.out
Exception in thread master-hbasedev:6 java.lang.NullPointerException
 at org.apache.hadoop.hbase.master.HMaster.stopServiceThreads(HMaster.java:559)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:286)
{noformat}
And in the master's log:
{noformat}
2010-11-22 10:08:43,003 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
java.io.EOFException
 at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:323)
 at java.io.DataInputStream.readUTF(DataInputStream.java:572)
 at org.apache.hadoop.hbase.util.FSUtils.getVersion(FSUtils.java:151)
 at org.apache.hadoop.hbase.util.FSUtils.checkVersion(FSUtils.java:170)
 at org.apache.hadoop.hbase.master.MasterFileSystem.checkRootDir(MasterFileSystem.java:226)
 at org.apache.hadoop.hbase.master.MasterFileSystem.createInitialFileSystemLayout(MasterFileSystem.java:104)
 at org.apache.hadoop.hbase.master.MasterFileSystem.init(MasterFileSystem.java:89)
 at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:337)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:273)
2010-11-22 10:08:43,006 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
{noformat}
I thought that that kind of issue was solved a long time ago, but somehow it's there again. 
I'll fix by handling the EOF and also will look at that ugly NPE. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
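The EOF handling J-D describes can be sketched as follows: treat an empty or truncated version file as "no version" instead of letting the EOFException propagate and abort master startup. This mirrors FSUtils.getVersion only in spirit; the method name and null-return contract are illustrative:

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

class VersionRead {
    // Returns the version string, or null if the file is empty or was
    // truncated mid-write. The caller can then rewrite the version file
    // rather than crashing with an unhandled EOFException.
    static String readVersion(InputStream in) throws IOException {
        try (DataInputStream dis = new DataInputStream(in)) {
            return dis.readUTF();
        } catch (EOFException e) {
            return null; // empty or half-written version file
        }
    }
}
```

Combined with the tmp-then-rename write proposed in HBASE-3270, the empty-file state should become both survivable and nearly impossible to reach.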
[jira] Commented: (HBASE-3259) Can't kill the region servers when they wait on the master or the cluster state znode
[ https://issues.apache.org/jira/browse/HBASE-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935064#action_12935064 ] Jean-Daniel Cryans commented on HBASE-3259: --- Committed to trunk and 0.90, thanks Stack. Can't kill the region servers when they wait on the master or the cluster state znode - Key: HBASE-3259 URL: https://issues.apache.org/jira/browse/HBASE-3259 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Blocker Fix For: 0.90.0, 0.92.0 Attachments: HBASE-3259.patch With a situation like HBASE-3258, it's easy to have the region servers stuck on waiting for either the master or the cluster state znode since it has no timeout. You have to kill -9 them to have them shutting down. This is very bad for usability. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HBASE-3262) TestHMasterRPCException uses non-ephemeral port for master
[ https://issues.apache.org/jira/browse/HBASE-3262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Gray updated HBASE-3262: - Resolution: Fixed Hadoop Flags: [Reviewed] Status: Resolved (was: Patch Available) Committed to branch and trunk. TestHMasterRPCException uses non-ephemeral port for master -- Key: HBASE-3262 URL: https://issues.apache.org/jira/browse/HBASE-3262 Project: HBase Issue Type: Bug Affects Versions: 0.90.0 Reporter: Jonathan Gray Assignee: Jonathan Gray Fix For: 0.90.0, 0.92.0 Attachments: HBASE-3262-v1.patch TestHMasterRPCException instantiates an HMaster but doesn't use an ephemeral port which can cause the test to fail if port already in use. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HBASE-3271) Allow .META. table to be exported
Allow .META. table to be exported - Key: HBASE-3271 URL: https://issues.apache.org/jira/browse/HBASE-3271 Project: HBase Issue Type: Improvement Components: util Affects Versions: 0.20.6 Reporter: Ted Yu I tried to export .META. table in 0.20.6 and got: [had...@us01-ciqps1-name01 hbase]$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export .META. h-meta 1 0 0 10/11/23 20:59:05 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 2010-11-23 20:59:05.255::INFO: Logging to STDERR via org.mortbay.log.StdErrLog 2010-11-23 20:59:05.255::INFO: verisons=1, starttime=0, endtime=9223372036854775807 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51 GMT 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:host.name=us01-ciqps1-name01.carrieriq.com 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_21 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc. ... 10/11/23 20:59:05 INFO zookeeper.ClientCnxn: Server connection successful 10/11/23 20:59:05 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got 10.202.50.112:60020 10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Found ROOT at 10.202.50.112:60020 10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1 is us01-ciqps1-grid02.carrieriq.com:60020 Exception in thread main java.io.IOException: Expecting at least one region. 
at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:281)
 at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
 at org.apache.hadoop.hbase.mapreduce.Export.main(Export.java:146)
Related code is:
if (keys == null || keys.getFirst() == null || keys.getFirst().length == 0) {
  throw new IOException("Expecting at least one region.");
}
My intention was to save the dangling rows in .META. (for future investigation) which prevented a table from being created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3261) NPE out of HRS.run at startup when clock is out of sync
[ https://issues.apache.org/jira/browse/HBASE-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935074#action_12935074 ] Jean-Daniel Cryans commented on HBASE-3261: --- The patch attached just adds a bunch of null checks. NPE out of HRS.run at startup when clock is out of sync --- Key: HBASE-3261 URL: https://issues.apache.org/jira/browse/HBASE-3261 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.90.0, 0.92.0 Attachments: HBASE-3261.patch This is what I get when I start a region server that's not properly sync'ed: {noformat} Exception in thread "regionserver60020" java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:603) at java.lang.Thread.run(Thread.java:637) {noformat} In this case the line was: {noformat} hlogRoller.interruptIfNecessary(); {noformat} I guess we could add a bunch of other null checks. The end result is the same, the RS dies, but I think it's misleading. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
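The defensive pattern the patch describes can be sketched as follows. This is an illustrative stand-in, not the actual HBase classes: when startup aborts early (e.g. clock out of sync), helper threads such as the log roller may never have been constructed, so each shutdown call is guarded with a null check.

```java
// Sketch of the "bunch of null checks" fix. GuardedShutdown and LogRoller
// are hypothetical stand-ins for HRegionServer and its log-roller thread.
public class GuardedShutdown {
    static class LogRoller {
        void interruptIfNecessary() { /* interrupt the roller thread */ }
    }

    LogRoller hlogRoller; // may still be null if startup aborted early

    void shutdown() {
        // Without this guard, shutdown dereferences a null field and the
        // region server dies with an uninformative NullPointerException.
        if (hlogRoller != null) {
            hlogRoller.interruptIfNecessary();
        }
    }

    public static void main(String[] args) {
        GuardedShutdown rs = new GuardedShutdown();
        rs.shutdown(); // safe even though the roller was never created
        System.out.println("clean shutdown");
    }
}
```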
[jira] Updated: (HBASE-3261) NPE out of HRS.run at startup when clock is out of sync
[ https://issues.apache.org/jira/browse/HBASE-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-3261: -- Attachment: HBASE-3261.patch NPE out of HRS.run at startup when clock is out of sync --- Key: HBASE-3261 URL: https://issues.apache.org/jira/browse/HBASE-3261 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.90.0, 0.92.0 Attachments: HBASE-3261.patch This is what I get when I start a region server that's not properly sync'ed: {noformat} Exception in thread "regionserver60020" java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:603) at java.lang.Thread.run(Thread.java:637) {noformat} In this case the line was: {noformat} hlogRoller.interruptIfNecessary(); {noformat} I guess we could add a bunch of other null checks. The end result is the same, the RS dies, but I think it's misleading. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HBASE-3272) Remove no longer used options
Remove no longer used options - Key: HBASE-3272 URL: https://issues.apache.org/jira/browse/HBASE-3272 Project: HBase Issue Type: Bug Reporter: stack From Lars George list up on hbase-dev: {code} Hi, I went through the config values as per the defaults XML file (still going through it again now based on what is actually in the code, i.e. those not in defaults). Here is what I found: hbase.master.balancer.period - Only used in hbase-default.xml? hbase.regions.percheckin, hbase.regions.slop - Some tests still have it but not used anywhere else zookeeper.pause, zookeeper.retries - Never used? Only in hbase-defaults.xml And then there are differences between hardcoded and XML based defaults: hbase.client.pause - XML: 1000, hardcoded: 2000 (HBaseClient) and 30 * 1000 (HBaseAdmin) hbase.client.retries.number - XML: 10, hardcoded 5 (HBaseAdmin) and 2 (HMaster) hbase.hstore.blockingStoreFiles - XML: 7, hardcoded: -1 hbase.hstore.compactionThreshold - XML: 3, hardcoded: 2 hbase.regionserver.global.memstore.lowerLimit - XML: 0.35, hardcoded: 0.25 hbase.regionserver.handler.count - XML: 25, hardcoded: 10 hbase.regionserver.msginterval - XML: 3000, hardcoded: 1000 hbase.rest.port - XML: 8080, hardcoded: 9090 hfile.block.cache.size - XML: 0.2, hardcoded: 0.0 Finally, some keys are already in HConstants, some are in local classes and others used as literals. There is an issue open to fix this though. Just saying. Thoughts? {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
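The divergence Lars lists arises mechanically: a Hadoop-style Configuration falls back to the default passed at each call site only when the key is absent from the loaded XML, so every caller's literal becomes its own shadow default. A minimal self-contained sketch (a plain map stands in for Configuration):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrates how XML and hardcoded defaults drift apart. If the key is
// present in the loaded defaults file, the XML value wins; if it is ever
// missing, whatever literal each call site wrote wins instead.
public class DefaultsDrift {
    static final Map<String, String> xml = new HashMap<>(); // loaded hbase-default.xml

    static int getInt(String key, int callSiteDefault) {
        String v = xml.get(key);
        return v == null ? callSiteDefault : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        // hbase.client.pause: XML says 1000, one caller hardcodes 2000.
        xml.put("hbase.client.pause", "1000");
        System.out.println(getInt("hbase.client.pause", 2000)); // 1000: XML wins
        xml.remove("hbase.client.pause");
        System.out.println(getInt("hbase.client.pause", 2000)); // 2000: caller's literal wins
    }
}
```

This is why the issue matters: two call sites with different literals (2000 in HBaseClient, 30 * 1000 in HBaseAdmin) can silently behave differently the moment the XML entry is dropped or renamed.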
[jira] Commented: (HBASE-3272) Remove no longer used options
[ https://issues.apache.org/jira/browse/HBASE-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935098#action_12935098 ] Jean-Daniel Cryans commented on HBASE-3272: --- +1 Looking at the patch, it made me remember that we wanted to raise the default for blocking store file to something like 12. Do we still want to do this? Remove no longer used options - Key: HBASE-3272 URL: https://issues.apache.org/jira/browse/HBASE-3272 Project: HBase Issue Type: Bug Reporter: stack Assignee: stack Fix For: 0.90.0 Attachments: 3272.txt From Lars George list up on hbase-dev: {code} Hi, I went through the config values as per the defaults XML file (still going through it again now based on what is actually in the code, i.e. those not in defaults). Here is what I found: hbase.master.balancer.period - Only used in hbase-default.xml? hbase.regions.percheckin, hbase.regions.slop - Some tests still have it but not used anywhere else zookeeper.pause, zookeeper.retries - Never used? Only in hbase-defaults.xml And then there are differences between hardcoded and XML based defaults: hbase.client.pause - XML: 1000, hardcoded: 2000 (HBaseClient) and 30 * 1000 (HBaseAdmin) hbase.client.retries.number - XML: 10, hardcoded 5 (HBaseAdmin) and 2 (HMaster) hbase.hstore.blockingStoreFiles - XML: 7, hardcoded: -1 hbase.hstore.compactionThreshold - XML: 3, hardcoded: 2 hbase.regionserver.global.memstore.lowerLimit - XML: 0.35, hardcoded: 0.25 hbase.regionserver.handler.count - XML: 25, hardcoded: 10 hbase.regionserver.msginterval - XML: 3000, hardcoded: 1000 hbase.rest.port - XML: 8080, hardcoded: 9090 hfile.block.cache.size - XML: 0.2, hardcoded: 0.0 Finally, some keys are already in HConstants, some are in local classes and others used as literals. There is an issue open to fix this though. Just saying. Thoughts? {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3254) Ability to specify the host published in zookeeper
[ https://issues.apache.org/jira/browse/HBASE-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935102#action_12935102 ] Eric Tschetter commented on HBASE-3254: --- Not yet. I worked around it by mucking with my hosts file and setting up an SSH-based SOCKS proxy for now. If I get some time, I'll take a stab at a patch. I think that it should be reasonable to just have a System/hbase conf property that the HRegionServer and HMaster look at to get the address that they publish. If that is not set, then just do what it does now. Ability to specify the host published in zookeeper Key: HBASE-3254 URL: https://issues.apache.org/jira/browse/HBASE-3254 Project: HBase Issue Type: Improvement Affects Versions: 0.89.20100924 Reporter: Eric Tschetter We are running HBase on EC2 and I'm trying to get a client external from EC2 to connect to the cluster. But, each of the nodes appears to be publishing its IP address into zookeeper. The problem is that the nodes on EC2 see a 10. IP address that is only resolvable inside of EC2. 
Specifically for EC2, there is a DNS name that will resolve properly both externally and internally, so it would be nice if I could tell each of the processes what host to publish into zookeeper via a property. As it stands, I have to do ssh tunnelling/muck with the hosts file in order to get my client to connect. This problem could occur anywhere that you have a different DNS entry for public vs. private access. That might only ever happen on EC2, but it might happen elsewhere. I don't really know :). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
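Eric's suggested fix could look roughly like the sketch below. The property name `hbase.server.published.hostname` is invented here for illustration; the actual patch would pick whatever key fits HBase's naming conventions. The idea is simply: consult an override first, and only fall back to local resolution (the current behavior) when it is unset.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Properties;

// Hypothetical sketch: what a server publishes into ZooKeeper becomes
// configurable instead of always being the locally resolved address.
public class PublishedHostname {
    static String publishedName(Properties conf) {
        String override = conf.getProperty("hbase.server.published.hostname");
        if (override != null && !override.isEmpty()) {
            return override; // e.g. the EC2 public DNS name, valid inside and out
        }
        try {
            return InetAddress.getLocalHost().getHostName(); // current behavior
        } catch (UnknownHostException e) {
            return "localhost";
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("hbase.server.published.hostname",
                "ec2-1-2-3-4.compute-1.amazonaws.com");
        System.out.println(publishedName(conf)); // the override, not the 10.x address
    }
}
```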
[jira] Commented: (HBASE-3260) Coprocessors: Lifecycle management
[ https://issues.apache.org/jira/browse/HBASE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935111#action_12935111 ] Andrew Purtell commented on HBASE-3260: --- We could use something like Guice as a lightweight DI framework within HBase code in general, but I think this is orthogonal to what coprocessors tries to achieve. bq. But each coprocessor is currently a separate island the relationship between them seems more akin to chained servlet filters on a request than independent components. Chained servlet filters is a good analogy. Need to add client-transparent compression support to your webapp? Register a compression filter on the chain. Need to add client-transparent value compression to your table? Register a value compression coprocessor on the region. Coprocessors: Lifecycle management -- Key: HBASE-3260 URL: https://issues.apache.org/jira/browse/HBASE-3260 Project: HBase Issue Type: Sub-task Reporter: Andrew Purtell Fix For: 0.92.0 Attachments: statechart.png Considering extending CPs to the master, we have no equivalent to pre/postOpen and pre/postClose as on the regionserver. We also should consider how to resolve dependencies and initialization ordering if loading coprocessors that depend on others. OSGi (http://en.wikipedia.org/wiki/OSGi) has a lifecycle API and is familiar to many Java programmers, so we propose to borrow its terminology and state machine. A lifecycle layer manages coprocessors as they are dynamically installed, started, stopped, updated and uninstalled. Coprocessors rely on the framework for dependency resolution and class loading. In turn, the framework calls up to lifecycle management methods in the coprocessor as needed. A coprocessor transitions between the below states over its lifetime: ||State||Description|| |UNINSTALLED|The coprocessor implementation is not installed. 
This is the default implicit state.| |INSTALLED|The coprocessor implementation has been successfully installed.| |STARTING|A coprocessor instance is being started.| |ACTIVE|The coprocessor instance has been successfully activated and is running.| |STOPPING|A coprocessor instance is being stopped.| See attached state diagram. Transitions to STOPPING will only happen as the region is being closed. If a coprocessor throws an unhandled exception, this will cause the RegionServer to close the region, stopping all coprocessor instances on it. Transitions from INSTALLED->STARTING and ACTIVE->STOPPING would go through upcall methods into the coprocessor via the CoprocessorLifecycle interface: {code:java} public interface CoprocessorLifecycle { void start(CoprocessorEnvironment env) throws IOException; void stop(CoprocessorEnvironment env) throws IOException; } {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
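A coprocessor against the proposed interface might look like the self-contained sketch below. CoprocessorEnvironment is stubbed as an empty interface here; in the proposal it would be supplied by the hosting framework, and the STARTING/STOPPING transitions happen inside the start()/stop() upcalls.

```java
import java.io.IOException;

// Sketch of a coprocessor implementing the proposed lifecycle interface.
// The environment type is a stub; the hosting region server would provide it.
public class LifecycleSketch {
    interface CoprocessorEnvironment { }

    interface CoprocessorLifecycle {
        void start(CoprocessorEnvironment env) throws IOException;
        void stop(CoprocessorEnvironment env) throws IOException;
    }

    // INSTALLED -> STARTING -> ACTIVE happens via start();
    // ACTIVE -> STOPPING happens via stop(), e.g. when the region closes.
    static class ValueCompressionCoprocessor implements CoprocessorLifecycle {
        boolean active;
        public void start(CoprocessorEnvironment env) { active = true; }
        public void stop(CoprocessorEnvironment env)  { active = false; }
    }

    public static void main(String[] args) throws IOException {
        ValueCompressionCoprocessor cp = new ValueCompressionCoprocessor();
        cp.start(null);
        System.out.println(cp.active);  // true: instance is ACTIVE
        cp.stop(null);
        System.out.println(cp.active);  // false: instance has stopped
    }
}
```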
[jira] Resolved: (HBASE-3272) Remove no longer used options
[ https://issues.apache.org/jira/browse/HBASE-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack resolved HBASE-3272. -- Resolution: Fixed Hadoop Flags: [Reviewed] Committed to branch and trunk. Remove no longer used options - Key: HBASE-3272 URL: https://issues.apache.org/jira/browse/HBASE-3272 Project: HBase Issue Type: Bug Reporter: stack Assignee: stack Fix For: 0.90.0 Attachments: 3272.txt From Lars George list up on hbase-dev: {code} Hi, I went through the config values as per the defaults XML file (still going through it again now based on what is actually in the code, i.e. those not in defaults). Here is what I found: hbase.master.balancer.period - Only used in hbase-default.xml? hbase.regions.percheckin, hbase.regions.slop - Some tests still have it but not used anywhere else zookeeper.pause, zookeeper.retries - Never used? Only in hbase-defaults.xml And then there are differences between hardcoded and XML based defaults: hbase.client.pause - XML: 1000, hardcoded: 2000 (HBaseClient) and 30 * 1000 (HBaseAdmin) hbase.client.retries.number - XML: 10, hardcoded 5 (HBaseAdmin) and 2 (HMaster) hbase.hstore.blockingStoreFiles - XML: 7, hardcoded: -1 hbase.hstore.compactionThreshold - XML: 3, hardcoded: 2 hbase.regionserver.global.memstore.lowerLimit - XML: 0.35, hardcoded: 0.25 hbase.regionserver.handler.count - XML: 25, hardcoded: 10 hbase.regionserver.msginterval - XML: 3000, hardcoded: 1000 hbase.rest.port - XML: 8080, hardcoded: 9090 hfile.block.cache.size - XML: 0.2, hardcoded: 0.0 Finally, some keys are already in HConstants, some are in local classes and others used as literals. There is an issue open to fix this though. Just saying. Thoughts? {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (HBASE-3273) Set the ZK default timeout to 3 minutes
Set the ZK default timeout to 3 minutes --- Key: HBASE-3273 URL: https://issues.apache.org/jira/browse/HBASE-3273 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.90.0, 0.92.0 Following HBASE-3272, Stack suggested that we up the ZK timeout and proposed that we set it to 3 minutes (he said that last part in person). This should cover most of the big GC pauses out there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (HBASE-3259) Can't kill the region servers when they wait on the master or the cluster state znode
[ https://issues.apache.org/jira/browse/HBASE-3259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack resolved HBASE-3259. -- Resolution: Fixed Hadoop Flags: [Reviewed] Committed. Closing. Can't kill the region servers when they wait on the master or the cluster state znode - Key: HBASE-3259 URL: https://issues.apache.org/jira/browse/HBASE-3259 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Priority: Blocker Fix For: 0.90.0, 0.92.0 Attachments: HBASE-3259.patch With a situation like HBASE-3258, it's easy to have the region servers stuck on waiting for either the master or the cluster state znode since it has no timeout. You have to kill -9 them to have them shutting down. This is very bad for usability. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3246) Add API to Increment client class that increments rather than replaces the amount for a column when done multiple times
[ https://issues.apache.org/jira/browse/HBASE-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935136#action_12935136 ] Jonathan Gray commented on HBASE-3246: -- I would be okay with just removing addColumn and having a single call, incrementColumn(). That method would be additive w/ existing calls for a given column. If you wanted to somehow undo existing increments of columns, you would have to start over with a new Increment. I could also add a separate call, resetColumn(), but then things start to get confusing again. Everyone okay with removing addColumn() and just leaving incrementColumn()? Add API to Increment client class that increments rather than replaces the amount for a column when done multiple times --- Key: HBASE-3246 URL: https://issues.apache.org/jira/browse/HBASE-3246 Project: HBase Issue Type: Improvement Components: client Reporter: Jonathan Gray Assignee: Jonathan Gray Attachments: HBASE-3246-v1.patch In the new Increment class, the API to add columns is {{addColumn()}}. If you do this multiple times for an individual column, the amount to increment by is replaced. I think this is the right way for this method to work and it is javadoc'd with the behavior. We should add a new method, {{incrementColumn()}} which will increment any existing amount for the specified column rather than replacing it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
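The two semantics under discussion can be shown side by side with a plain map standing in for the Increment class's per-column pending amounts (this is an illustration of the semantics, not the actual client class):

```java
import java.util.HashMap;
import java.util.Map;

// addColumn() replaces the pending amount for a column on repeat calls;
// the proposed incrementColumn() adds to it instead.
public class IncrementSemantics {
    final Map<String, Long> amounts = new HashMap<>();

    void addColumn(String col, long amount) {       // replace semantics
        amounts.put(col, amount);
    }

    void incrementColumn(String col, long amount) { // additive semantics
        amounts.merge(col, amount, Long::sum);
    }

    public static void main(String[] args) {
        IncrementSemantics inc = new IncrementSemantics();
        inc.addColumn("f:q", 5);
        inc.addColumn("f:q", 3);
        System.out.println(inc.amounts.get("f:q")); // 3: last call wins
        inc.incrementColumn("f:q", 4);
        System.out.println(inc.amounts.get("f:q")); // 7: amounts accumulate
    }
}
```

Removing addColumn() entirely, as proposed, leaves only the additive path; "undoing" pending increments then means building a fresh Increment, which keeps the API small.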
[jira] Commented: (HBASE-3246) Add API to Increment client class that increments rather than replaces the amount for a column when done multiple times
[ https://issues.apache.org/jira/browse/HBASE-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935137#action_12935137 ] Jonathan Gray commented on HBASE-3246: -- @Stack, and yeah, all of our client classes are not thread-safe. If you wanted to use an Increment across threads you'd have to do your own synchronization, but I think that would be kind of odd. Can add a note that class is not thread-safe in the javadoc. Add API to Increment client class that increments rather than replaces the amount for a column when done multiple times --- Key: HBASE-3246 URL: https://issues.apache.org/jira/browse/HBASE-3246 Project: HBase Issue Type: Improvement Components: client Reporter: Jonathan Gray Assignee: Jonathan Gray Attachments: HBASE-3246-v1.patch In the new Increment class, the API to add columns is {{addColumn()}}. If you do this multiple times for an individual column, the amount to increment by is replaced. I think this is the right way for this method to work and it is javadoc'd with the behavior. We should add a new method, {{incrementColumn()}} which will increment any existing amount for the specified column rather than replacing it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (HBASE-3273) Set the ZK default timeout to 3 minutes
[ https://issues.apache.org/jira/browse/HBASE-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-3273: - Attachment: doc_of_three_minute.txt Here is a patch for the manual adding zookeeper.session.timeout to the list of configurations we suggest you change, with an explanation of why it's set to a long three-minute default timeout. Set the ZK default timeout to 3 minutes --- Key: HBASE-3273 URL: https://issues.apache.org/jira/browse/HBASE-3273 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.90.0, 0.92.0 Attachments: doc_of_three_minute.txt Following HBASE-3272, Stack suggested that we up the ZK timeout and proposed that we set it to 3 minutes (he said that last part in person). This should cover most of the big GC pauses out there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
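The arithmetic behind the choice is simple and worth stating explicitly, since zookeeper.session.timeout is specified in milliseconds: the session timeout must outlast the worst GC pause you expect, otherwise ZooKeeper expires the region server's session mid-pause and the master declares it dead. A trivial sketch (the 120 s pause figure is just an example, not from the issue):

```java
// Three minutes in milliseconds vs. an example worst-case GC pause.
public class ZkTimeoutMath {
    public static void main(String[] args) {
        int timeoutMs = 3 * 60 * 1000;   // proposed default: 180000 ms
        int worstGcPauseMs = 120_000;    // hypothetical long stop-the-world pause
        System.out.println(timeoutMs);                  // 180000
        System.out.println(timeoutMs > worstGcPauseMs); // true: session survives
    }
}
```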
[jira] Updated: (HBASE-3273) Set the ZK default timeout to 3 minutes
[ https://issues.apache.org/jira/browse/HBASE-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans updated HBASE-3273: -- Attachment: HBASE-3273.patch Patch that changes the timeout to 3min and that fixes HQuorumPeer to use the new configuration introduced in ZK 3.3.0 Set the ZK default timeout to 3 minutes --- Key: HBASE-3273 URL: https://issues.apache.org/jira/browse/HBASE-3273 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.90.0, 0.92.0 Attachments: doc_of_three_minute.txt, HBASE-3273.patch Following HBASE-3272, Stack suggested that we up the ZK timeout and proposed that we set it to 3 minutes (he said that last part in person). This should cover most of the big GC pauses out there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3273) Set the ZK default timeout to 3 minutes
[ https://issues.apache.org/jira/browse/HBASE-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935156#action_12935156 ] Jean-Daniel Cryans commented on HBASE-3273: --- Regarding the documentation: bq. The default timeout is three minutes I would add: The default timeout is three minutes (specified in milliseconds) bq. This means that if a server crash Shouldn't it be crashes? Also there's a typo later in intriciacies. Set the ZK default timeout to 3 minutes --- Key: HBASE-3273 URL: https://issues.apache.org/jira/browse/HBASE-3273 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.90.0, 0.92.0 Attachments: doc_of_three_minute.txt, HBASE-3273.patch Following HBASE-3272, Stack suggested that we up the ZK timeout and proposed that we set it to 3 minutes (he said that last part in person). This should cover most of the big GC pauses out there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3273) Set the ZK default timeout to 3 minutes
[ https://issues.apache.org/jira/browse/HBASE-3273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935180#action_12935180 ] stack commented on HBASE-3273: -- +1 on your patch and on the doc fixes. Want to make the doc fixes you suggest above when you commit the doc alongside your commit of your patch? Set the ZK default timeout to 3 minutes --- Key: HBASE-3273 URL: https://issues.apache.org/jira/browse/HBASE-3273 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.90.0, 0.92.0 Attachments: doc_of_three_minute.txt, HBASE-3273.patch Following HBASE-3272, Stack suggested that we up the ZK timeout and proposed that we set it to 3 minutes (he said that last part in person). This should cover most of the big GC pauses out there. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3246) Add API to Increment client class that increments rather than replaces the amount for a column when done multiple times
[ https://issues.apache.org/jira/browse/HBASE-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935182#action_12935182 ] stack commented on HBASE-3246: -- I'm fine w/ removing addColumn. Lets get the change in before we cut new RC. Add API to Increment client class that increments rather than replaces the amount for a column when done multiple times --- Key: HBASE-3246 URL: https://issues.apache.org/jira/browse/HBASE-3246 Project: HBase Issue Type: Improvement Components: client Reporter: Jonathan Gray Assignee: Jonathan Gray Attachments: HBASE-3246-v1.patch In the new Increment class, the API to add columns is {{addColumn()}}. If you do this multiple times for an individual column, the amount to increment by is replaced. I think this is the right way for this method to work and it is javadoc'd with the behavior. We should add a new method, {{incrementColumn()}} which will increment any existing amount for the specified column rather than replacing it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3260) Coprocessors: Lifecycle management
[ https://issues.apache.org/jira/browse/HBASE-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935183#action_12935183 ] stack commented on HBASE-3260: -- Thanks for entertaining my random ramble Gary and Andrew. Your reasoning seems good to me. Coprocessors: Lifecycle management -- Key: HBASE-3260 URL: https://issues.apache.org/jira/browse/HBASE-3260 Project: HBase Issue Type: Sub-task Reporter: Andrew Purtell Fix For: 0.92.0 Attachments: statechart.png Considering extending CPs to the master, we have no equivalent to pre/postOpen and pre/postClose as on the regionserver. We also should consider how to resolve dependencies and initialization ordering if loading coprocessors that depend on others. OSGi (http://en.wikipedia.org/wiki/OSGi) has a lifecycle API and is familiar to many Java programmers, so we propose to borrow its terminology and state machine. A lifecycle layer manages coprocessors as they are dynamically installed, started, stopped, updated and uninstalled. Coprocessors rely on the framework for dependency resolution and class loading. In turn, the framework calls up to lifecycle management methods in the coprocessor as needed. A coprocessor transitions between the below states over its lifetime: ||State||Description|| |UNINSTALLED|The coprocessor implementation is not installed. This is the default implicit state.| |INSTALLED|The coprocessor implementation has been successfully installed| |STARTING|A coprocessor instance is being started.| |ACTIVE|The coprocessor instance has been successfully activated and is running.| |STOPPING|A coprocessor instance is being stopped.| See attached state diagram. Transitions to STOPPING will only happen as the region is being closed. If a coprocessor throws an unhandled exception, this will cause the RegionServer to close the region, stopping all coprocessor instances on it. 
Transitions from INSTALLED->STARTING and ACTIVE->STOPPING would go through upcall methods into the coprocessor via the CoprocessorLifecycle interface: {code:java} public interface CoprocessorLifecycle { void start(CoprocessorEnvironment env) throws IOException; void stop(CoprocessorEnvironment env) throws IOException; } {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3271) Allow .META. table to be exported
[ https://issues.apache.org/jira/browse/HBASE-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935184#action_12935184 ] Ted Yu commented on HBASE-3271: --- I used this code: if (keys == null || keys.getFirst() == null || keys.getFirst().length == 0) { HRegionLocation regLoc = table.getRegionLocation(HConstants.EMPTY_BYTE_ARRAY); if (null == regLoc) throw new IOException("Expecting at least one region."); List<InputSplit> splits = new ArrayList<InputSplit>(1); InputSplit split = new TableSplit(table.getTableName(), HConstants.EMPTY_BYTE_ARRAY, HConstants.EMPTY_BYTE_ARRAY, regLoc.getServerAddress().getHostname()); splits.add(split); return splits; } The following command only exports rows in .META. which have 'packageindex' (refer to HBASE-3255): bin/hbase org.apache.hadoop.hbase.mapreduce.Export .META. h-meta 1 0 0 packageindex -rwxrwxrwx 1 hadoop users 90700 Nov 24 03:31 h-meta/part-m-0 Allow .META. table to be exported - Key: HBASE-3271 URL: https://issues.apache.org/jira/browse/HBASE-3271 Project: HBase Issue Type: Improvement Components: util Affects Versions: 0.20.6 Reporter: Ted Yu I tried to export .META. table in 0.20.6 and got: [had...@us01-ciqps1-name01 hbase]$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export .META. h-meta 1 0 0 10/11/23 20:59:05 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 2010-11-23 20:59:05.255::INFO: Logging to STDERR via org.mortbay.log.StdErrLog 2010-11-23 20:59:05.255::INFO: verisons=1, starttime=0, endtime=9223372036854775807 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51 GMT 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:host.name=us01-ciqps1-name01.carrieriq.com 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_21 10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc. ... 
10/11/23 20:59:05 INFO zookeeper.ClientCnxn: Server connection successful 10/11/23 20:59:05 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got 10.202.50.112:60020 10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Found ROOT at 10.202.50.112:60020 10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1 is us01-ciqps1-grid02.carrieriq.com:60020 Exception in thread "main" java.io.IOException: Expecting at least one region. at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:281) at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779) at org.apache.hadoop.mapreduce.Job.submit(Job.java:432) at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447) at org.apache.hadoop.hbase.mapreduce.Export.main(Export.java:146) Related code is: if (keys == null || keys.getFirst() == null || keys.getFirst().length == 0) { throw new IOException("Expecting at least one region."); } My intention was to save the dangling rows in .META. (for future investigation) which prevented a table from being created. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3254) Ability to specify the host published in zookeeper
[ https://issues.apache.org/jira/browse/HBASE-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935187#action_12935187 ] stack commented on HBASE-3254: -- A patch whereby optionally servers would register themselves in zk using a suggested hostname seems reasonable (Only tricky part is that a RegionServer will use the 'name' the master tells it use -- see the reportForDuty code in HRegionServer. RegionServer on startup reads its 'address' then volunteers this to the Master but the Master could change it on the RegionServer. Subsequently after first reportForDuty, the regionserver will always checkin using the name the Master told it. In fact, you might be able to exploit this behavior by patching the Master only?) Ability to specify the host published in zookeeper Key: HBASE-3254 URL: https://issues.apache.org/jira/browse/HBASE-3254 Project: HBase Issue Type: Improvement Affects Versions: 0.89.20100924 Reporter: Eric Tschetter We are running HBase on EC2 and I'm trying to get a client external from EC2 to connect to the cluster. But, each of the nodes appears to be publishing its IP address into zookeeper. The problem is that the nodes on EC2 see a 10. IP address that is only resolvable inside of EC2. Specifically for EC2, there is a DNS name that will resolve properly both externally and internally, so it would be nice if I could tell each of the processes what host to publish into zookeeper via a property. As it stands, I have to do ssh tunnelling/muck with the hosts file in order to get my client to connect. This problem could occur anywhere that you have a different DNS entry for public vs. private access. That might only ever happen on EC2, but it might happen elsewhere. I don't really know :). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3261) NPE out of HRS.run at startup when clock is out of sync
[ https://issues.apache.org/jira/browse/HBASE-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935188#action_12935188 ] stack commented on HBASE-3261: -- +1 NPE out of HRS.run at startup when clock is out of sync --- Key: HBASE-3261 URL: https://issues.apache.org/jira/browse/HBASE-3261 Project: HBase Issue Type: Bug Reporter: Jean-Daniel Cryans Assignee: Jean-Daniel Cryans Fix For: 0.90.0, 0.92.0 Attachments: HBASE-3261.patch This is what I get when I start a region server that's not properly sync'ed: {noformat} Exception in thread "regionserver60020" java.lang.NullPointerException at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:603) at java.lang.Thread.run(Thread.java:637) {noformat} In this case the line was: {noformat} hlogRoller.interruptIfNecessary(); {noformat} I guess we could add a bunch of other null checks. The end result is the same, the RS dies, but I think it's misleading. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3271) Allow .META. table to be exported
[ https://issues.apache.org/jira/browse/HBASE-3271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935190#action_12935190 ]

stack commented on HBASE-3271:
------------------------------

Any chance of your making the code above into a patch and attaching it to this issue so we can review it, Ted? (This might be of help: http://www.apache.org/dev/contributors.html#patches). Thanks.

Allow .META. table to be exported
---------------------------------

Key: HBASE-3271
URL: https://issues.apache.org/jira/browse/HBASE-3271
Project: HBase
Issue Type: Improvement
Components: util
Affects Versions: 0.20.6
Reporter: Ted Yu

I tried to export the .META. table in 0.20.6 and got:
{noformat}
[had...@us01-ciqps1-name01 hbase]$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export .META. h-meta 1 0 0
10/11/23 20:59:05 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
2010-11-23 20:59:05.255::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
2010-11-23 20:59:05.255::INFO: verisons=1, starttime=0, endtime=9223372036854775807
10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51 GMT
10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:host.name=us01-ciqps1-name01.carrieriq.com
10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:java.version=1.6.0_21
10/11/23 20:59:05 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
...
10/11/23 20:59:05 INFO zookeeper.ClientCnxn: Server connection successful
10/11/23 20:59:05 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got 10.202.50.112:60020
10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Found ROOT at 10.202.50.112:60020
10/11/23 20:59:05 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1 is us01-ciqps1-grid02.carrieriq.com:60020
Exception in thread "main" java.io.IOException: Expecting at least one region.
        at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:281)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
        at org.apache.hadoop.hbase.mapreduce.Export.main(Export.java:146)
{noformat}
The related code is:
{code}
if (keys == null || keys.getFirst() == null || keys.getFirst().length == 0) {
  throw new IOException("Expecting at least one region.");
}
{code}
My intention was to save the dangling rows in .META. (for future investigation) which prevented a table from being created.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
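One possible direction for the improvement can be sketched from the guard quoted above: treat a missing start-key set as a single full-table split instead of aborting. This is a hypothetical relaxation for illustration, not the actual TableInputFormatBase code; the method name and types are invented stand-ins.

```java
// Hedged sketch of relaxing the getSplits() precondition quoted in the
// issue: an empty start-key set becomes one split covering the whole table
// rather than an IOException. Names here are illustrative, not HBase's.
import java.io.IOException;
import java.util.Collections;
import java.util.List;

public class SplitsSketch {
    static List<byte[]> computeSplits(List<byte[]> startKeys) throws IOException {
        if (startKeys == null) {
            // A truly missing key set is still an error, as in the quoted code
            throw new IOException("Expecting at least one region.");
        }
        if (startKeys.isEmpty()) {
            // Hypothetical: fall back to a single split spanning the table
            // (an empty byte[] is the open-ended start key convention)
            return Collections.singletonList(new byte[0]);
        }
        return startKeys;
    }

    public static void main(String[] args) throws IOException {
        // An empty key list no longer aborts the Export job in this sketch
        System.out.println("splits=" + computeSplits(Collections.emptyList()).size());
    }
}
```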
[jira] Commented: (HBASE-2888) Review all our metrics
[ https://issues.apache.org/jira/browse/HBASE-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935193#action_12935193 ]

stack commented on HBASE-2888:
------------------------------

@Alex 1. and 2. sound great. How would 3. differ from the current log? All it has in it is 'events', no?

Regarding your question about process, I'd say there's no need for an issue per metric, at least not for now. The way it usually runs is that we have a fat umbrella issue like this one, a bunch of work gets done under the umbrella -- in this case, a bunch of the above will be addressed by the patch -- and then subsequent amendments or additions get done in separate issues. Hope this helps.

Review all our metrics
----------------------

Key: HBASE-2888
URL: https://issues.apache.org/jira/browse/HBASE-2888
Project: HBase
Issue Type: Improvement
Components: master
Reporter: Jean-Daniel Cryans
Fix For: 0.92.0

HBase publishes a bunch of metrics, some useful and some wasteful, that should be improved to deliver a better ops experience. Examples:
- Block cache hit ratio converges at some point and stops moving
- fsReadLatency goes down when compactions are running
- storefileIndexSizeMB is the exact same number once a system is serving production load

We could use new metrics too.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3269) HBase table truncate semantics seems broken as disable table is now async by default.
[ https://issues.apache.org/jira/browse/HBASE-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935196#action_12935196 ]

stack commented on HBASE-3269:
------------------------------

This is the fix:
{code}
pynchon-432:clean_trunk stack$ svn diff
Index: src/main/ruby/hbase/admin.rb
===
--- src/main/ruby/hbase/admin.rb	(revision 1038464)
+++ src/main/ruby/hbase/admin.rb	(working copy)
@@ -83,7 +83,7 @@
   # Disables a table
   def disable(table_name)
     return unless enabled?(table_name)
-    @admin.disableTableAsync(table_name)
+    @admin.disableTable(table_name)
   end
 #--
{code}
I'm just going to commit.

HBase table truncate semantics seems broken as disable table is now async by default.
-------------------------------------------------------------------------------------

Key: HBASE-3269
URL: https://issues.apache.org/jira/browse/HBASE-3269
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.0
Environment: RHEL5 x86_64
Reporter: Suraj Varma
Assignee: stack
Priority: Critical
Fix For: 0.90.0, 0.92.0

The new async design for disable table seems to have caused a side effect on the truncate command (IRC chat with jdcryans).

Apparent cause: disable is now async by default. When truncate is called, the disable operation returns immediately, and when drop is then called, the disable operation is still not complete. This results in HMaster.checkTableModifiable() throwing a TableNotDisabledException. In earlier versions, disable returned only after the table was disabled.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
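The race the issue describes can be sketched in a few lines: an async disable returns before the table reaches DISABLED, so a drop issued immediately afterwards sees an enabled table. A synchronous disable (what the shell fix switches to) waits out the transition. This toy model is an assumption-laden illustration; the state machine and poll count are invented, only the disableTable/disableTableAsync distinction comes from the issue.

```java
// Hedged sketch of the truncate race: async disable returns immediately,
// so code that drops right after must instead wait for the DISABLED state.
public class DisableWaitSketch {
    enum State { ENABLED, DISABLING, DISABLED }

    private State state = State.ENABLED;
    private int pollsUntilDisabled = 3; // simulate the async transition

    void disableTableAsync() {
        state = State.DISABLING; // returns before the table is DISABLED
    }

    boolean isDisabled() {
        if (state == State.DISABLING && --pollsUntilDisabled <= 0) {
            state = State.DISABLED;
        }
        return state == State.DISABLED;
    }

    // Synchronous disable, matching the shell fix's call to disableTable:
    // kick off the async operation, then poll until it completes.
    void disableTable() {
        disableTableAsync();
        while (!isDisabled()) { /* poll until the transition completes */ }
    }

    public static void main(String[] args) {
        DisableWaitSketch admin = new DisableWaitSketch();
        admin.disableTable();
        // Only now is it safe to drop the table
        System.out.println("table state: " + admin.state);
    }
}
```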
[jira] Resolved: (HBASE-3269) HBase table truncate semantics seems broken as disable table is now async by default.
[ https://issues.apache.org/jira/browse/HBASE-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-3269.
--------------------------
Resolution: Fixed

Committed branch and trunk.

HBase table truncate semantics seems broken as disable table is now async by default.
-------------------------------------------------------------------------------------

Key: HBASE-3269
URL: https://issues.apache.org/jira/browse/HBASE-3269
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.0
Environment: RHEL5 x86_64
Reporter: Suraj Varma
Assignee: stack
Priority: Critical
Fix For: 0.90.0, 0.92.0

The new async design for disable table seems to have caused a side effect on the truncate command (IRC chat with jdcryans).

Apparent cause: disable is now async by default. When truncate is called, the disable operation returns immediately, and when drop is then called, the disable operation is still not complete. This results in HMaster.checkTableModifiable() throwing a TableNotDisabledException. In earlier versions, disable returned only after the table was disabled.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-3269) HBase table truncate semantics seems broken as disable table is now async by default.
[ https://issues.apache.org/jira/browse/HBASE-3269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935199#action_12935199 ]

stack commented on HBASE-3269:
------------------------------

Oh, checking for other mentions of async, I found that enable is also async in the shell, so I changed that too. I committed the change under this issue.

HBase table truncate semantics seems broken as disable table is now async by default.
-------------------------------------------------------------------------------------

Key: HBASE-3269
URL: https://issues.apache.org/jira/browse/HBASE-3269
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.0
Environment: RHEL5 x86_64
Reporter: Suraj Varma
Assignee: stack
Priority: Critical
Fix For: 0.90.0, 0.92.0

The new async design for disable table seems to have caused a side effect on the truncate command (IRC chat with jdcryans).

Apparent cause: disable is now async by default. When truncate is called, the disable operation returns immediately, and when drop is then called, the disable operation is still not complete. This results in HMaster.checkTableModifiable() throwing a TableNotDisabledException. In earlier versions, disable returned only after the table was disabled.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HBASE-2888) Review all our metrics
[ https://issues.apache.org/jira/browse/HBASE-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935217#action_12935217 ]

Alex Baranau commented on HBASE-2888:
-------------------------------------

bq. How would 3. differ from current log?

The difference is to have this shown somewhere in a more user-friendly place, e.g. *on the web interface*. The point I wanted to stress is that currently users/ops have to go to the log files for this information, which isn't straightforward for them (and leads to questions on the ML, as I linked to). I think so, anyway, but I might be wrong. Thanks.

Review all our metrics
----------------------

Key: HBASE-2888
URL: https://issues.apache.org/jira/browse/HBASE-2888
Project: HBase
Issue Type: Improvement
Components: master
Reporter: Jean-Daniel Cryans
Fix For: 0.92.0

HBase publishes a bunch of metrics, some useful and some wasteful, that should be improved to deliver a better ops experience. Examples:
- Block cache hit ratio converges at some point and stops moving
- fsReadLatency goes down when compactions are running
- storefileIndexSizeMB is the exact same number once a system is serving production load

We could use new metrics too.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.