[jira] Commented: (HBASE-50) Snapshot of table
[ https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880093#action_12880093 ] Li Chongxin commented on HBASE-50:
--
bq. Fail with a warning. A nice-to-have would be your suggestion of restoring snapshot into a table named something other than the original table's name (Fixing this issue is low-priority IMO).
bq. .. it's a good idea to allow snapshot restore to a new table name while the original table is still online. And the restored snapshot should be able to share HFiles with the original table
I will make this issue a low-priority sub-task. One more question: besides the metadata and log files, what other data needs to be taken care of to rename the snapshot to a new table name? Are there any other files (e.g. HFiles) that contain the table name?
bq. ... didn't we discuss that .META. might not be the place to keep snapshot data since regions are deleted when the system is done w/ them (but a snapshot may outlive a particular region).
I misunderstood... I thought you were talking about creating a new catalog table 'snapshot' to keep the metadata of snapshots, such as creation time. In the current design, a region will not be deleted if it is still used by a snapshot, even if the system is done with it. Such a region would probably be marked as 'deleted' in .META. This is discussed in sections 6.2 and 6.3, and no new catalog table is added. Do you think it is appropriate to keep metadata in .META. for a deleted region? Do we still need a new catalog table?
bq. rather than causing all of the RS to roll the logs, they could simply record the log sequence number of the snapshot, right? This will be a bit faster to do and causes even less of a hiccup in concurrent operations (and I don't think it's any more complicated to implement, is it?)
Yes, sounds good. The log sequence number should also be consulted when the logs are split, because log files would contain data from both before and after the snapshot, right?
bq. Making the client orchestrate the snapshot process seems a little strange - could the client simply initiate it and put the actual snapshot code in the master? I think we should keep the client as thin as we can
Ok, this will change the design a little.
bq. I'd be interested in a section about failure analysis - what happens when the snapshot coordinator fails in the middle? ..
That will be great!

Snapshot of table
-
Key: HBASE-50
URL: https://issues.apache.org/jira/browse/HBASE-50
Project: HBase
Issue Type: New Feature
Reporter: Billy Pearson
Assignee: Li Chongxin
Priority: Minor
Attachments: HBase Snapshot Design Report V2.pdf, HBase Snapshot Design Report V3.pdf, snapshot-src.zip

Having an option to take a snapshot of a table would be very useful in production. What I would like this option to do is merge all the data into one or more files stored in the same folder on the DFS. This way we could save data in case of a software bug in Hadoop or user code. The other advantage would be the ability to export a table to multiple locations. Say I had a read-only table that must be online. I could take a snapshot of it when needed, export it to a separate data center, and have it loaded there; then I would have it online at multiple data centers for load balancing and failover. I understand that Hadoop removes the need for backups to protect against failed servers, but this does not protect us from software bugs that might delete or alter data in ways we did not plan. We should have a way to roll back a dataset.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
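The record-the-sequence-number idea agreed on above can be sketched in a few lines. This is a hypothetical illustration, not HBase code: the names (`SnapshotLogSplit`, `LogEdit`, `editsForSnapshot`) are invented. It only shows how a sequence number recorded at snapshot time would let log splitting separate the edits that belong to the snapshot from those written afterwards, without any log roll.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: instead of rolling logs at snapshot time, each region
// server records the current log sequence number. At log-split time, edits
// are routed by comparing their sequence id against that recorded point.
public class SnapshotLogSplit {

    // A WAL edit reduced to the one field that matters here: its sequence id.
    static final class LogEdit {
        final long seqId;
        LogEdit(long seqId) { this.seqId = seqId; }
    }

    // Edits with seqId <= snapshotSeqId were written before the snapshot
    // point and therefore belong to the snapshot's view of the table.
    static List<LogEdit> editsForSnapshot(List<LogEdit> logFile, long snapshotSeqId) {
        List<LogEdit> result = new ArrayList<>();
        for (LogEdit e : logFile) {
            if (e.seqId <= snapshotSeqId) {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // One log file holds edits from both before and after the snapshot.
        List<LogEdit> log = new ArrayList<>();
        for (long i = 1; i <= 10; i++) log.add(new LogEdit(i));
        long snapshotSeqId = 6; // recorded when the snapshot was taken
        System.out.println(editsForSnapshot(log, snapshotSeqId).size()); // prints 6
    }
}
```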
[jira] Created: (HBASE-2745) Create snapshot of an HBase table
Create snapshot of an HBase table
-
Key: HBASE-2745
URL: https://issues.apache.org/jira/browse/HBASE-2745
Project: HBase
Issue Type: Sub-task
Components: master, regionserver
Reporter: Li Chongxin
Assignee: Li Chongxin

Create snapshot of an HBase table under directory '.snapshot'
[jira] Created: (HBASE-2746) Existing functions of HBase should be modified to maintain snapshot data
Existing functions of HBase should be modified to maintain snapshot data
-
Key: HBASE-2746
URL: https://issues.apache.org/jira/browse/HBASE-2746
Project: HBase
Issue Type: Sub-task
Components: master, regionserver
Reporter: Li Chongxin
Assignee: Li Chongxin

Existing functions of HBase, e.g. compaction, split, table delete, and the meta scanner, should be modified to take snapshot data into account.
[jira] Work started: (HBASE-50) Snapshot of table
[ https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HBASE-50 started by Li Chongxin.

Snapshot of table
-
Key: HBASE-50
URL: https://issues.apache.org/jira/browse/HBASE-50
Project: HBase
Issue Type: New Feature
Reporter: Billy Pearson
Assignee: Li Chongxin
Priority: Minor
Attachments: HBase Snapshot Design Report V2.pdf, HBase Snapshot Design Report V3.pdf, snapshot-src.zip
[jira] Created: (HBASE-2748) Restore snapshot to a new table name other than the original table name
Restore snapshot to a new table name other than the original table name
---
Key: HBASE-2748
URL: https://issues.apache.org/jira/browse/HBASE-2748
Project: HBase
Issue Type: Sub-task
Reporter: Li Chongxin
Assignee: Li Chongxin
Priority: Minor
[jira] Created: (HBASE-2749) Export and Import a snapshot
Export and Import a snapshot
-
Key: HBASE-2749
URL: https://issues.apache.org/jira/browse/HBASE-2749
Project: HBase
Issue Type: Sub-task
Reporter: Li Chongxin
Assignee: Li Chongxin
Priority: Minor
[jira] Created: (HBASE-2750) Add sanity check for system configs in hbase-daemon wrapper
Add sanity check for system configs in hbase-daemon wrapper
---
Key: HBASE-2750
URL: https://issues.apache.org/jira/browse/HBASE-2750
Project: HBase
Issue Type: New Feature
Components: scripts
Affects Versions: 0.21.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Minor

We should add a config variable like MIN_ULIMIT_TO_START in hbase-env.sh. If the daemon script finds the ulimit below this value, it will print a warning and refuse to start. We can default it to 0 so that this doesn't affect non-production clusters, but recommend in the tuning guide that people change it to the expected ulimit. (I've seen it happen all the time where people configure the ulimit on some nodes, add a new node to the cluster, forget to re-tune it on the new one, and then that new node borks the whole cluster when it joins.)
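The proposed check boils down to one comparison. A minimal sketch, assuming the proposed MIN_ULIMIT_TO_START semantics (0 disables the check); the real check would live in the shell wrapper rather than Java, and the class and method names here are invented:

```java
// Hypothetical sketch of the proposed startup sanity check: MIN_ULIMIT_TO_START
// defaults to 0 (disabled); if it is set and the detected ulimit falls below
// it, the daemon wrapper should warn and refuse to start.
public class UlimitSanityCheck {

    // Returns true when startup should be refused.
    static boolean refuseStart(long currentUlimit, long minUlimitToStart) {
        return minUlimitToStart > 0 && currentUlimit < minUlimitToStart;
    }

    public static void main(String[] args) {
        System.out.println(refuseStart(1024, 32768));  // prints true: node under-tuned
        System.out.println(refuseStart(32768, 32768)); // prints false: meets the minimum
        System.out.println(refuseStart(1024, 0));      // prints false: check disabled
    }
}
```

This matches the intent described above: a forgotten node fails fast at startup instead of joining and destabilizing the cluster.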
[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into meta.
[ https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-2743:
-
Attachment: excise_regions.rb
            plug_hole.rb

In testing, the problem turned out to be better addressed with two scripts: one to do the offlining, close, and delete, and another to plug the hole.

Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into meta.
---
Key: HBASE-2743
URL: https://issues.apache.org/jira/browse/HBASE-2743
Project: HBase
Issue Type: Task
Reporter: stack
Attachments: excise_regions.rb, excise_regions.rb, plug_hole.rb

Script to help out our mozilla buddies.
[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into meta.
[ https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-2743:
-
Attachment: (was: excise_regions.rb)
[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into meta.
[ https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-2743:
-
Attachment: (was: plug_hole.rb)
[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into meta.
[ https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-2743:
-
Attachment: excise_regions.rb
            plug_hole.rb
[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into meta.
[ https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-2743:
-
Attachment: (was: excise_regions.rb)
[jira] Commented: (HBASE-50) Snapshot of table
[ https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880276#action_12880276 ] Jonathan Gray commented on HBASE-50:
-
+1 on a feature branch once stuff is ready for commit.
[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into meta.
[ https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-2743:
-
Attachment: (was: excise_regions.rb)
[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into meta.
[ https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] stack updated HBASE-2743:
-
Attachment: (was: plug_hole.rb)
[jira] Commented: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into meta.
[ https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880290#action_12880290 ] stack commented on HBASE-2743:
--
I put the scripts here instead: http://github.com/saintstack/hbase_bin_scripts
The latest versions have better documentation at their heads.
[jira] Created: (HBASE-2751) Consider closing StoreFiles sometimes
Consider closing StoreFiles sometimes
-
Key: HBASE-2751
URL: https://issues.apache.org/jira/browse/HBASE-2751
Project: HBase
Issue Type: Improvement
Reporter: Jean-Daniel Cryans
Priority: Minor
Fix For: 0.21.0

Having a lot of regions per region server could be considered harmless if most of them aren't used, but that's not really true at the moment: we keep all files open all the time (except for rolled HLogs). I'm thinking of 2 solutions:
# Lazily open the store files, or at least close them down after we read the file info. Or we could do this for every file except the most recent one.
# Close files when they're not in use. We need some heuristic to determine the best moment to declare that a file can be closed.
Both solutions go hand in hand, and I think they would be a huge gain, lowering the ulimit and xceivers-related issues.
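The two solutions above can be sketched together as a small state machine per store file. This is a hypothetical illustration, not HBase's API: `LazyStoreFile` and its methods are invented names, standing in for whatever would wrap the real HDFS reader.

```java
// Hypothetical sketch of the two ideas in this issue: acquire a store file's
// reader lazily on first read, and release it once it has sat idle past a
// threshold, freeing ulimit/xceiver resources on lightly used regions.
public class LazyStoreFile {

    private boolean open = false; // whether a file handle is currently held
    private long lastAccessMillis = 0;

    // Lazy open: the handle is only acquired when the file is actually read.
    void read(long nowMillis) {
        open = true; // real code would open the HDFS reader here
        lastAccessMillis = nowMillis;
    }

    // Heuristic close: drop the handle if the file has been idle too long.
    void closeIfIdle(long nowMillis, long idleThresholdMillis) {
        if (open && nowMillis - lastAccessMillis > idleThresholdMillis) {
            open = false; // real code would close the reader here
        }
    }

    boolean isOpen() { return open; }

    public static void main(String[] args) {
        LazyStoreFile f = new LazyStoreFile();
        f.read(0);                      // first read opens the handle
        f.closeIfIdle(30_000, 60_000);  // recently used: stays open
        System.out.println(f.isOpen()); // prints true
        f.closeIfIdle(120_000, 60_000); // idle past threshold: closed
        System.out.println(f.isOpen()); // prints false
    }
}
```

The open/close threshold is exactly the "heuristic" the issue asks for; anything time- or access-count-based would slot into `closeIfIdle`.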
[jira] Commented: (HBASE-2616) TestHRegion.testWritesWhileGetting flaky on trunk
[ https://issues.apache.org/jira/browse/HBASE-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880313#action_12880313 ] Jean-Daniel Cryans commented on HBASE-2616:
---
Looks like it was committed, can we close this?

TestHRegion.testWritesWhileGetting flaky on trunk
-
Key: HBASE-2616
URL: https://issues.apache.org/jira/browse/HBASE-2616
Project: HBase
Issue Type: Bug
Components: regionserver
Reporter: Todd Lipcon
Assignee: ryan rawson
Priority: Critical
Fix For: 0.20.5
Attachments: HBASE-2616.patch

Saw this failure on my internal hudson:
junit.framework.AssertionFailedError: expected:\x00\x00\x00\x96 but was:\x00\x00\x01\x00
at org.apache.hadoop.hbase.HBaseTestCase.assertEquals(HBaseTestCase.java:684)
at org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting(TestHRegion.java:2334)
[jira] Resolved: (HBASE-2683) Make it obvious in the documentation that ZooKeeper needs permanent storage
[ https://issues.apache.org/jira/browse/HBASE-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Daniel Cryans resolved HBASE-2683.
---
Assignee: Jean-Daniel Cryans
Fix Version/s: 0.20.5 (was: 0.20.6)
Resolution: Fixed

Committed a small paragraph to branch and trunk.

Make it obvious in the documentation that ZooKeeper needs permanent storage
---
Key: HBASE-2683
URL: https://issues.apache.org/jira/browse/HBASE-2683
Project: HBase
Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
Fix For: 0.20.5, 0.21.0

If our users let HBase manage ZK, they probably won't bother combing through hbase-default.xml to figure out that they need to set hbase.zookeeper.property.dataDir to something other than /tmp. It probably happened to deinspanjer in prod today, and that's a show stopper. The fix would be, at least, to improve the Getting Started documentation to include that configuration in the Fully-Distributed Operation section.
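For reference, the setting the issue is about is the real property `hbase.zookeeper.property.dataDir`, set in hbase-site.xml; the path shown here is only an example of a permanent location:

```xml
<!-- hbase-site.xml: keep ZooKeeper's data on permanent storage, not /tmp -->
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/var/hbase/zookeeper</value>
</property>
```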
[jira] Commented: (HBASE-2741) NPE in ServerManager when a region is closing
[ https://issues.apache.org/jira/browse/HBASE-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880361#action_12880361 ] Jean-Daniel Cryans commented on HBASE-2741:
---
Debugging this with Karthik's help, we found out that the new HBaseExecutorService wasn't multi-cluster friendly because it was named "master", instead of using something less static like host:port. As a matter of fact, in my log I can also see:
{code}
2010-06-18 15:35:08,205 DEBUG [main] executor.HBaseExecutorService$HBaseExecutorServiceType(88): Executor service MASTER_CLOSEREGION already running on master
{code}
This was in fact detecting the other master's service.

NPE in ServerManager when a region is closing
-
Key: HBASE-2741
URL: https://issues.apache.org/jira/browse/HBASE-2741
Project: HBase
Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Karthik Ranganathan
Fix For: 0.21.0

While running TestReplication I bumped into:
{code}
2010-06-16 16:44:07,576 DEBUG [IPC Server handler 3 on 62423] master.RegionManager(357): Created UNASSIGNED zNode test,,1276731846828.de5dcd3df0fbc58207ce6ccff9ff2870. in state M2ZK_REGION_OFFLINE
2010-06-16 16:44:07,577 INFO [RegionServer:0] regionserver.HRegionServer(511): MSG_REGION_OPEN: test,,1276731846828.de5dcd3df0fbc58207ce6ccff9ff2870.
2010-06-16 16:44:07,577 INFO [RegionServer:0.worker] regionserver.HRegionServer$Worker(1358): Worker: MSG_REGION_OPEN: test,,1276731846828.de5dcd3df0fbc58207ce6ccff9ff2870.
2010-06-16 16:44:07,578 DEBUG [RegionServer:0.worker] regionserver.RSZookeeperUpdater(157): Updating ZNode /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870 with [RS2ZK_REGION_OPENING] expected version = 0
2010-06-16 16:44:07,580 DEBUG [main-EventThread] master.HMaster(1142): Event NodeDataChanged with state SyncConnected with path /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
2010-06-16 16:44:07,580 DEBUG [main-EventThread] master.ZKMasterAddressWatcher(64): Got event NodeDataChanged with path /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
2010-06-16 16:44:07,580 DEBUG [main-EventThread] master.ZKUnassignedWatcher(71): ZK-EVENT-PROCESS: Got zkEvent NodeDataChanged state:SyncConnected path:/1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
2010-06-16 16:44:07,580 INFO [main-EventThread] regionserver.HRegionServer(379): Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
2010-06-16 16:44:07,581 DEBUG [RegionServer:0.worker] regionserver.HRegion(294): Creating region test,,1276731846828.de5dcd3df0fbc58207ce6ccff9ff2870.
2010-06-16 16:44:07,582 DEBUG [MASTER_CLOSEREGION-master-1] handler.MasterOpenRegionHandler(70): Event = RS2ZK_REGION_OPENING, region = de5dcd3df0fbc58207ce6ccff9ff2870
2010-06-16 16:44:07,582 DEBUG [MASTER_CLOSEREGION-master-1] handler.MasterOpenRegionHandler(81): NO-OP call to handling region opening event
2010-06-16 16:44:07,589 INFO [RegionServer:0.worker] regionserver.HRegion(369): region test,,1276731846828.de5dcd3df0fbc58207ce6ccff9ff2870. available; sequence id is 1
2010-06-16 16:44:07,590 DEBUG [RegionServer:0.worker] regionserver.RSZookeeperUpdater(157): Updating ZNode /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870 with [RS2ZK_REGION_OPENED] expected version = 1
2010-06-16 16:44:07,591 DEBUG [main-EventThread] master.HMaster(1142): Event NodeDataChanged with state SyncConnected with path /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
2010-06-16 16:44:07,591 DEBUG [main-EventThread] master.ZKMasterAddressWatcher(64): Got event NodeDataChanged with path /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
2010-06-16 16:44:07,592 DEBUG [main-EventThread] master.ZKUnassignedWatcher(71): ZK-EVENT-PROCESS: Got zkEvent NodeDataChanged state:SyncConnected path:/1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
2010-06-16 16:44:07,591 INFO [main-EventThread] regionserver.HRegionServer(379): Got ZooKeeper event, state: SyncConnected, type: NodeDataChanged, path: /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
2010-06-16 16:44:07,593 DEBUG [MASTER_CLOSEREGION-master-1] handler.MasterOpenRegionHandler(70): Event = RS2ZK_REGION_OPENED, region = de5dcd3df0fbc58207ce6ccff9ff2870
2010-06-16 16:44:07,594 DEBUG [MASTER_CLOSEREGION-master-1] handler.MasterOpenRegionHandler(96): RS 10.10.1.130,62425,1276731832950 has opened region de5dcd3df0fbc58207ce6ccff9ff2870
2010-06-16 16:44:07,594 ERROR [MASTER_CLOSEREGION-master-1] server.NIOServerCnxn$Factory$1(81): Thread Thread[MASTER_CLOSEREGION-master-1,5,main] died
java.lang.NullPointerException
at org.apache.hadoop.hbase.master.ServerManager.processRegionOpen(ServerManager.java:607) at
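The "already running on master" symptom above is a keying problem, and can be reproduced in miniature. This is a hypothetical sketch, not the HBaseExecutorService code: it only shows why a static name like "master" collides when two masters share a JVM, while a host:port-qualified name keeps each master's services distinct.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the naming bug: executor services keyed by a static
// name collide across two masters (e.g. two mini-clusters in one JVM);
// keying by host:port keeps them distinct.
public class ExecutorNaming {
    static final Map<String, Object> services = new HashMap<>();

    // Returns false if a service under this name already exists --
    // the spurious "already running" detection seen in the log.
    static boolean startService(String name) {
        if (services.containsKey(name)) {
            return false;
        }
        services.put(name, new Object());
        return true;
    }

    public static void main(String[] args) {
        // Static name: the second master trips over the first one's service.
        System.out.println(startService("MASTER_CLOSEREGION-master")); // prints true
        System.out.println(startService("MASTER_CLOSEREGION-master")); // prints false
        // host:port-qualified names stay distinct per master.
        System.out.println(startService("MASTER_CLOSEREGION-10.10.1.130:60000")); // prints true
        System.out.println(startService("MASTER_CLOSEREGION-10.10.1.131:60000")); // prints true
    }
}
```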
[jira] Updated: (HBASE-2737) CME in ZKW introduced in HBASE-2694
[ https://issues.apache.org/jira/browse/HBASE-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Ranganathan updated HBASE-2737:
---
Attachment: HBASE-2737-0.21.patch

Making the register and unregister methods synchronized. Unit tests are passing. This change is so simple I am not putting it up on review board.

CME in ZKW introduced in HBASE-2694
---
Key: HBASE-2737
URL: https://issues.apache.org/jira/browse/HBASE-2737
Project: HBase
Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Karthik Ranganathan
Fix For: 0.21.0
Attachments: HBASE-2737-0.21.patch

Saw this while tail'ing a log for something else:
{code}
2010-06-15 17:30:03,769 ERROR [main-EventThread] zookeeper.ClientCnxn$EventThread(490): Error while calling watcher
java.util.ConcurrentModificationException
at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
at java.util.AbstractList$Itr.next(AbstractList.java:343)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.process(ZooKeeperWrapper.java:235)
{code}
Looks like the listeners list's iterator is used in an unprotected manner.
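The failure mode in that stack trace can be reproduced in isolation. A minimal illustration, not the actual ZKW code: modifying a plain ArrayList while iterating it throws ConcurrentModificationException, whereas the patch's approach of serializing register/unregister (shown here via a CopyOnWriteArrayList, whose iterators work over a snapshot) avoids it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ConcurrentModificationException;
import java.util.concurrent.CopyOnWriteArrayList;

// Minimal reproduction: a listener being removed while the listener list is
// being iterated. ArrayList's fail-fast iterator throws CME; a concurrent
// list (or synchronizing register/unregister, as the patch does) avoids it.
public class ListenerListCme {

    // Returns true if iteration completed without a CME.
    static boolean iterateWhileModifying(List<String> listeners) {
        listeners.add("a");
        listeners.add("b");
        listeners.add("c");
        try {
            for (String l : listeners) {
                listeners.remove(l); // simulates an unregister mid-iteration
            }
            return true;
        } catch (ConcurrentModificationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(iterateWhileModifying(new ArrayList<>()));            // prints false
        System.out.println(iterateWhileModifying(new CopyOnWriteArrayList<>())); // prints true
    }
}
```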
[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files
[ https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880390#action_12880390 ] Jean-Daniel Cryans commented on HBASE-2752:
---
I like it. Some comments:
- requeueCount in FQE could be a boolean, that's how it's used.
- isMaximumWait isn't documented.
With that fixed and some cluster load testing, I'm +1 for commit.

Don't retry forever when waiting on too many store files
-
Key: HBASE-2752
URL: https://issues.apache.org/jira/browse/HBASE-2752
Project: HBase
Issue Type: Improvement
Reporter: Jean-Daniel Cryans
Assignee: stack
Priority: Critical
Fix For: 0.20.5, 0.21.0
Attachments: 2752.txt

HBASE-2087 introduced a way to not block all flushes when one region has too many store files. Unfortunately, that undid the behavior that if we waited for longer than 90 secs we would still flush the region... which means that when a region blocks inserts because its memstore is too big, it's actually holding off writes for a very long time, occupying handlers, etc. We need to add more smarts to MemStoreFlusher so that we detect when a region was held up for too long.
[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files
[ https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880422#action_12880422 ] Dave Latham commented on HBASE-2752:
-
Thanks for the quick work. It's really appreciated. I'll try to get this patch tested on a cluster. Minor nits:
* The log on "Cache flush failed" should use toStringBinary for the region name.
* blockingWaitTime / 100 seems somewhat arbitrary for the check interval, but probably fine for now.
[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files
[ https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880426#action_12880426 ] stack commented on HBASE-2752:
--
Thanks J-D for the review. I added in your first suggestion. For the second, I kept the count. I think it'll be of use when we have a jsp page that dumps the current state of the flush queue. I've been running it up on a cluster. I see some of these during a big upload:
{code}
2010-06-18 18:02:17,864 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Waited 90495ms on a compaction to clean up 'too many store files'; waited long enough... proceeding with flush
{code}
...so it looks like we got the 0.20.3 behavior back, where we'll go ahead and flush regardless if we've waited N ms (I left the interval at the 0.20.3 90 seconds, which seems a bit long but...). I'm going to commit and roll an RC.
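The shape of the fix, as the comments describe it, is a polled wait with a hard cap. A hypothetical sketch in abstract ticks rather than the real 90-second timer, with invented names (`BoundedFlushWait`, `waitBeforeFlush`): poll the "too many store files" condition at small intervals, but once the cap is hit, flush anyway, matching the "waited long enough... proceeding with flush" log line above.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch of the bounded wait: instead of requeueing forever
// while a region has too many store files, give up after a maximum wait and
// proceed with the flush regardless. Units are abstract ticks, not ms.
public class BoundedFlushWait {

    // Returns how many ticks were spent waiting before the flush proceeds.
    static int waitBeforeFlush(int maxWaitTicks, IntPredicate tooManyStoreFiles) {
        int waited = 0;
        while (tooManyStoreFiles.test(waited) && waited < maxWaitTicks) {
            waited++; // real code sleeps a small check interval per iteration
        }
        return waited;
    }

    public static void main(String[] args) {
        // Compaction clears the condition at tick 30: proceed right away.
        System.out.println(waitBeforeFlush(90, t -> t < 30)); // prints 30
        // Compaction never clears it: give up at the cap and flush anyway.
        System.out.println(waitBeforeFlush(90, t -> true));   // prints 90
    }
}
```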
[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files
[ https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880427#action_12880427 ] stack commented on HBASE-2752:
--
Applied to branch and trunk. (Let's talk, jgray.)