[jira] Commented: (HBASE-50) Snapshot of table

2010-06-18 Thread Li Chongxin (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880093#action_12880093
 ] 

Li Chongxin commented on HBASE-50:
--

bq. Fail with a warning. A nice-to-have would be your suggestion of restoring 
snapshot into a table named something other than the original table's name 
(Fixing this issue is low-priority IMO).
bq. .. it's a good idea to allow snapshot restore to a new table name while the 
original table is still online. And the restored snapshot should be able to 
share HFiles with the original table

I will make this issue a low-priority sub-task. One more question: besides the 
metadata and log files, what other data needs to be taken care of to restore the 
snapshot under a new table name? Are there any other files (e.g. HFiles) that 
contain the table name?

bq. ... didn't we discuss that .META. might not be the place to keep snapshot 
data since regions are deleted when the system is done w/ them (but a snapshot 
may outlive a particular region).

I misunderstood... I thought you were talking about creating a new catalog table 
'snapshot' to keep the metadata of snapshots, such as creation time.
In the current design, a region will not be deleted if it is still used by a 
snapshot, even if the system is done with it. Such a region would probably be 
marked as 'deleted' in .META. This is discussed in sections 6.2 and 6.3, and no 
new catalog table is added. Do you think it is appropriate to keep metadata in 
.META. for a deleted region? Do we still need a new catalog table?
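
A minimal, purely illustrative sketch of that bookkeeping (this is not the 
design in the report, and every name here is made up): a region the system is 
done with is only marked deleted, and its files are cleaned up once the last 
snapshot referencing it is gone.

{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical reference counting between snapshots and regions.
public class RegionSnapshotRefs {
  private final Map<String, Set<String>> refs = new HashMap<String, Set<String>>();
  private final Set<String> markedDeleted = new HashSet<String>();

  public synchronized void addReference(String regionName, String snapshotName) {
    Set<String> s = refs.get(regionName);
    if (s == null) {
      s = new HashSet<String>();
      refs.put(regionName, s);
    }
    s.add(snapshotName);
  }

  // Called when the system is done with the region (e.g. after a split).
  // Returns true only if the region's files can really be removed now.
  public synchronized boolean markDeleted(String regionName) {
    markedDeleted.add(regionName);
    Set<String> s = refs.get(regionName);
    return s == null || s.isEmpty();
  }

  // Called when a snapshot is deleted; returns regions that can be cleaned up.
  public synchronized Set<String> releaseSnapshot(String snapshotName) {
    Set<String> cleanable = new HashSet<String>();
    for (Map.Entry<String, Set<String>> e : refs.entrySet()) {
      e.getValue().remove(snapshotName);
      if (e.getValue().isEmpty() && markedDeleted.contains(e.getKey())) {
        cleanable.add(e.getKey());
      }
    }
    return cleanable;
  }
}
{code}

Whether that state lives in .META. or in a separate catalog table is exactly 
the open question.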

bq. rather than causing all of the RS to roll the logs, they could simply 
record the log sequence number of the snapshot, right? This will be a bit 
faster to do and causes even less of a hiccup in concurrent operations (and I 
don't think it's any more complicated to implement, is it?)

Yes, sounds good. The log sequence number should also be taken into account when 
the logs are split, because the log files would contain data from both before 
and after the snapshot, right?
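
To make that concrete, here is a minimal sketch (class and method names are 
hypothetical, not the actual HBase API): each region server records its current 
log sequence number when the snapshot is taken, and log splitting later uses 
that number to keep only the edits written up to the snapshot point.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SnapshotLogPoint {
  // snapshot name -> sequence id recorded on this region server at snapshot time
  private final Map<String, Long> snapshotSeqIds = new ConcurrentHashMap<String, Long>();

  public void recordSnapshot(String snapshotName, long currentLogSeqId) {
    snapshotSeqIds.put(snapshotName, currentLogSeqId);
  }

  // A (sequence id, edit) pair as it would appear in a log file being split.
  public static class LogEntry {
    final long seqId;
    final byte[] edit;
    public LogEntry(long seqId, byte[] edit) { this.seqId = seqId; this.edit = edit; }
  }

  // During log splitting, keep only the edits written up to the snapshot point.
  public List<LogEntry> entriesForSnapshot(String snapshotName, List<LogEntry> logEntries) {
    Long cutoff = snapshotSeqIds.get(snapshotName);
    List<LogEntry> result = new ArrayList<LogEntry>();
    if (cutoff == null) {
      return result; // no such snapshot recorded on this server
    }
    for (LogEntry e : logEntries) {
      if (e.seqId <= cutoff.longValue()) {
        result.add(e);
      }
    }
    return result;
  }
}
{code}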

bq. Making the client orchestrate the snapshot process seems a little strange - 
could the client simply initiate it and put the actual snapshot code in the 
master? I think we should keep the client as thin as we can

OK, this will change the design a little.
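
As a rough sketch of that division of labor (the interface and method names 
below are hypothetical, not an existing API): the client issues a single 
request, and the master owns the actual coordination of region servers, logs 
and metadata.

{code}
public class ThinClientSnapshot {

  // What the master would expose, e.g. over RPC.
  public interface SnapshotMaster {
    void snapshot(String tableName, String snapshotName) throws Exception;
  }

  // The client side stays thin: one call, no orchestration logic here.
  public static void takeSnapshot(SnapshotMaster master, String table, String name)
      throws Exception {
    master.snapshot(table, name);
  }
}
{code}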

bq. I'd be interested in a section about failure analysis - what happens when 
the snapshot coordinator fails in the middle? ..

That will be great!

 Snapshot of table
 -

 Key: HBASE-50
 URL: https://issues.apache.org/jira/browse/HBASE-50
 Project: HBase
  Issue Type: New Feature
Reporter: Billy Pearson
Assignee: Li Chongxin
Priority: Minor
 Attachments: HBase Snapshot Design Report V2.pdf, HBase Snapshot 
 Design Report V3.pdf, snapshot-src.zip


 Having an option to take a snapshot of a table would be very useful in 
 production.
 What I would like this option to do is merge all the data into one or more 
 files stored in the same folder on the DFS. This way we could save data in 
 case of a software bug in Hadoop or user code.
 The other advantage would be the ability to export a table to multiple 
 locations. Say I had a read-only table that must be online. I could take a 
 snapshot of it when needed, export it to a separate data center, and have it 
 loaded there; then I would have it online at multiple data centers for load 
 balancing and failover.
 I understand that Hadoop removes the need for backups to protect against 
 failed servers, but this does not protect us from software bugs that might 
 delete or alter data in ways we did not plan. We should have a way to roll 
 back a dataset.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-2745) Create snapshot of an HBase table

2010-06-18 Thread Li Chongxin (JIRA)
Create snapshot of an HBase table
-

 Key: HBASE-2745
 URL: https://issues.apache.org/jira/browse/HBASE-2745
 Project: HBase
  Issue Type: Sub-task
  Components: master, regionserver
Reporter: Li Chongxin
Assignee: Li Chongxin


Create snapshot of an HBase table under directory '.snapshot'

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-2746) Existing functions of HBase should be modified to maintain snapshot data

2010-06-18 Thread Li Chongxin (JIRA)
Existing functions of HBase should be modified to maintain snapshot data
-

 Key: HBASE-2746
 URL: https://issues.apache.org/jira/browse/HBASE-2746
 Project: HBase
  Issue Type: Sub-task
  Components: master, regionserver
Reporter: Li Chongxin
Assignee: Li Chongxin


Existing functions of HBase, e.g. compaction, split, table delete, and the meta 
scanner, should be modified to take snapshot data into consideration.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Work started: (HBASE-50) Snapshot of table

2010-06-18 Thread Li Chongxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HBASE-50 started by Li Chongxin.

 Snapshot of table
 -

 Key: HBASE-50
 URL: https://issues.apache.org/jira/browse/HBASE-50
 Project: HBase
  Issue Type: New Feature
Reporter: Billy Pearson
Assignee: Li Chongxin
Priority: Minor
 Attachments: HBase Snapshot Design Report V2.pdf, HBase Snapshot 
 Design Report V3.pdf, snapshot-src.zip


 Having an option to take a snapshot of a table would be very useful in 
 production.
 What I would like this option to do is merge all the data into one or more 
 files stored in the same folder on the DFS. This way we could save data in 
 case of a software bug in Hadoop or user code.
 The other advantage would be the ability to export a table to multiple 
 locations. Say I had a read-only table that must be online. I could take a 
 snapshot of it when needed, export it to a separate data center, and have it 
 loaded there; then I would have it online at multiple data centers for load 
 balancing and failover.
 I understand that Hadoop removes the need for backups to protect against 
 failed servers, but this does not protect us from software bugs that might 
 delete or alter data in ways we did not plan. We should have a way to roll 
 back a dataset.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-2748) Restore snapshot to a new table name other than the original table name

2010-06-18 Thread Li Chongxin (JIRA)
Restore snapshot to a new table name other than the original table name
---

 Key: HBASE-2748
 URL: https://issues.apache.org/jira/browse/HBASE-2748
 Project: HBase
  Issue Type: Sub-task
Reporter: Li Chongxin
Assignee: Li Chongxin
Priority: Minor




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-2749) Export and Import a snapshot

2010-06-18 Thread Li Chongxin (JIRA)
Export and Import a snapshot


 Key: HBASE-2749
 URL: https://issues.apache.org/jira/browse/HBASE-2749
 Project: HBase
  Issue Type: Sub-task
Reporter: Li Chongxin
Assignee: Li Chongxin
Priority: Minor




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-2750) Add sanity check for system configs in hbase-daemon wrapper

2010-06-18 Thread Todd Lipcon (JIRA)
Add sanity check for system configs in hbase-daemon wrapper
---

 Key: HBASE-2750
 URL: https://issues.apache.org/jira/browse/HBASE-2750
 Project: HBase
  Issue Type: New Feature
  Components: scripts
Affects Versions: 0.21.0
Reporter: Todd Lipcon
Assignee: Todd Lipcon
Priority: Minor


We should add a config variable like MIN_ULIMIT_TO_START in hbase-env.sh. If 
the daemon script finds the ulimit below this value, it will print a warning and 
refuse to start. We can set the default to 0 so that this doesn't affect 
non-production clusters, but in the tuning guide recommend that people change 
it to the expected ulimit.

(I've seen it happen all the time where people configure ulimit on some nodes, 
add a new node to the cluster, forget to re-tune it on the new one, and then 
that new one borks the whole cluster when it joins.)
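
The real check would live in the hbase-daemon wrapper script; the sketch below 
just spells out the same logic (MIN_ULIMIT_TO_START is the name proposed above; 
everything else is made up, and the system property stands in for the 
hbase-env.sh variable).

{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class UlimitSanityCheck {

  // Returns the current open-file limit, or -1 if it cannot be determined
  // (e.g. "unlimited" or a non-POSIX platform).
  static long currentOpenFileLimit() throws Exception {
    Process p = new ProcessBuilder("bash", "-c", "ulimit -n").start();
    BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
    String line = r.readLine();
    p.waitFor();
    if (line == null || line.trim().equalsIgnoreCase("unlimited")) {
      return -1;
    }
    try {
      return Long.parseLong(line.trim());
    } catch (NumberFormatException e) {
      return -1;
    }
  }

  public static void main(String[] args) throws Exception {
    long minUlimitToStart = Long.getLong("MIN_ULIMIT_TO_START", 0L); // default 0: check disabled
    long current = currentOpenFileLimit();
    if (minUlimitToStart > 0 && current >= 0 && current < minUlimitToStart) {
      System.err.println("ulimit -n is " + current + " but MIN_ULIMIT_TO_START is "
          + minUlimitToStart + "; refusing to start.");
      System.exit(1);
    }
  }
}
{code}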

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into .META.

2010-06-18 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-2743:
-

Attachment: excise_regions.rb
plug_hole.rb

Testing showed the problem is better addressed with two scripts... one to do the 
offlining, close, and delete, and another to plug the hole.

 Script to drop N regions from a table and then patch the hole by inserting a 
 new hole-spanning region into .META.
 ---

 Key: HBASE-2743
 URL: https://issues.apache.org/jira/browse/HBASE-2743
 Project: HBase
  Issue Type: Task
Reporter: stack
 Attachments: excise_regions.rb, excise_regions.rb, plug_hole.rb


 Script to help out our mozilla buddies.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into .META.

2010-06-18 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-2743:
-

Attachment: (was: excise_regions.rb)

 Script to drop N regions from a table and then patch the hole by inserting a 
 new hole-spanning region into .META.
 ---

 Key: HBASE-2743
 URL: https://issues.apache.org/jira/browse/HBASE-2743
 Project: HBase
  Issue Type: Task
Reporter: stack
 Attachments: excise_regions.rb, plug_hole.rb


 Script to help out our mozilla buddies.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into .META.

2010-06-18 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-2743:
-

Attachment: (was: plug_hole.rb)

 Script to drop N regions from a table and then patch the hole by inserting a 
 new hole-spanning region into .META.
 ---

 Key: HBASE-2743
 URL: https://issues.apache.org/jira/browse/HBASE-2743
 Project: HBase
  Issue Type: Task
Reporter: stack
 Attachments: excise_regions.rb, excise_regions.rb, plug_hole.rb


 Script to help out our mozilla buddies.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into .META.

2010-06-18 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-2743:
-

Attachment: excise_regions.rb
plug_hole.rb

 Script to drop N regions from a table and then patch the hole by inserting a 
 new hole-spanning region into .META.
 ---

 Key: HBASE-2743
 URL: https://issues.apache.org/jira/browse/HBASE-2743
 Project: HBase
  Issue Type: Task
Reporter: stack
 Attachments: excise_regions.rb, excise_regions.rb, plug_hole.rb


 Script to help out our mozilla buddies.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into .META.

2010-06-18 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-2743:
-

Attachment: (was: excise_regions.rb)

 Script to drop N regions from a table and then patch the hole by inserting a 
 new hole-spanning region into .META.
 ---

 Key: HBASE-2743
 URL: https://issues.apache.org/jira/browse/HBASE-2743
 Project: HBase
  Issue Type: Task
Reporter: stack
 Attachments: excise_regions.rb, plug_hole.rb


 Script to help out our mozilla buddies.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-50) Snapshot of table

2010-06-18 Thread Jonathan Gray (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880276#action_12880276
 ] 

Jonathan Gray commented on HBASE-50:


+1 on feature branch once stuff is ready for commit

 Snapshot of table
 -

 Key: HBASE-50
 URL: https://issues.apache.org/jira/browse/HBASE-50
 Project: HBase
  Issue Type: New Feature
Reporter: Billy Pearson
Assignee: Li Chongxin
Priority: Minor
 Attachments: HBase Snapshot Design Report V2.pdf, HBase Snapshot 
 Design Report V3.pdf, snapshot-src.zip


 Having an option to take a snapshot of a table would be very useful in 
 production.
 What I would like this option to do is merge all the data into one or more 
 files stored in the same folder on the DFS. This way we could save data in 
 case of a software bug in Hadoop or user code.
 The other advantage would be the ability to export a table to multiple 
 locations. Say I had a read-only table that must be online. I could take a 
 snapshot of it when needed, export it to a separate data center, and have it 
 loaded there; then I would have it online at multiple data centers for load 
 balancing and failover.
 I understand that Hadoop removes the need for backups to protect against 
 failed servers, but this does not protect us from software bugs that might 
 delete or alter data in ways we did not plan. We should have a way to roll 
 back a dataset.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into .META.

2010-06-18 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-2743:
-

Attachment: (was: excise_regions.rb)

 Script to drop N regions from a table and then patch the hole by inserting a 
 new hole-spanning region into .META.
 ---

 Key: HBASE-2743
 URL: https://issues.apache.org/jira/browse/HBASE-2743
 Project: HBase
  Issue Type: Task
Reporter: stack
 Attachments: plug_hole.rb


 Script to help out our mozilla buddies.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into .META.

2010-06-18 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-2743:
-

Attachment: (was: plug_hole.rb)

 Script to drop N regions from a table and then patch the hole by inserting a 
 new hole-spanning region into .META.
 ---

 Key: HBASE-2743
 URL: https://issues.apache.org/jira/browse/HBASE-2743
 Project: HBase
  Issue Type: Task
Reporter: stack

 Script to help out our mozilla buddies.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2743) Script to drop N regions from a table and then patch the hole by inserting a new hole-spanning region into .META.

2010-06-18 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880290#action_12880290
 ] 

stack commented on HBASE-2743:
--

I put the scripts here instead: http://github.com/saintstack/hbase_bin_scripts
The latest versions have better documentation at the top of each script.

 Script to drop N regions from a table and then patch the hole by inserting a 
 new hole-spanning region into .META.
 ---

 Key: HBASE-2743
 URL: https://issues.apache.org/jira/browse/HBASE-2743
 Project: HBase
  Issue Type: Task
Reporter: stack

 Script to help out our mozilla buddies.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-2751) Consider closing StoreFiles sometimes

2010-06-18 Thread Jean-Daniel Cryans (JIRA)
Consider closing StoreFiles sometimes
-

 Key: HBASE-2751
 URL: https://issues.apache.org/jira/browse/HBASE-2751
 Project: HBase
  Issue Type: Improvement
Reporter: Jean-Daniel Cryans
Priority: Minor
 Fix For: 0.21.0


Having a lot of regions per region server could be considered harmless if most 
of them aren't used, but that's not really true at the moment: we keep all 
files open all the time (except for rolled HLogs). I'm thinking of two 
solutions:

 # Lazy-open the store files, or at least close them down after we read the 
file info. Or we could do this for every file except the most recent one.
 # Close files when they're not in use. We need some heuristic to determine 
the best moment to declare that a file can be closed.

Both solutions go hand in hand, and I think they would be a huge gain toward 
lowering ulimit and xceivers-related issues.
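
A minimal, purely illustrative sketch of both ideas together (this is not the 
StoreFile API; all names are made up): open the underlying file only on first 
read, track the last access time, and let a background chore close readers 
that have been idle too long.

{code}
import java.io.IOException;
import java.io.RandomAccessFile;

public class LazyStoreFile {
  private final String path;
  private RandomAccessFile reader;      // stand-in for an HFile reader
  private volatile long lastAccessMillis;

  public LazyStoreFile(String path) {
    this.path = path;                   // nothing is opened yet (lazy open)
  }

  public synchronized RandomAccessFile getReader() throws IOException {
    if (reader == null) {
      reader = new RandomAccessFile(path, "r");
    }
    lastAccessMillis = System.currentTimeMillis();
    return reader;
  }

  // Heuristic close: a background chore could call this periodically and
  // release file descriptors for files that have not been read recently.
  public synchronized boolean closeIfIdle(long idleThresholdMillis) throws IOException {
    if (reader != null
        && System.currentTimeMillis() - lastAccessMillis > idleThresholdMillis) {
      reader.close();
      reader = null;
      return true;
    }
    return false;
  }
}
{code}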

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2616) TestHRegion.testWritesWhileGetting flaky on trunk

2010-06-18 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880313#action_12880313
 ] 

Jean-Daniel Cryans commented on HBASE-2616:
---

Looks like it was committed, can we close this?

 TestHRegion.testWritesWhileGetting flaky on trunk
 -

 Key: HBASE-2616
 URL: https://issues.apache.org/jira/browse/HBASE-2616
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: Todd Lipcon
Assignee: ryan rawson
Priority: Critical
 Fix For: 0.20.5

 Attachments: HBASE-2616.patch


 Saw this failure on my internal hudson:
 junit.framework.AssertionFailedError: expected:\x00\x00\x00\x96 but 
 was:\x00\x00\x01\x00
   at 
 org.apache.hadoop.hbase.HBaseTestCase.assertEquals(HBaseTestCase.java:684)
   at 
 org.apache.hadoop.hbase.regionserver.TestHRegion.testWritesWhileGetting(TestHRegion.java:2334)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HBASE-2683) Make it obvious in the documentation that ZooKeeper needs permanent storage

2010-06-18 Thread Jean-Daniel Cryans (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jean-Daniel Cryans resolved HBASE-2683.
---

 Assignee: Jean-Daniel Cryans
Fix Version/s: 0.20.5
   (was: 0.20.6)
   Resolution: Fixed

Committed a small paragraph to branch and trunk.

 Make it obvious in the documentation that ZooKeeper needs permanent storage
 ---

 Key: HBASE-2683
 URL: https://issues.apache.org/jira/browse/HBASE-2683
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Jean-Daniel Cryans
 Fix For: 0.20.5, 0.21.0


 If our users let HBase manage ZK, they probably won't bother combing through 
 hbase-default.xml to figure out that they need to set 
 hbase.zookeeper.property.dataDir to something other than /tmp. It probably 
 happened to deinspanjer in prod today, and that's a show stopper.
 The fix would be, at least, to improve the Getting Started documentation to 
 include that configuration in the Fully-Distributed Operation section.
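
A hypothetical complement to the documentation fix would be a startup warning 
along these lines (the property name is the real one discussed here; the check 
itself is only a sketch, not existing HBase code):

{code}
import org.apache.hadoop.conf.Configuration;

public class ZkDataDirCheck {
  public static void warnIfVolatile(Configuration conf) {
    String dataDir = conf.get("hbase.zookeeper.property.dataDir");
    if (dataDir == null || dataDir.startsWith("/tmp")) {
      System.err.println("WARNING: hbase.zookeeper.property.dataDir is "
          + dataDir + "; ZooKeeper data will not survive a reboot. "
          + "Point it at permanent storage for fully-distributed operation.");
    }
  }
}
{code}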

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2741) NPE in ServerManager when a region is closing

2010-06-18 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880361#action_12880361
 ] 

Jean-Daniel Cryans commented on HBASE-2741:
---

Debugging this with Karthik's help, we found out that the new 
HBaseExecutorService wasn't multi-cluster friendly because it was named 
"master" instead of using something less static like host:port. As a matter of 
fact, in my log I can also see:

{code}
2010-06-18 15:35:08,205 DEBUG [main] 
executor.HBaseExecutorService$HBaseExecutorServiceType(88): Executor service 
MASTER_CLOSEREGION already running on master
{code}

This was in fact detecting the other master's service.
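
A trivial sketch of the naming fix being discussed (these helpers are made up, 
not the actual HBaseExecutorService code): derive the service name from the 
server's host:port instead of the static string "master", so two masters in the 
same JVM or test do not collide.

{code}
public class ExecutorServiceNames {

  // Before: every master used the same suffix, e.g. "MASTER_CLOSEREGION-master".
  static String staticName(String serviceType) {
    return serviceType + "-master";
  }

  // After: include something unique per server, e.g. host:port.
  static String perServerName(String serviceType, String hostname, int port) {
    return serviceType + "-" + hostname + ":" + port;
  }

  public static void main(String[] args) {
    System.out.println(staticName("MASTER_CLOSEREGION"));
    System.out.println(perServerName("MASTER_CLOSEREGION", "10.10.1.130", 60000));
  }
}
{code}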

 NPE in ServerManager when a region is closing
 -

 Key: HBASE-2741
 URL: https://issues.apache.org/jira/browse/HBASE-2741
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Karthik Ranganathan
 Fix For: 0.21.0


 While running TestReplication I bumped into:
 {code}
 2010-06-16 16:44:07,576 DEBUG [IPC Server handler 3 on 62423] 
 master.RegionManager(357): Created UNASSIGNED zNode 
 test,,1276731846828.de5dcd3df0fbc58207ce6ccff9ff2870. in state 
 M2ZK_REGION_OFFLINE
 2010-06-16 16:44:07,577 INFO  [RegionServer:0] 
 regionserver.HRegionServer(511): MSG_REGION_OPEN: 
 test,,1276731846828.de5dcd3df0fbc58207ce6ccff9ff2870.
 2010-06-16 16:44:07,577 INFO  [RegionServer:0.worker] 
 regionserver.HRegionServer$Worker(1358): Worker: MSG_REGION_OPEN: 
 test,,1276731846828.de5dcd3df0fbc58207ce6ccff9ff2870.
 2010-06-16 16:44:07,578 DEBUG [RegionServer:0.worker] 
 regionserver.RSZookeeperUpdater(157): Updating ZNode 
 /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870 with [RS2ZK_REGION_OPENING] 
 expected version = 0
 2010-06-16 16:44:07,580 DEBUG [main-EventThread] master.HMaster(1142): Event 
 NodeDataChanged with state SyncConnected with path 
 /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
 2010-06-16 16:44:07,580 DEBUG [main-EventThread] 
 master.ZKMasterAddressWatcher(64): Got event NodeDataChanged with path 
 /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
 2010-06-16 16:44:07,580 DEBUG [main-EventThread] 
 master.ZKUnassignedWatcher(71): ZK-EVENT-PROCESS: Got zkEvent NodeDataChanged 
 state:SyncConnected path:/1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
 2010-06-16 16:44:07,580 INFO  [main-EventThread] 
 regionserver.HRegionServer(379): Got ZooKeeper event, state: SyncConnected, 
 type: NodeDataChanged, path: /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
 2010-06-16 16:44:07,581 DEBUG [RegionServer:0.worker] 
 regionserver.HRegion(294): Creating region 
 test,,1276731846828.de5dcd3df0fbc58207ce6ccff9ff2870.
 2010-06-16 16:44:07,582 DEBUG [MASTER_CLOSEREGION-master-1] 
 handler.MasterOpenRegionHandler(70): Event = RS2ZK_REGION_OPENING, region = 
 de5dcd3df0fbc58207ce6ccff9ff2870
 2010-06-16 16:44:07,582 DEBUG [MASTER_CLOSEREGION-master-1] 
 handler.MasterOpenRegionHandler(81): NO-OP call to handling region opening 
 event
 2010-06-16 16:44:07,589 INFO  [RegionServer:0.worker] 
 regionserver.HRegion(369): region 
 test,,1276731846828.de5dcd3df0fbc58207ce6ccff9ff2870. available; sequence id 
 is 1
 2010-06-16 16:44:07,590 DEBUG [RegionServer:0.worker] 
 regionserver.RSZookeeperUpdater(157): Updating ZNode 
 /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870 with [RS2ZK_REGION_OPENED] 
 expected version = 1
 2010-06-16 16:44:07,591 DEBUG [main-EventThread] master.HMaster(1142): Event 
 NodeDataChanged with state SyncConnected with path 
 /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
 2010-06-16 16:44:07,591 DEBUG [main-EventThread] 
 master.ZKMasterAddressWatcher(64): Got event NodeDataChanged with path 
 /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
 2010-06-16 16:44:07,592 DEBUG [main-EventThread] 
 master.ZKUnassignedWatcher(71): ZK-EVENT-PROCESS: Got zkEvent NodeDataChanged 
 state:SyncConnected path:/1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
 2010-06-16 16:44:07,591 INFO  [main-EventThread] 
 regionserver.HRegionServer(379): Got ZooKeeper event, state: SyncConnected, 
 type: NodeDataChanged, path: /1/UNASSIGNED/de5dcd3df0fbc58207ce6ccff9ff2870
 2010-06-16 16:44:07,593 DEBUG [MASTER_CLOSEREGION-master-1] 
 handler.MasterOpenRegionHandler(70): Event = RS2ZK_REGION_OPENED, region = 
 de5dcd3df0fbc58207ce6ccff9ff2870
 2010-06-16 16:44:07,594 DEBUG [MASTER_CLOSEREGION-master-1] 
 handler.MasterOpenRegionHandler(96): RS 10.10.1.130,62425,1276731832950 has 
 opened region de5dcd3df0fbc58207ce6ccff9ff2870
 2010-06-16 16:44:07,594 ERROR [MASTER_CLOSEREGION-master-1] 
 server.NIOServerCnxn$Factory$1(81): Thread 
 Thread[MASTER_CLOSEREGION-master-1,5,main] died
 java.lang.NullPointerException
 at 
 org.apache.hadoop.hbase.master.ServerManager.processRegionOpen(ServerManager.java:607)
 at 
 

[jira] Updated: (HBASE-2737) CME in ZKW introduced in HBASE-2694

2010-06-18 Thread Karthik Ranganathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Ranganathan updated HBASE-2737:
---

Attachment: HBASE-2737-0.21.patch

Making the register and unregister methods synchronized. Unit tests are 
passing. This change is so simple I am not putting it up on review board.
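
The patch itself just synchronizes register/unregister; as an illustration of 
the same problem, the simplified sketch below (not the actual ZooKeeperWrapper 
code) uses a CopyOnWriteArrayList instead, whose iterators never throw 
ConcurrentModificationException while listeners are being added or removed.

{code}
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class SafeListenerRegistry {
  public interface Listener {
    void process(String event);
  }

  // CopyOnWriteArrayList iterators work on a snapshot of the list.
  private final List<Listener> listeners = new CopyOnWriteArrayList<Listener>();

  public void register(Listener l) {
    listeners.add(l);
  }

  public void unregister(Listener l) {
    listeners.remove(l);
  }

  // Called from the ZK event thread; safe even if register/unregister run
  // concurrently from other threads.
  public void process(String event) {
    for (Listener l : listeners) {
      l.process(event);
    }
  }
}
{code}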

 CME in ZKW introduced in HBASE-2694
 ---

 Key: HBASE-2737
 URL: https://issues.apache.org/jira/browse/HBASE-2737
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: Karthik Ranganathan
 Fix For: 0.21.0

 Attachments: HBASE-2737-0.21.patch


 Saw this while tail'ing a log for something else:
 {code}
 2010-06-15 17:30:03,769 ERROR [main-EventThread] 
 zookeeper.ClientCnxn$EventThread(490): Error while calling watcher
 java.util.ConcurrentModificationException
 at 
 java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372)
 at java.util.AbstractList$Itr.next(AbstractList.java:343)
 at 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper.process(ZooKeeperWrapper.java:235)
 {code}
 Looks like the listeners list's iterator is used in an unprotected manner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files

2010-06-18 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880390#action_12880390
 ] 

Jean-Daniel Cryans commented on HBASE-2752:
---

I like it. Some comments:

 - requeueCount in FQE could be a boolean; that's how it's used.
 - isMaximumWait isn't documented.

With that fixed and some cluster load testing, I'm +1 for commit.

 Don't retry forever when waiting on too many store files
 

 Key: HBASE-2752
 URL: https://issues.apache.org/jira/browse/HBASE-2752
 Project: HBase
  Issue Type: Improvement
Reporter: Jean-Daniel Cryans
Assignee: stack
Priority: Critical
 Fix For: 0.20.5, 0.21.0

 Attachments: 2752.txt


 HBASE-2087 introduced a way to not block all flushes when one region has too 
 many store files. Unfortunately, that undid the behavior where, if we waited 
 longer than 90 secs, we would still flush the region anyway... which means 
 that when a region blocks inserts because its memstore is too big, it's 
 actually holding off writes for a very long time, occupying handlers, etc.
 We need to add more smarts to MemStoreFlusher so that we detect when a region 
 has been held up for too long.
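
In other words (a minimal sketch, not the actual MemStoreFlusher code; names 
are made up): wait for compactions to bring the store file count down, but 
never longer than a configured blocking time before flushing anyway.

{code}
public class BlockingFlushSketch {

  interface Region {
    int getStoreFileCount();
    void flush();
  }

  static void flushWithBoundedWait(Region region, int blockingStoreFiles,
      long blockingWaitTimeMillis) throws InterruptedException {
    long start = System.currentTimeMillis();
    while (region.getStoreFileCount() >= blockingStoreFiles) {
      long waited = System.currentTimeMillis() - start;
      if (waited >= blockingWaitTimeMillis) {
        // Waited long enough for a compaction to clean up 'too many store
        // files'; proceed with the flush regardless.
        break;
      }
      Thread.sleep(Math.min(1000L, blockingWaitTimeMillis - waited));
    }
    region.flush();
  }
}
{code}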

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files

2010-06-18 Thread Dave Latham (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880422#action_12880422
 ] 

Dave Latham commented on HBASE-2752:


Thanks for the quick work.  It's really appreciated.  I'll try to get this 
patch tested on a cluster.

Minor nits:
* The log on 'Cache flush failed' should use toStringBinary for the region name.
* blockingWaitTime / 100 seems somewhat arbitrary for check interval, but 
probably fine for now.


 Don't retry forever when waiting on too many store files
 

 Key: HBASE-2752
 URL: https://issues.apache.org/jira/browse/HBASE-2752
 Project: HBase
  Issue Type: Improvement
Reporter: Jean-Daniel Cryans
Assignee: stack
Priority: Critical
 Fix For: 0.20.5, 0.21.0

 Attachments: 2752.txt


 HBASE-2087 introduced a way to not block all flushes when one region has too 
 many store files. Unfortunately, that undid the behavior where, if we waited 
 longer than 90 secs, we would still flush the region anyway... which means 
 that when a region blocks inserts because its memstore is too big, it's 
 actually holding off writes for a very long time, occupying handlers, etc.
 We need to add more smarts to MemStoreFlusher so that we detect when a region 
 has been held up for too long.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files

2010-06-18 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880426#action_12880426
 ] 

stack commented on HBASE-2752:
--

Thanks j-d for the review.  I added your first suggestion.  For the second, I 
kept the count; I think it'll be of use when we have a jsp page that dumps the 
current state of the flush queue.

I've been running it up on cluster.  I see some of these during a big upload:

{code}
2010-06-18 18:02:17,864 INFO 
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Waited 90495ms on a 
compaction to clean up 'too many store files'; waited long enough... proceeding 
with flush
{code}

...so it looks like we got the 0.20.3 behavior back, where we'll go ahead and 
flush regardless once we've waited N ms (I left the interval at the 0.20.3 
value of 90 seconds, which seems a bit long, but...).

I'm going to commit and roll an RC.

 Don't retry forever when waiting on too many store files
 

 Key: HBASE-2752
 URL: https://issues.apache.org/jira/browse/HBASE-2752
 Project: HBase
  Issue Type: Improvement
Reporter: Jean-Daniel Cryans
Assignee: stack
Priority: Critical
 Fix For: 0.20.5, 0.21.0

 Attachments: 2752.txt


 HBASE-2087 introduced a way to not block all flushes when one region has too 
 many store files. Unfortunately, that undid the behavior where, if we waited 
 longer than 90 secs, we would still flush the region anyway... which means 
 that when a region blocks inserts because its memstore is too big, it's 
 actually holding off writes for a very long time, occupying handlers, etc.
 We need to add more smarts to MemStoreFlusher so that we detect when a region 
 has been held up for too long.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2752) Don't retry forever when waiting on too many store files

2010-06-18 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880427#action_12880427
 ] 

stack commented on HBASE-2752:
--

Applied to branch and trunk (Let's talk, jgray).

 Don't retry forever when waiting on too many store files
 

 Key: HBASE-2752
 URL: https://issues.apache.org/jira/browse/HBASE-2752
 Project: HBase
  Issue Type: Improvement
Reporter: Jean-Daniel Cryans
Assignee: stack
Priority: Critical
 Fix For: 0.20.5, 0.21.0

 Attachments: 2752.txt


 HBASE-2087 introduced a way to not block all flushes when one region has too 
 many store files. Unfortunately, that undid the behavior where, if we waited 
 longer than 90 secs, we would still flush the region anyway... which means 
 that when a region blocks inserts because its memstore is too big, it's 
 actually holding off writes for a very long time, occupying handlers, etc.
 We need to add more smarts to MemStoreFlusher so that we detect when a region 
 has been held up for too long.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.