[jira] Updated: (HBASE-50) Snapshot of table

2010-06-25 Thread Li Chongxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Chongxin updated HBASE-50:
-

Attachment: Snapshot Class Diagram.png

 Snapshot of table
 -

 Key: HBASE-50
 URL: https://issues.apache.org/jira/browse/HBASE-50
 Project: HBase
  Issue Type: New Feature
Reporter: Billy Pearson
Assignee: Li Chongxin
Priority: Minor
 Attachments: HBase Snapshot Design Report V2.pdf, HBase Snapshot 
 Design Report V3.pdf, Snapshot Class Diagram.png, snapshot-src.zip


 Having an option to take a snapshot of a table would be very useful in 
 production.
 What I would like to see this option do is merge all of the data into 
 one or more files stored in the same folder on the DFS. This way we could 
 save data in case of a software bug in Hadoop or user code. 
 The other advantage would be the ability to export a table to multiple locations. 
 Say I had a read-only table that must be online. I could take a snapshot of 
 it when needed, export it to a separate data center, and have it loaded 
 there; then I would have it online at multiple data centers for load 
 balancing and failover.
 I understand that Hadoop removes the need for backups to protect against 
 failed servers, but this does not protect us from software bugs that 
 might delete or alter data in ways we did not plan. We should have a way to 
 roll back a dataset.
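
A purely illustrative sketch of the kind of operations this feature request asks for (the interface and method names below are invented for illustration; HBASE-50 is still at the design stage and no such API exists yet):

{code}
// Hypothetical sketch only -- these are not existing HBase classes or methods.
public interface TableSnapshotAdmin {

  /** Merge the table's data into one or more files under a single folder on the DFS. */
  void snapshot(String snapshotName, String tableName) throws java.io.IOException;

  /** Export a previously taken snapshot to another cluster or data center. */
  void exportSnapshot(String snapshotName, String targetDfsUri) throws java.io.IOException;

  /** Roll a table back to an earlier snapshot after a bad software bug. */
  void restoreSnapshot(String snapshotName, String tableName) throws java.io.IOException;
}
{code}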

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-50) Snapshot of table

2010-06-25 Thread Li Chongxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Chongxin updated HBASE-50:
-

Attachment: HBase Snapshot Implementation Plan.pdf

The HBase Snapshot Implementation Plan describes the classes and methods that will 
be created or modified to support snapshots. Please go over the document together 
with the class diagram. Any comments are welcome!

 Snapshot of table
 -

 Key: HBASE-50
 URL: https://issues.apache.org/jira/browse/HBASE-50
 Project: HBase
  Issue Type: New Feature
Reporter: Billy Pearson
Assignee: Li Chongxin
Priority: Minor
 Attachments: HBase Snapshot Design Report V2.pdf, HBase Snapshot 
 Design Report V3.pdf, HBase Snapshot Implementation Plan.pdf, Snapshot Class 
 Diagram.png


 Having an option to take a snapshot of a table would be very useful in 
 production.
 What I would like to see this option do is merge all of the data into 
 one or more files stored in the same folder on the DFS. This way we could 
 save data in case of a software bug in Hadoop or user code. 
 The other advantage would be the ability to export a table to multiple locations. 
 Say I had a read-only table that must be online. I could take a snapshot of 
 it when needed, export it to a separate data center, and have it loaded 
 there; then I would have it online at multiple data centers for load 
 balancing and failover.
 I understand that Hadoop removes the need for backups to protect against 
 failed servers, but this does not protect us from software bugs that 
 might delete or alter data in ways we did not plan. We should have a way to 
 roll back a dataset.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-50) Snapshot of table

2010-06-25 Thread Li Chongxin (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Chongxin updated HBASE-50:
-

Attachment: (was: snapshot-src.zip)

 Snapshot of table
 -

 Key: HBASE-50
 URL: https://issues.apache.org/jira/browse/HBASE-50
 Project: HBase
  Issue Type: New Feature
Reporter: Billy Pearson
Assignee: Li Chongxin
Priority: Minor
 Attachments: HBase Snapshot Design Report V2.pdf, HBase Snapshot 
 Design Report V3.pdf, HBase Snapshot Implementation Plan.pdf, Snapshot Class 
 Diagram.png


 Having an option to take a snapshot of a table would be very useful in 
 production.
 What I would like to see this option do is merge all of the data into 
 one or more files stored in the same folder on the DFS. This way we could 
 save data in case of a software bug in Hadoop or user code. 
 The other advantage would be the ability to export a table to multiple locations. 
 Say I had a read-only table that must be online. I could take a snapshot of 
 it when needed, export it to a separate data center, and have it loaded 
 there; then I would have it online at multiple data centers for load 
 balancing and failover.
 I understand that Hadoop removes the need for backups to protect against 
 failed servers, but this does not protect us from software bugs that 
 might delete or alter data in ways we did not plan. We should have a way to 
 roll back a dataset.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2789) Propagate HBase config from Master to region servers

2010-06-25 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882551#action_12882551
 ] 

Ted Yu commented on HBASE-2789:
---

Yes.

 Propagate HBase config from Master to region servers
 

 Key: HBASE-2789
 URL: https://issues.apache.org/jira/browse/HBASE-2789
 Project: HBase
  Issue Type: Improvement
  Components: master
Affects Versions: 0.20.3
Reporter: Ted Yu

 If the HBase config is modified while the HBase cluster is running, the changes 
 do not propagate to the region servers after restarting the cluster.
 This is different from Hadoop's behavior, where changes get automatically copied 
 to the data nodes.
 This feature is desirable when enabling JMX, e.g.
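
As a minimal illustration of why this happens (this snippet is plain Hadoop Configuration usage, not HBase internals): each daemon builds its configuration from the hbase-site.xml on its own classpath, so editing the file on the master alone does not change what a region server sees after a restart; the file has to be copied to every node, e.g. when turning on JMX.

{code}
import org.apache.hadoop.conf.Configuration;

public class LocalConfigExample {
  public static void main(String[] args) {
    // Each server reads the hbase-site.xml that sits on its *local* classpath.
    Configuration conf = new Configuration();
    conf.addResource("hbase-default.xml"); // bundled defaults
    conf.addResource("hbase-site.xml");    // this node's site file, not the master's
    System.out.println("hbase.regionserver.port = "
        + conf.get("hbase.regionserver.port", "60020"));
  }
}
{code}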

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2707) Can't recover from a dead ROOT server if any exceptions happens during log splitting

2010-06-25 Thread Jean-Daniel Cryans (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882638#action_12882638
 ] 

Jean-Daniel Cryans commented on HBASE-2707:
---

So the code of process() actually looks like this:

{code}
LOG.info("Log split complete, meta reassignment and scanning:");
if (this.isRootServer) {
  LOG.info("ProcessServerShutdown reassigning ROOT region");
  master.getRegionManager().reassignRootRegion();
  isRootServer = false;  // prevent double reassignment... heh.
}

for (MetaRegion metaRegion : metaRegions) {
  LOG.info("ProcessServerShutdown setting to unassigned: " + metaRegion.toString());
  master.getRegionManager().setUnassigned(metaRegion.getRegionInfo(), true);
}
// Once the meta regions are online, forget about them.  Since there are explicit
// checks below to make sure meta/root are online, this is likely to occur.
metaRegions.clear();

if (!rootAvailable()) {
  // Return true so that worker does not put this request back on the
  // toDoQueue.
  // rootAvailable() has already put it on the delayedToDoQueue
  return true;
}

if (!rootRescanned) {
  // Scan the ROOT region
  Boolean result = new ScanRootRegion(
      new MetaRegion(master.getRegionManager().getRootRegionLocation(),
          HRegionInfo.ROOT_REGIONINFO), this.master).doWithRetries();
  if (result == null) {
    // Master is closing - give up
    return true;
  }

  if (LOG.isDebugEnabled()) {
    LOG.debug("Process server shutdown scanning root region on " +
        master.getRegionManager().getRootRegionLocation().getBindAddress() +
        " finished " + Thread.currentThread().getName());
  }
  rootRescanned = true;
}
{code}

So if the RS was carrying -ROOT-, it gets reassigned right away and then the method 
returns early if !rootAvailable(). Later, when we come back and -ROOT- has been 
assigned, ProcessServerShutdown finishes its job. This is how the code you pasted 
succeeds.

 Can't recover from a dead ROOT server if any exceptions happens during log 
 splitting
 

 Key: HBASE-2707
 URL: https://issues.apache.org/jira/browse/HBASE-2707
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: stack
Priority: Blocker
 Fix For: 0.21.0

 Attachments: HBASE-2707.patch


 There's an almost easy way to get stuck after a RS holding ROOT dies, usually 
 from a GC-like event. It happens frequently to my TestReplication in 
 HBASE-2223.
 Some logs:
 {code}
 2010-06-10 11:35:52,090 INFO  [master] wal.HLog(1175): Spliting is done. 
 Removing old log dir 
 hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
 2010-06-10 11:35:52,095 WARN  [master] 
 master.RegionServerOperationQueue(183): Failed processing: 
 ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed 
 todo queue
 java.io.IOException: Cannot delete: 
 hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
 at 
 org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
 at 
 org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
 Caused by: java.io.IOException: java.io.IOException: 
 /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
 2010-06-10 11:35:52,097 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:53,098 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:53,523 INFO  [main.serverMonitor] 
 master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average 
 load 14.0[10.10.1.63,55846,1276194933831]
 2010-06-10 11:35:54,099 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:55,101 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 {code}
 The last lines are my own debug. Since we don't process the delayed todo if 
 ROOT isn't online, we'll never reassign the regions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2707) Can't recover from a dead ROOT server if any exceptions happens during log splitting

2010-06-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882649#action_12882649
 ] 

stack commented on HBASE-2707:
--

So it's broken then?  We assign -ROOT- but don't recover its edits?

 Can't recover from a dead ROOT server if any exceptions happens during log 
 splitting
 

 Key: HBASE-2707
 URL: https://issues.apache.org/jira/browse/HBASE-2707
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: stack
Priority: Blocker
 Fix For: 0.21.0

 Attachments: HBASE-2707.patch


 There's an almost easy way to get stuck after a RS holding ROOT dies, usually 
 from a GC-like event. It happens frequently to my TestReplication in 
 HBASE-2223.
 Some logs:
 {code}
 2010-06-10 11:35:52,090 INFO  [master] wal.HLog(1175): Spliting is done. 
 Removing old log dir 
 hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
 2010-06-10 11:35:52,095 WARN  [master] 
 master.RegionServerOperationQueue(183): Failed processing: 
 ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed 
 todo queue
 java.io.IOException: Cannot delete: 
 hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
 at 
 org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
 at 
 org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
 Caused by: java.io.IOException: java.io.IOException: 
 /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
 2010-06-10 11:35:52,097 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:53,098 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:53,523 INFO  [main.serverMonitor] 
 master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average 
 load 14.0[10.10.1.63,55846,1276194933831]
 2010-06-10 11:35:54,099 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:55,101 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 {code}
 The last lines are my own debug. Since we don't process the delayed todo if 
 ROOT isn't online, we'll never reassign the regions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-2707) Can't recover from a dead ROOT server if any exceptions happens during log splitting

2010-06-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882653#action_12882653
 ] 

stack commented on HBASE-2707:
--

Hmm... chatted with J-D and he points out that the above runs AFTER the logs are 
split, so I had it wrong.  The above should be good.

 Can't recover from a dead ROOT server if any exceptions happens during log 
 splitting
 

 Key: HBASE-2707
 URL: https://issues.apache.org/jira/browse/HBASE-2707
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: stack
Priority: Blocker
 Fix For: 0.21.0

 Attachments: HBASE-2707.patch


 There's an almost easy way to get stuck after a RS holding ROOT dies, usually 
 from a GC-like event. It happens frequently to my TestReplication in 
 HBASE-2223.
 Some logs:
 {code}
 2010-06-10 11:35:52,090 INFO  [master] wal.HLog(1175): Spliting is done. 
 Removing old log dir 
 hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
 2010-06-10 11:35:52,095 WARN  [master] 
 master.RegionServerOperationQueue(183): Failed processing: 
 ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed 
 todo queue
 java.io.IOException: Cannot delete: 
 hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
 at 
 org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
 at 
 org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
 Caused by: java.io.IOException: java.io.IOException: 
 /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
 2010-06-10 11:35:52,097 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:53,098 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:53,523 INFO  [main.serverMonitor] 
 master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average 
 load 14.0[10.10.1.63,55846,1276194933831]
 2010-06-10 11:35:54,099 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:55,101 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 {code}
 The last lines are my own debug. Since we don't process the delayed todo if 
 ROOT isn't online, we'll never reassign the regions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HBASE-2790) Purge apache-forrest from TRUNK

2010-06-25 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-2790.
--

 Assignee: stack
Fix Version/s: 0.21.0
   Resolution: Fixed

Committed.  Removed the top-level docs dir (it's generated).  While here, 
removed the building of test and source jars into the -bin.tgz bundle.

 Purge apache-forrest from TRUNK
 ---

 Key: HBASE-2790
 URL: https://issues.apache.org/jira/browse/HBASE-2790
 Project: HBase
  Issue Type: Task
Reporter: stack
Assignee: stack
 Fix For: 0.21.0


 Remove all of the apache-forrest dirs from TRUNK.  We don't do apache-forrest 
 any more.  We use maven to generate our site.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-2791) Stop dumping exceptions coming from ZK and do nothing about them

2010-06-25 Thread Jean-Daniel Cryans (JIRA)
Stop dumping exceptions coming from ZK and do nothing about them


 Key: HBASE-2791
 URL: https://issues.apache.org/jira/browse/HBASE-2791
 Project: HBase
  Issue Type: Improvement
Reporter: Jean-Daniel Cryans
 Fix For: 0.21.0


I think this is part of the Master/ZooKeeper refactoring project but I'm 
putting it up here to be sure we cover it. Currently in ZKW (and other places 
around the code base) we do ZK operations and we don't really handle the 
exceptions, for example in ZKW.setClusterState:

{code}
} catch (InterruptedException e) {
  LOG.warn("<" + instanceName + ">" + "Failed to set state node in ZooKeeper", e);
} catch (KeeperException e) {
  if(e.code() == KeeperException.Code.NODEEXISTS) {
    LOG.debug("<" + instanceName + ">" + "State node exists.");
  } else {
    LOG.warn("<" + instanceName + ">" + "Failed to set state node in ZooKeeper", e);
  }
{code}

This has always been like that since we started using ZK.

What if the session was expired? What if it was only the connection that had a 
blip? Do we handle it correctly? We need to have this discussion.
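
For the sake of that discussion, here is a hedged sketch of what more deliberate handling could look like; handleZkFailure, retryLater and abortProcess are hypothetical hooks, not existing ZKW methods, and only mark the decisions the current code skips:

{code}
// Hypothetical sketch for discussion only -- none of these methods exist in ZKW today.
import org.apache.zookeeper.KeeperException;

public class ZkFailureHandling {

  void handleZkFailure(KeeperException e) {
    switch (e.code()) {
      case NODEEXISTS:
        // Benign for setClusterState: the state node is already there.
        return;
      case CONNECTIONLOSS:
      case OPERATIONTIMEOUT:
        // Connection blip: the session may still be alive, so a retry is safe.
        retryLater();
        return;
      case SESSIONEXPIRED:
        // Session is gone: ephemeral nodes and watches are lost, so a warning
        // in the log is not enough; the component probably has to abort/restart.
        abortProcess(e);
        return;
      default:
        abortProcess(e);
    }
  }

  void retryLater() { /* hypothetical re-queue hook */ }

  void abortProcess(KeeperException e) { /* hypothetical abort hook */ }
}
{code}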

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (HBASE-2792) Create a better way to chain log cleaners

2010-06-25 Thread Jean-Daniel Cryans (JIRA)
Create a better way to chain log cleaners
-

 Key: HBASE-2792
 URL: https://issues.apache.org/jira/browse/HBASE-2792
 Project: HBase
  Issue Type: Improvement
Reporter: Jean-Daniel Cryans
 Fix For: 0.21.0


From Stack's review of HBASE-2223:

{quote}
Why does this implementation have to know about other implementations?  Can't we do 
a chain of decision classes?  Any class can say no?  As soon as any decision 
class says no, we exit the chain.  So in this case, first on the chain would 
be the TTL decision... then would be this one... and third would be the 
snapshotting decision.  You don't have to do the chain as part of this patch but 
please open an issue to implement it.
{quote}
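
A rough sketch of the kind of chain Stack describes above; the LogCleanerDelegate and CleanerChain names are invented for illustration and nothing like this exists in the code base yet:

{code}
// Illustrative sketch only -- these are not existing HBase classes.
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.fs.Path;

interface LogCleanerDelegate {
  /** @return false to veto deletion of this old log file. */
  boolean isLogDeletable(Path oldLogFile);
}

class CleanerChain {
  private final List<LogCleanerDelegate> delegates;

  CleanerChain(LogCleanerDelegate... delegates) {
    this.delegates = Arrays.asList(delegates);
  }

  /** Walk the chain in order; the first delegate that says no wins. */
  boolean isLogDeletable(Path oldLogFile) {
    for (LogCleanerDelegate d : delegates) {
      if (!d.isLogDeletable(oldLogFile)) {
        return false;
      }
    }
    return true;
  }
}

// Per the review, the order would be: the TTL decision first, then the replication
// decision from HBASE-2223, then the snapshotting decision, e.g.
//   new CleanerChain(ttlCleaner, replicationCleaner, snapshotCleaner);
{code}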

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (HBASE-2707) Can't recover from a dead ROOT server if any exceptions happens during log splitting

2010-06-25 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-2707:
-

Attachment: 2707-test.txt

A test that delays the processing of the shutdown of the server that was 
carrying -ROOT-.  Without the patch this test never completes; with the patch in 
place, it does.

 Can't recover from a dead ROOT server if any exceptions happens during log 
 splitting
 

 Key: HBASE-2707
 URL: https://issues.apache.org/jira/browse/HBASE-2707
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: stack
Priority: Blocker
 Fix For: 0.21.0

 Attachments: 2707-test.txt, HBASE-2707.patch


 There's an almost easy way to get stuck after a RS holding ROOT dies, usually 
 from a GC-like event. It happens frequently to my TestReplication in 
 HBASE-2223.
 Some logs:
 {code}
 2010-06-10 11:35:52,090 INFO  [master] wal.HLog(1175): Spliting is done. 
 Removing old log dir 
 hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
 2010-06-10 11:35:52,095 WARN  [master] 
 master.RegionServerOperationQueue(183): Failed processing: 
 ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed 
 todo queue
 java.io.IOException: Cannot delete: 
 hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
 at 
 org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
 at 
 org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
 Caused by: java.io.IOException: java.io.IOException: 
 /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
 2010-06-10 11:35:52,097 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:53,098 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:53,523 INFO  [main.serverMonitor] 
 master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average 
 load 14.0[10.10.1.63,55846,1276194933831]
 2010-06-10 11:35:54,099 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:55,101 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 {code}
 The last lines are my own debug. Since we don't process the delayed todo if 
 ROOT isn't online, we'll never reassign the regions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (HBASE-2707) Can't recover from a dead ROOT server if any exceptions happens during log splitting

2010-06-25 Thread stack (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack resolved HBASE-2707.
--

Hadoop Flags: [Reviewed]
  Resolution: Fixed

Committed.  Thanks for the review, J-D (I removed DELAY altogether).

 Can't recover from a dead ROOT server if any exceptions happens during log 
 splitting
 

 Key: HBASE-2707
 URL: https://issues.apache.org/jira/browse/HBASE-2707
 Project: HBase
  Issue Type: Bug
Reporter: Jean-Daniel Cryans
Assignee: stack
Priority: Blocker
 Fix For: 0.21.0

 Attachments: 2707-test.txt, HBASE-2707.patch


 There's an almost easy way to get stuck after a RS holding ROOT dies, usually 
 from a GC-like event. It happens frequently to my TestReplication in 
 HBASE-2223.
 Some logs:
 {code}
 2010-06-10 11:35:52,090 INFO  [master] wal.HLog(1175): Spliting is done. 
 Removing old log dir 
 hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
 2010-06-10 11:35:52,095 WARN  [master] 
 master.RegionServerOperationQueue(183): Failed processing: 
 ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed 
 todo queue
 java.io.IOException: Cannot delete: 
 hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
 at 
 org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
 at 
 org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
 at 
 org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
 Caused by: java.io.IOException: java.io.IOException: 
 /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
 2010-06-10 11:35:52,097 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:53,098 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:53,523 INFO  [main.serverMonitor] 
 master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average 
 load 14.0[10.10.1.63,55846,1276194933831]
 2010-06-10 11:35:54,099 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 2010-06-10 11:35:55,101 DEBUG [master] 
 master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process 
 delayedToDoQueue items
 {code}
 The last lines are my own debug. Since we don't process the delayed todo if 
 ROOT isn't online, we'll never reassign the regions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HBASE-50) Snapshot of table

2010-06-25 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-50?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12882779#action_12882779
 ] 

stack commented on HBASE-50:


I'll make a branch to host Li's work going forward.

 Snapshot of table
 -

 Key: HBASE-50
 URL: https://issues.apache.org/jira/browse/HBASE-50
 Project: HBase
  Issue Type: New Feature
Reporter: Billy Pearson
Assignee: Li Chongxin
Priority: Minor
 Attachments: HBase Snapshot Design Report V2.pdf, HBase Snapshot 
 Design Report V3.pdf, HBase Snapshot Implementation Plan.pdf, Snapshot Class 
 Diagram.png


 Having an option to take a snapshot of a table would be very useful in 
 production.
 What I would like to see this option do is merge all of the data into 
 one or more files stored in the same folder on the DFS. This way we could 
 save data in case of a software bug in Hadoop or user code. 
 The other advantage would be the ability to export a table to multiple locations. 
 Say I had a read-only table that must be online. I could take a snapshot of 
 it when needed, export it to a separate data center, and have it loaded 
 there; then I would have it online at multiple data centers for load 
 balancing and failover.
 I understand that Hadoop removes the need for backups to protect against 
 failed servers, but this does not protect us from software bugs that 
 might delete or alter data in ways we did not plan. We should have a way to 
 roll back a dataset.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.