[jira] [Updated] (HBASE-5689) Skipping RecoveredEdits may cause data loss

2012-04-05 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5689:


Attachment: HBASE-5689v2.patch

Sorry for the late response.

In patch v2, I added documentation for getCompletedRecoveredEditsFilePath(), as
suggested in Stack's comment. Everything else is the same as patch v1.

Please review again.
Thanks!

 Skipping RecoveredEdits may cause data loss
 ---

 Key: HBASE-5689
 URL: https://issues.apache.org/jira/browse/HBASE-5689
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.0
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.94.0

 Attachments: 5689-testcase.patch, HBASE-5689.patch, HBASE-5689.patch, 
 HBASE-5689v2.patch


 Consider the following scenario:
 1. Region is on server A.
 2. Put KV(r1-v1) to the region.
 3. Move the region from server A to server B.
 4. Put KV(r2-v2) to the region.
 5. Move the region from server B to server A.
 6. Put KV(r3-v3) to the region.
 7. kill -9 server B and start it.
 8. kill -9 server A and start it.
 9. Scan the region: we can only get two KVs (r1-v1, r2-v2); the third
 KV (r3-v3) is lost.
 Let's analyze the above scenario from the code:
 1. The edit logs of KV(r1-v1) and KV(r3-v3) are both recorded in the same
 hlog file on server A.
 2. When we split server B's hlog file in the process of ServerShutdownHandler,
 we create one RecoveredEdits file f1 for the region.
 3. When we split server A's hlog file in the process of ServerShutdownHandler,
 we create another RecoveredEdits file f2 for the region.
 4. However, RecoveredEdits file f2 will be skipped when initializing the region
 in HRegion#replayRecoveredEditsIfAny:
 {code}
 for (Path edits : files) {
   if (edits == null || !this.fs.exists(edits)) {
     LOG.warn("Null or non-existent edits file: " + edits);
     continue;
   }
   if (isZeroLengthThenDelete(this.fs, edits)) continue;
   if (checkSafeToSkip) {
     Path higher = files.higher(edits);
     long maxSeqId = Long.MAX_VALUE;
     if (higher != null) {
       // Edit file name pattern, HLog.EDITFILES_NAME_PATTERN: -?[0-9]+
       String fileName = higher.getName();
       maxSeqId = Math.abs(Long.parseLong(fileName));
     }
     if (maxSeqId <= minSeqId) {
       String msg = "Maximum possible sequenceid for this log is " + maxSeqId
           + ", skipped the whole file, path=" + edits;
       LOG.debug(msg);
       continue;
     } else {
       checkSafeToSkip = false;
     }
   }
 }
 {code}
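 To make the failure concrete, here is a walkthrough with invented sequence ids
 (the file names and ids below are purely illustrative, not taken from the issue):
 {code}
 // Purely illustrative numbers.
 // f2 = recovered edits from server A, named "5":  holds edit 5 (r1) and edit 15 (r3).
 // f1 = recovered edits from server B, named "10": holds edit 10 (r2).
 long minSeqId = 12;   // max sequenceid already persisted in the store files
 // files is sorted by name, so the loop visits f2 first (5 sorts before 10);
 // higher(f2) is f1, and f1's name is taken as f2's maximum possible sequenceid:
 long maxSeqId = 10;
 // maxSeqId <= minSeqId (10 <= 12), so the whole of f2 is skipped,
 // silently dropping the unflushed edit 15 (r3).
 {code}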
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5672) TestLruBlockCache#testBackgroundEvictionThread fails occasionally

2012-04-05 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5672:


Attachment: HBASE-5672v2.patch

In patch v2, I add a flag in the eviction thread.

Please review again.

Thanks.
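For readers of the archive, a minimal sketch of how such a flag could work; the
field and method names below are assumptions for illustration, not necessarily
what HBASE-5672v2.patch uses:
{code}
static class EvictionThread extends Thread {
  private volatile boolean enteringRun = false;   // hypothetical flag name

  @Override
  public void run() {
    enteringRun = true;          // signal that the thread has reached run()
    synchronized (this) {
      // ... existing wait()/evict loop ...
    }
  }

  boolean isEnteringRun() {
    return this.enteringRun;
  }
}

// The test then spins until the thread is really inside run() before
// triggering an eviction, so the notify() cannot be lost:
while (!cache.getEvictionThread().isEnteringRun()) {
  Thread.sleep(1);
}
{code}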

 TestLruBlockCache#testBackgroundEvictionThread fails occasionally
 -

 Key: HBASE-5672
 URL: https://issues.apache.org/jira/browse/HBASE-5672
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: HBASE-5672.patch, HBASE-5672v2.patch


 We find that TestLruBlockCache#testBackgroundEvictionThread fails occasionally.
 I think it's a problem with the test case,
 because runEviction() only calls evictionThread.evict():
 {code}
 public void evict() {
   synchronized(this) {
 this.notify(); // FindBugs NN_NAKED_NOTIFY
   }
 }
 {code}
 However, when we call evictionThread.evict(), the eviction thread may not yet
 have entered run() in TestLruBlockCache#testBackgroundEvictionThread.
 If we run the test many times, the failure shows up easily.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5689) Skipping RecoveredEdits may cause data loss

2012-04-05 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5689:


Attachment: HBASE-5689v3.patch

In patch v3, I append the optimization suggested by Ming Ma:
replayRecoveredEditsIfAny simply walks through all the recovered-edits files and
skips those whose maximum sequence id is <= the current sequence id of the
region.
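A rough sketch of what that simplified walk could look like (assuming, per the
earlier patch, that each recovered-edits file is now named after its maximum
sequence id; the exact code in HBASE-5689v3.patch may differ):
{code}
for (Path edits : files) {
  if (edits == null || !this.fs.exists(edits)) continue;
  if (isZeroLengthThenDelete(this.fs, edits)) continue;
  // The file name itself now bounds the edits inside the file.
  long maxSeqId = Math.abs(Long.parseLong(edits.getName()));
  if (maxSeqId <= minSeqId) {
    LOG.debug("Maximum sequenceid for this log is " + maxSeqId
        + ", <= current sequenceid " + minSeqId + "; skipping " + edits);
    continue;   // every edit in this file is already persisted
  }
  // ... replay the remaining edits from this file ...
}
{code}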

 Skipping RecoveredEdits may cause data loss
 ---

 Key: HBASE-5689
 URL: https://issues.apache.org/jira/browse/HBASE-5689
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Affects Versions: 0.94.0
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.94.0

 Attachments: 5689-testcase.patch, HBASE-5689.patch, HBASE-5689.patch, 
 HBASE-5689v2.patch, HBASE-5689v3.patch


 Consider the following scenario:
 1. Region is on server A.
 2. Put KV(r1-v1) to the region.
 3. Move the region from server A to server B.
 4. Put KV(r2-v2) to the region.
 5. Move the region from server B to server A.
 6. Put KV(r3-v3) to the region.
 7. kill -9 server B and start it.
 8. kill -9 server A and start it.
 9. Scan the region: we can only get two KVs (r1-v1, r2-v2); the third
 KV (r3-v3) is lost.
 Let's analyze the above scenario from the code:
 1. The edit logs of KV(r1-v1) and KV(r3-v3) are both recorded in the same
 hlog file on server A.
 2. When we split server B's hlog file in the process of ServerShutdownHandler,
 we create one RecoveredEdits file f1 for the region.
 3. When we split server A's hlog file in the process of ServerShutdownHandler,
 we create another RecoveredEdits file f2 for the region.
 4. However, RecoveredEdits file f2 will be skipped when initializing the region
 in HRegion#replayRecoveredEditsIfAny:
 {code}
 for (Path edits : files) {
   if (edits == null || !this.fs.exists(edits)) {
     LOG.warn("Null or non-existent edits file: " + edits);
     continue;
   }
   if (isZeroLengthThenDelete(this.fs, edits)) continue;
   if (checkSafeToSkip) {
     Path higher = files.higher(edits);
     long maxSeqId = Long.MAX_VALUE;
     if (higher != null) {
       // Edit file name pattern, HLog.EDITFILES_NAME_PATTERN: -?[0-9]+
       String fileName = higher.getName();
       maxSeqId = Math.abs(Long.parseLong(fileName));
     }
     if (maxSeqId <= minSeqId) {
       String msg = "Maximum possible sequenceid for this log is " + maxSeqId
           + ", skipped the whole file, path=" + edits;
       LOG.debug(msg);
       continue;
     } else {
       checkSafeToSkip = false;
     }
   }
 }
 {code}
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5689) Skip RecoveredEdits may cause data loss

2012-03-31 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5689:


Attachment: HBASE-5689.patch

In the patch,
I use the region's maximum edit-log sequence number (MaximumEditLogSeqNum) as
the name of the RecoveredEdits file.
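For illustration, the naming change could look roughly like the sketch below.
The helper signatures are assumptions for this sketch, though
getCompletedRecoveredEditsFilePath() is mentioned later in this thread as a
method the patch documents:
{code}
// Name the completed recovered-edits file after the largest sequence id
// written into it, instead of the first one, so a reader can bound the
// file's contents from the name alone. (Signatures assumed for the sketch.)
Path getCompletedRecoveredEditsFilePath(Path editsFile, long maximumEditLogSeqNum) {
  String fileName = formatRecoveredEditsFileName(maximumEditLogSeqNum);
  return new Path(editsFile.getParent(), fileName);
}

static String formatRecoveredEditsFileName(final long seqId) {
  return String.format("%019d", seqId);   // zero-padded so names sort numerically
}
{code}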

 Skip RecoveredEdits may cause data loss
 ---

 Key: HBASE-5689
 URL: https://issues.apache.org/jira/browse/HBASE-5689
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5689-testcase.patch, HBASE-5689.patch


 Consider the following scenario:
 1. Region is on server A.
 2. Put KV(r1-v1) to the region.
 3. Move the region from server A to server B.
 4. Put KV(r2-v2) to the region.
 5. Move the region from server B to server A.
 6. Put KV(r3-v3) to the region.
 7. kill -9 server B and start it.
 8. kill -9 server A and start it.
 9. Scan the region: we can only get two KVs (r1-v1, r2-v2); the third
 KV (r3-v3) is lost.
 Let's analyze the above scenario from the code:
 1. The edit logs of KV(r1-v1) and KV(r3-v3) are both recorded in the same
 hlog file on server A.
 2. When we split server B's hlog file in the process of ServerShutdownHandler,
 we create one RecoveredEdits file f1 for the region.
 3. When we split server A's hlog file in the process of ServerShutdownHandler,
 we create another RecoveredEdits file f2 for the region.
 4. However, RecoveredEdits file f2 will be skipped when initializing the region
 in HRegion#replayRecoveredEditsIfAny:
 {code}
 for (Path edits : files) {
   if (edits == null || !this.fs.exists(edits)) {
     LOG.warn("Null or non-existent edits file: " + edits);
     continue;
   }
   if (isZeroLengthThenDelete(this.fs, edits)) continue;
   if (checkSafeToSkip) {
     Path higher = files.higher(edits);
     long maxSeqId = Long.MAX_VALUE;
     if (higher != null) {
       // Edit file name pattern, HLog.EDITFILES_NAME_PATTERN: -?[0-9]+
       String fileName = higher.getName();
       maxSeqId = Math.abs(Long.parseLong(fileName));
     }
     if (maxSeqId <= minSeqId) {
       String msg = "Maximum possible sequenceid for this log is " + maxSeqId
           + ", skipped the whole file, path=" + edits;
       LOG.debug(msg);
       continue;
     } else {
       checkSafeToSkip = false;
     }
   }
 }
 {code}
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5689) Skip RecoveredEdits may cause data loss

2012-03-30 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5689:


Attachment: 5689-testcase.patch

I have written a test case for the issue that reproduces a data-loss scenario
when RecoveredEdits are skipped.

 Skip RecoveredEdits may cause data loss
 ---

 Key: HBASE-5689
 URL: https://issues.apache.org/jira/browse/HBASE-5689
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5689-testcase.patch


 Consider the following scenario:
 1. Region is on server A.
 2. Put KV(r1-v1) to the region.
 3. Move the region from server A to server B.
 4. Put KV(r2-v2) to the region.
 5. Move the region from server B to server A.
 6. Put KV(r3-v3) to the region.
 7. kill -9 server B and start it.
 8. kill -9 server A and start it.
 9. Scan the region: we can only get two KVs (r1-v1, r2-v2); the third
 KV (r3-v3) is lost.
 Let's analyze the above scenario from the code:
 1. The edit logs of KV(r1-v1) and KV(r3-v3) are both recorded in the same
 hlog file on server A.
 2. When we split server B's hlog file in the process of ServerShutdownHandler,
 we create one RecoveredEdits file f1 for the region.
 3. When we split server A's hlog file in the process of ServerShutdownHandler,
 we create another RecoveredEdits file f2 for the region.
 4. However, RecoveredEdits file f2 will be skipped when initializing the region
 in HRegion#replayRecoveredEditsIfAny:
 {code}
 for (Path edits : files) {
   if (edits == null || !this.fs.exists(edits)) {
     LOG.warn("Null or non-existent edits file: " + edits);
     continue;
   }
   if (isZeroLengthThenDelete(this.fs, edits)) continue;
   if (checkSafeToSkip) {
     Path higher = files.higher(edits);
     long maxSeqId = Long.MAX_VALUE;
     if (higher != null) {
       // Edit file name pattern, HLog.EDITFILES_NAME_PATTERN: -?[0-9]+
       String fileName = higher.getName();
       maxSeqId = Math.abs(Long.parseLong(fileName));
     }
     if (maxSeqId <= minSeqId) {
       String msg = "Maximum possible sequenceid for this log is " + maxSeqId
           + ", skipped the whole file, path=" + edits;
       LOG.debug(msg);
       continue;
     } else {
       checkSafeToSkip = false;
     }
   }
 }
 {code}
  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5672) TestLruBlockCache#testBackgroundEvictionThread fails occasionally

2012-03-28 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5672:


Description: 
We find that TestLruBlockCache#testBackgroundEvictionThread fails occasionally.

I think it's a problem with the test case,
because runEviction() only calls evictionThread.evict():
{code}
public void evict() {
  synchronized(this) {
    this.notify(); // FindBugs NN_NAKED_NOTIFY
  }
}
{code}
However, when we call evictionThread.evict(), the eviction thread may not yet
have entered run() in TestLruBlockCache#testBackgroundEvictionThread.

If we run the test many times, the failure shows up easily.

  was:
We find that TestLruBlockCache#testBackgroundEvictionThread fails occasionally.

I think it's a problem with the test case,
because runEviction() only calls evictionThread.evict():
{code}
public void evict() {
  synchronized(this) {
    this.notify(); // FindBugs NN_NAKED_NOTIFY
  }
}
{code}
However, when we call evictionThread.evict(), the eviction thread may not yet
have entered run() in TestLruBlockCache#testBackgroundEvictionThread.


 TestLruBlockCache#testBackgroundEvictionThread fails occasionally
 -

 Key: HBASE-5672
 URL: https://issues.apache.org/jira/browse/HBASE-5672
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen

 We find that TestLruBlockCache#testBackgroundEvictionThread fails occasionally.
 I think it's a problem with the test case,
 because runEviction() only calls evictionThread.evict():
 {code}
 public void evict() {
   synchronized(this) {
 this.notify(); // FindBugs NN_NAKED_NOTIFY
   }
 }
 {code}
 However, when we call evictionThread.evict(), the eviction thread may not yet
 have entered run() in TestLruBlockCache#testBackgroundEvictionThread.
 If we run the test many times, the failure shows up easily.
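 As an aside for readers, a minimal, self-contained reproduction of the
 lost-notify race described above (invented for illustration; not code from
 the test):
 {code}
 // A notify() only wakes a thread that is already wait()ing; if it fires
 // before the target thread reaches wait(), the wakeup is simply lost.
 final Object lock = new Object();
 Thread evictionLike = new Thread(new Runnable() {
   public void run() {
     synchronized (lock) {
       try {
         lock.wait();   // only reached some time after start()
       } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
       }
     }
   }
 });
 evictionLike.start();
 synchronized (lock) {
   lock.notify();   // if run() has not reached wait() yet, this is a no-op
 }
 // evictionLike may now wait() forever, just as the eviction thread can
 // miss the evict() call in the test.
 {code}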

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5672) TestLruBlockCache#testBackgroundEvictionThread fails occasionally

2012-03-28 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5672:


Attachment: HBASE-5672.patch

No failures in 1000 runs with the patch.

 TestLruBlockCache#testBackgroundEvictionThread fails occasionally
 -

 Key: HBASE-5672
 URL: https://issues.apache.org/jira/browse/HBASE-5672
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: HBASE-5672.patch


 We find that TestLruBlockCache#testBackgroundEvictionThread fails occasionally.
 I think it's a problem with the test case,
 because runEviction() only calls evictionThread.evict():
 {code}
 public void evict() {
   synchronized(this) {
 this.notify(); // FindBugs NN_NAKED_NOTIFY
   }
 }
 {code}
 However, when we call evictionThread.evict(), the eviction thread may not yet
 have entered run() in TestLruBlockCache#testBackgroundEvictionThread.
 If we run the test many times, the failure shows up easily.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5624) Aborting regionserver when splitting region, may cause daughter region not assigned by ServerShutdownHandler.

2012-03-22 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5624:


Attachment: HBASE-5624.patch

 Aborting regionserver when splitting region, may cause daughter region not 
 assigned by ServerShutdownHandler.
 -

 Key: HBASE-5624
 URL: https://issues.apache.org/jira/browse/HBASE-5624
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: HBASE-5624.patch


 If a region is splitting while the regionserver is stopping, the following
 code may be executed in SplitTransaction#openDaughters:
 {code}
 // TODO: Is this check needed here?
 if (stopped || stopping) {
   // add 2nd daughter first (see HBASE-4335)
   MetaEditor.addDaughter(server.getCatalogTracker(),
       b.getRegionInfo(), null);
   MetaEditor.addDaughter(server.getCatalogTracker(),
       a.getRegionInfo(), null);
   LOG.info("Not opening daughters " +
       b.getRegionInfo().getRegionNameAsString() +
       " and " +
       a.getRegionInfo().getRegionNameAsString() +
       " because stopping=" + stopping + ", stopped=" + stopped);
 }
 {code}
 So the two daughter regions both end up with a null location in .META.
 When ServerShutdownHandler processes the dead server, it will not assign these
 two daughter regions, since their locations (info:server) are null in .META.
 and they are therefore missed by MetaReader.getServerUserRegions().

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5568) Multi concurrent flushcache() for one region could cause data loss

2012-03-16 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5568:


Attachment: HBASE-5568v2.patch

 Multi concurrent flushcache() for one region could cause data loss
 --

 Key: HBASE-5568
 URL: https://issues.apache.org/jira/browse/HBASE-5568
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: chunhui shen
Assignee: chunhui shen
 Fix For: 0.90.7, 0.92.2, 0.94.0, 0.96.0

 Attachments: HBASE-5568-90.patch, HBASE-5568.patch, HBASE-5568.patch, 
 HBASE-5568v2.patch, HBASE-5568v2.patch


 We can now call HRegion#flushcache() concurrently, through
 HRegionServer#splitRegion or HRegionServer#flushRegion via HBaseAdmin.
 However, we find that if HRegion#internalFlushcache() is called concurrently
 by multiple threads, HRegion.memstoreSize will be calculated wrongly.
 At the end of HRegion#internalFlushcache(), we do
 this.addAndGetGlobalMemstoreSize(-flushsize), but flushsize may not be the
 actual memstore size that was flushed to HDFS. This makes HRegion.memstoreSize
 go negative and prevents the final flush when we close the region.
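 A self-contained illustration of the double-decrement, using
 java.util.concurrent.atomic.AtomicLong (the values are invented; HRegion keeps
 this accounting in an AtomicLong):
 {code}
 AtomicLong memstoreSize = new AtomicLong(128L << 20);  // ~128m of edits
 long flushsizeA = memstoreSize.get();   // thread A snapshots 128m and flushes it
 long flushsizeB = memstoreSize.get();   // thread B snapshots 128m concurrently,
                                         // though little new data has arrived
 memstoreSize.addAndGet(-flushsizeA);    // A finishes: size drops to 0
 memstoreSize.addAndGet(-flushsizeB);    // B finishes: size goes to -128m
 {code}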
 Logs in RS for region e9d827913a056e696c39bc569ea3
 2012-03-11 16:31:36,690 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 128.0m
 2012-03-11 16:31:37,999 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/8162481165586107427, entries=153106, sequenceid=619316544, memsize=59.6m, filesize=31.2m
 2012-03-11 16:31:38,830 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 134.8m
 2012-03-11 16:31:39,458 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/3425971951499794221, entries=230183, sequenceid=619316544, memsize=68.5m, filesize=26.6m
 2012-03-11 16:31:39,459 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~128.1m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 2769ms, sequenceid=619316544, compaction requested=false
 2012-03-11 16:31:39,459 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 6.8m
 2012-03-11 16:31:39,529 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/1811012969998104626, entries=8002, sequenceid=619332759, memsize=3.1m, filesize=1.6m
 2012-03-11 16:31:39,640 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/770333473623552048, entries=12231, sequenceid=619332759, memsize=3.6m, filesize=1.4m
 2012-03-11 16:31:39,641 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~134.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 811ms, sequenceid=619332759, compaction requested=true
 2012-03-11 16:31:39,707 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/5656568849587368557, entries=119, sequenceid=619332979, memsize=47.4k, filesize=25.6k
 2012-03-11 16:31:39,775 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/794343845650987521, entries=157, sequenceid=619332979, memsize=47.8k, filesize=19.3k
 2012-03-11 16:31:39,777 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~6.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 318ms, sequenceid=619332979, compaction requested=true

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5571) Table will be disabling forever

2012-03-15 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5571:


Attachment: HBASE-5571v3.patch

Patch v3 changes unassign() in nodeDeleted() to invokeUnassign().

 Table will be disabling forever
 ---

 Key: HBASE-5571
 URL: https://issues.apache.org/jira/browse/HBASE-5571
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: BASE-5571v2.patch, HBASE-5571.patch, HBASE-5571v3.patch


 If we restart the master while it is disabling a table, the table will stay
 in the disabling state forever.
 In the current logic, the region CLOSE RPC will always return
 NotServingRegionException because the RS already closed the region before we
 restarted the master, so the table stays disabling forever because the region
 remains in RIT all along.
 In another case, AssignmentManager#rebuildUserRegions() puts parent regions
 into AssignmentManager.regions, so if we disable the table we can't close
 these parent regions until they are purged by CatalogJanitor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5568) Multi concurrent flushcache() for one region could cause data loss

2012-03-15 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5568:


Attachment: HBASE-5568v2.patch

Patch v2 for the test case:
change TestStore#testDeleteExpiredStoreFiles#ttl from 1 to 4.

 Multi concurrent flushcache() for one region could cause data loss
 --

 Key: HBASE-5568
 URL: https://issues.apache.org/jira/browse/HBASE-5568
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: chunhui shen
Assignee: chunhui shen
 Fix For: 0.90.7, 0.92.2, 0.94.0, 0.96.0

 Attachments: HBASE-5568-90.patch, HBASE-5568.patch, HBASE-5568.patch, 
 HBASE-5568v2.patch


 We can now call HRegion#flushcache() concurrently, through
 HRegionServer#splitRegion or HRegionServer#flushRegion via HBaseAdmin.
 However, we find that if HRegion#internalFlushcache() is called concurrently
 by multiple threads, HRegion.memstoreSize will be calculated wrongly.
 At the end of HRegion#internalFlushcache(), we do
 this.addAndGetGlobalMemstoreSize(-flushsize), but flushsize may not be the
 actual memstore size that was flushed to HDFS. This makes HRegion.memstoreSize
 go negative and prevents the final flush when we close the region.
 Logs in RS for region e9d827913a056e696c39bc569ea3
 2012-03-11 16:31:36,690 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 128.0m
 2012-03-11 16:31:37,999 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/8162481165586107427, entries=153106, sequenceid=619316544, memsize=59.6m, filesize=31.2m
 2012-03-11 16:31:38,830 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 134.8m
 2012-03-11 16:31:39,458 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/3425971951499794221, entries=230183, sequenceid=619316544, memsize=68.5m, filesize=26.6m
 2012-03-11 16:31:39,459 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~128.1m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 2769ms, sequenceid=619316544, compaction requested=false
 2012-03-11 16:31:39,459 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 6.8m
 2012-03-11 16:31:39,529 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/1811012969998104626, entries=8002, sequenceid=619332759, memsize=3.1m, filesize=1.6m
 2012-03-11 16:31:39,640 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/770333473623552048, entries=12231, sequenceid=619332759, memsize=3.6m, filesize=1.4m
 2012-03-11 16:31:39,641 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~134.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 811ms, sequenceid=619332759, compaction requested=true
 2012-03-11 16:31:39,707 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/5656568849587368557, entries=119, sequenceid=619332979, memsize=47.4k, filesize=25.6k
 2012-03-11 16:31:39,775 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/794343845650987521, entries=157, sequenceid=619332979, memsize=47.8k, filesize=19.3k
 2012-03-11 16:31:39,777 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~6.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 318ms, sequenceid=619332979, compaction requested=true

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5568) Multi concurrent flushcache() for one region could cause data loss

2012-03-14 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5568:


Attachment: HBASE-5568-90.patch

Submitted a patch for the 0.90 branch.

 Multi concurrent flushcache() for one region could cause data loss
 --

 Key: HBASE-5568
 URL: https://issues.apache.org/jira/browse/HBASE-5568
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: chunhui shen
Assignee: chunhui shen
 Fix For: 0.92.2, 0.94.0, 0.96.0

 Attachments: HBASE-5568-90.patch, HBASE-5568.patch


 We can now call HRegion#flushcache() concurrently, through
 HRegionServer#splitRegion or HRegionServer#flushRegion via HBaseAdmin.
 However, we find that if HRegion#internalFlushcache() is called concurrently
 by multiple threads, HRegion.memstoreSize will be calculated wrongly.
 At the end of HRegion#internalFlushcache(), we do
 this.addAndGetGlobalMemstoreSize(-flushsize), but flushsize may not be the
 actual memstore size that was flushed to HDFS. This makes HRegion.memstoreSize
 go negative and prevents the final flush when we close the region.
 Logs in RS for region e9d827913a056e696c39bc569ea3
 2012-03-11 16:31:36,690 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 128.0m
 2012-03-11 16:31:37,999 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/8162481165586107427, entries=153106, sequenceid=619316544, memsize=59.6m, filesize=31.2m
 2012-03-11 16:31:38,830 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 134.8m
 2012-03-11 16:31:39,458 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/3425971951499794221, entries=230183, sequenceid=619316544, memsize=68.5m, filesize=26.6m
 2012-03-11 16:31:39,459 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~128.1m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 2769ms, sequenceid=619316544, compaction requested=false
 2012-03-11 16:31:39,459 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 6.8m
 2012-03-11 16:31:39,529 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/1811012969998104626, entries=8002, sequenceid=619332759, memsize=3.1m, filesize=1.6m
 2012-03-11 16:31:39,640 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/770333473623552048, entries=12231, sequenceid=619332759, memsize=3.6m, filesize=1.4m
 2012-03-11 16:31:39,641 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~134.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 811ms, sequenceid=619332759, compaction requested=true
 2012-03-11 16:31:39,707 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/5656568849587368557, entries=119, sequenceid=619332979, memsize=47.4k, filesize=25.6k
 2012-03-11 16:31:39,775 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/794343845650987521, entries=157, sequenceid=619332979, memsize=47.8k, filesize=19.3k
 2012-03-11 16:31:39,777 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~6.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 318ms, sequenceid=619332979, compaction requested=true

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5571) Table will be disabling forever

2012-03-13 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5571:


Attachment: HBASE-5571.patch

 Table will be disabling forever
 ---

 Key: HBASE-5571
 URL: https://issues.apache.org/jira/browse/HBASE-5571
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: HBASE-5571.patch


 If we restart the master while it is disabling a table, the table will stay
 in the disabling state forever.
 In the current logic, the region CLOSE RPC will always return
 NotServingRegionException because the RS already closed the region before we
 restarted the master, so the table stays disabling forever because the region
 remains in RIT all along.
 In another case, AssignmentManager#rebuildUserRegions() puts parent regions
 into AssignmentManager.regions, so if we disable the table we can't close
 these parent regions until they are purged by CatalogJanitor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-03-13 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5270:


Attachment: HBASE-5270-92v11.patch

Patch v11 for 0.92.

 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5270
 URL: https://issues.apache.org/jira/browse/HBASE-5270
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Zhihong Yu
Assignee: chunhui shen
 Fix For: 0.92.2

 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 
 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 
 5270-testcasev2.patch, HBASE-5270-92v11.patch, HBASE-5270v11.patch, 
 hbase-5270.patch, hbase-5270v10.patch, hbase-5270v2.patch, 
 hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, 
 hbase-5270v7.patch, hbase-5270v8.patch, hbase-5270v9.patch, sampletest.txt


 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 This method param names are not right 'definitiveRootServer'; what is meant 
 by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if its carrying 
 root and meta?
 What is difference between asking assignment manager isCarryingRoot and this 
 variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed split log is configured, we will do in master single 
 process splitting under some conditions with this patch. Its not explained in 
 code why we would do this. Why do we think master log splitting 'high 
 priority' when it could very well be slower. Should we only go this route if 
 distributed splitting is not going on. Do we know if concurrent distributed 
 log splitting and master splitting works?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different to the patch for 0.90. Should go into trunk first 
 with tests, then 0.92. Should it be in this issue? This issue is really hard 
 to follow now. Maybe this issue is for 0.90.x and new issue for more work on 
 this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5571) Table will be disabling forever

2012-03-13 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5571:


Attachment: BASE-5571v2.patch

 Table will be disabling forever
 ---

 Key: HBASE-5571
 URL: https://issues.apache.org/jira/browse/HBASE-5571
 Project: HBase
  Issue Type: Bug
  Components: master, regionserver
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: BASE-5571v2.patch, HBASE-5571.patch


 If we restart the master while it is disabling a table, the table will stay
 in the disabling state forever.
 In the current logic, the region CLOSE RPC will always return
 NotServingRegionException because the RS already closed the region before we
 restarted the master, so the table stays disabling forever because the region
 remains in RIT all along.
 In another case, AssignmentManager#rebuildUserRegions() puts parent regions
 into AssignmentManager.regions, so if we disable the table we can't close
 these parent regions until they are purged by CatalogJanitor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5563) HRegionInfo#compareTo add the comparison of regionId

2012-03-12 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5563:


Attachment: HBASE-5563v2.patch

 HRegionInfo#compareTo add the comparison of regionId
 

 Key: HBASE-5563
 URL: https://issues.apache.org/jira/browse/HBASE-5563
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: HBASE-5563.patch, HBASE-5563v2.patch


 In the case where one region is assigned multiple times, we can find two
 regions with the same table name, the same startKey, the same endKey, and
 different regionIds, so these two regions are equal in a TreeMap but
 different in a HashMap.
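 A sketch of the intended fix (simplified and illustrative; the real
 HRegionInfo#compareTo also compares other fields, and the actual patch may
 differ):
 {code}
 @Override
 public int compareTo(HRegionInfo o) {
   int result = Bytes.compareTo(this.getTableName(), o.getTableName());
   if (result != 0) return result;
   result = Bytes.compareTo(this.getStartKey(), o.getStartKey());
   if (result != 0) return result;
   result = Bytes.compareTo(this.getEndKey(), o.getEndKey());
   if (result != 0) return result;
   // New: break ties on regionId so two incarnations of the same key range
   // are no longer equal in a TreeMap while unequal in a HashMap.
   if (this.regionId == o.regionId) return 0;
   return this.regionId < o.regionId ? -1 : 1;
 }
 {code}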

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5568) Multi concurrent flushcache() for one region could cause data loss

2012-03-12 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5568:


Attachment: HBASE-5568.patch

 Multi concurrent flushcache() for one region could cause data loss
 --

 Key: HBASE-5568
 URL: https://issues.apache.org/jira/browse/HBASE-5568
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: HBASE-5568.patch


 We can now call HRegion#flushcache() concurrently, through
 HRegionServer#splitRegion or HRegionServer#flushRegion via HBaseAdmin.
 However, we find that if HRegion#internalFlushcache() is called concurrently
 by multiple threads, HRegion.memstoreSize will be calculated wrongly.
 At the end of HRegion#internalFlushcache(), we do
 this.addAndGetGlobalMemstoreSize(-flushsize), but flushsize may not be the
 actual memstore size that was flushed to HDFS. This makes HRegion.memstoreSize
 go negative and prevents the final flush when we close the region.
 Logs in RS for region e9d827913a056e696c39bc569ea3
 2012-03-11 16:31:36,690 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 128.0m
 2012-03-11 16:31:37,999 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/8162481165586107427, entries=153106, sequenceid=619316544, memsize=59.6m, filesize=31.2m
 2012-03-11 16:31:38,830 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 134.8m
 2012-03-11 16:31:39,458 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/3425971951499794221, entries=230183, sequenceid=619316544, memsize=68.5m, filesize=26.6m
 2012-03-11 16:31:39,459 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~128.1m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 2769ms, sequenceid=619316544, compaction requested=false
 2012-03-11 16:31:39,459 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f., current region memstore size 6.8m
 2012-03-11 16:31:39,529 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/1811012969998104626, entries=8002, sequenceid=619332759, memsize=3.1m, filesize=1.6m
 2012-03-11 16:31:39,640 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/770333473623552048, entries=12231, sequenceid=619332759, memsize=3.6m, filesize=1.4m
 2012-03-11 16:31:39,641 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~134.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 811ms, sequenceid=619332759, compaction requested=true
 2012-03-11 16:31:39,707 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf1/5656568849587368557, entries=119, sequenceid=619332979, memsize=47.4k, filesize=25.6k
 2012-03-11 16:31:39,775 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://dw74.kgb.sqa.cm4:9700/hbase-func1/writetest1/e9d827913a056e696c39bc569ea3f99f/cf2/794343845650987521, entries=157, sequenceid=619332979, memsize=47.8k, filesize=19.3k
 2012-03-11 16:31:39,777 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~6.8m for region writetest1,,1331454657410.e9d827913a056e696c39bc569ea3f99f. in 318ms, sequenceid=619332979, compaction requested=true

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-03-12 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5270:


Attachment: HBASE-5270v11.patch

Minor review items addressed in patch v11:
https://reviews.apache.org/r/4021/

 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5270
 URL: https://issues.apache.org/jira/browse/HBASE-5270
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Zhihong Yu
Assignee: chunhui shen
 Fix For: 0.92.2

 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 
 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 
 5270-testcasev2.patch, HBASE-5270v11.patch, hbase-5270.patch, 
 hbase-5270v10.patch, hbase-5270v2.patch, hbase-5270v4.patch, 
 hbase-5270v5.patch, hbase-5270v6.patch, hbase-5270v7.patch, 
 hbase-5270v8.patch, hbase-5270v9.patch, sampletest.txt


 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 This method param names are not right 'definitiveRootServer'; what is meant 
 by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if its carrying 
 root and meta?
 What is difference between asking assignment manager isCarryingRoot and this 
 variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed split log is configured, we will do in master single 
 process splitting under some conditions with this patch. Its not explained in 
 code why we would do this. Why do we think master log splitting 'high 
 priority' when it could very well be slower. Should we only go this route if 
 distributed splitting is not going on. Do we know if concurrent distributed 
 log splitting and master splitting works?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different to the patch for 0.90. Should go into trunk first 
 with tests, then 0.92. Should it be in this issue? This issue is really hard 
 to follow now. Maybe this issue is for 0.90.x and new issue for more work on 
 this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5563) HRegionInfo#compareTo add the comparison of regionId

2012-03-11 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5563:


Attachment: HBASE-5563.patch

 HRegionInfo#compareTo add the comparison of regionId
 

 Key: HBASE-5563
 URL: https://issues.apache.org/jira/browse/HBASE-5563
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: HBASE-5563.patch


 In the case where one region is assigned multiple times, we can find two
 regions with the same table name, the same startKey, the same endKey, and
 different regionIds, so these two regions are equal in a TreeMap but
 different in a HashMap.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5528) Retry splitting log if failed in the process of ServerShutdownHandler, and abort master when retries exhausted

2012-03-06 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5528:


Attachment: hbase-5528v3.patch

 Retry splitting log if failed in the process of ServerShutdownHandler, and 
 abort master when retries exhausted
 --

 Key: HBASE-5528
 URL: https://issues.apache.org/jira/browse/HBASE-5528
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-5528.patch, hbase-5528v2.patch, hbase-5528v3.patch


 We retry log splitting if it fails in splitLogAfterStartup when the master
 starts.
 However, there is no retry for failed log splitting in the process of
 ServerShutdownHandler.
 Also, if we ultimately fail to split the log, we should abort the master even
 if the filesystem is OK, to prevent data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5528) Change retrying splitting log forever if throws IOException to numbered times, and abort master when retries exhausted

2012-03-06 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5528:


Description: 
In current log-splitting retry logic, it will retry forever if throws 
IOException, I think we'd better change it to numbered times, and abort master 
when retries exhausted.


  was:
We retry log splitting if it fails in splitLogAfterStartup when the master
starts.
However, there is no retry for failed log splitting in the process of
ServerShutdownHandler.

Also, if we ultimately fail to split the log, we should abort the master even
if the filesystem is OK, to prevent data loss.

Summary: Change retrying splitting log forever  if throws IOException 
to numbered times, and abort master when retries exhausted  (was: Retry 
splitting log if failed in the process of ServerShutdownHandler, and abort 
master when retries exhausted)

 Change retrying splitting log forever  if throws IOException to numbered 
 times, and abort master when retries exhausted
 ---

 Key: HBASE-5528
 URL: https://issues.apache.org/jira/browse/HBASE-5528
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-5528.patch, hbase-5528v2.patch, hbase-5528v3.patch


 In the current log-splitting retry logic, we retry forever if an IOException
 is thrown. I think we'd better change this to a bounded number of retries and
 abort the master when the retries are exhausted.
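 A minimal sketch of bounded retries (the configuration key, helper names, and
 structure here are assumptions for illustration, not necessarily what the
 patch uses):
 {code}
 int maxRetries = conf.getInt("hbase.master.splitlog.max.retries", 10); // hypothetical key
 IOException lastException = null;
 for (int i = 0; i < maxRetries; i++) {
   try {
     splitLog(serverName);   // may throw IOException
     return;                 // success
   } catch (IOException e) {
     lastException = e;
     LOG.warn("Log split failed, attempt " + (i + 1) + " of " + maxRetries, e);
   }
 }
 // Retries exhausted: abort rather than risk serving regions with lost edits.
 abort("Failed to split log for " + serverName + " after "
     + maxRetries + " retries", lastException);
 {code}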

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5528) Retry splitting log if failed in the process of ServerShutdownHandler, and abort master when retries exhausted

2012-03-05 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5528:


Attachment: hbase-5528.patch

 Retry splitting log if failed in the process of ServerShutdownHandler, and 
 abort master when retries exhausted
 --

 Key: HBASE-5528
 URL: https://issues.apache.org/jira/browse/HBASE-5528
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-5528.patch


 We retry log splitting if it fails in splitLogAfterStartup when the master
 starts.
 However, there is no retry for failed log splitting in the process of
 ServerShutdownHandler.
 Also, if we ultimately fail to split the log, we should abort the master even
 if the filesystem is OK, to prevent data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5528) Retry splitting log if failed in the process of ServerShutdownHandler, and abort master when retries exhausted

2012-03-05 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5528:


Attachment: hbase-5528v2.patch

Patch v2: add Runtime.getRuntime().halt(1) after the filesystem check fails.
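Sketched behavior (structure assumed for illustration; checkFileSystem() stands
in for whatever filesystem check the patch actually uses):
{code}
if (!checkFileSystem()) {
  // The filesystem itself is bad: a clean abort may hang on HDFS calls,
  // so hard-stop the JVM immediately, skipping shutdown hooks.
  Runtime.getRuntime().halt(1);
}
{code}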

 Retry splitting log if failed in the process of ServerShutdownHandler, and 
 abort master when retries exhausted
 --

 Key: HBASE-5528
 URL: https://issues.apache.org/jira/browse/HBASE-5528
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-5528.patch, hbase-5528v2.patch


 We retry log splitting if it fails in splitLogAfterStartup when the master
 starts.
 However, there is no retry for failed log splitting in the process of
 ServerShutdownHandler.
 Also, if we ultimately fail to split the log, we should abort the master even
 if the filesystem is OK, to prevent data loss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-03-04 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5270:


Attachment: hbase-5270v10.patch

 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5270
 URL: https://issues.apache.org/jira/browse/HBASE-5270
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Zhihong Yu
Assignee: chunhui shen
 Fix For: 0.92.2

 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 
 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 
 5270-testcasev2.patch, hbase-5270.patch, hbase-5270v10.patch, 
 hbase-5270v2.patch, hbase-5270v4.patch, hbase-5270v5.patch, 
 hbase-5270v6.patch, hbase-5270v7.patch, hbase-5270v8.patch, 
 hbase-5270v9.patch, sampletest.txt


 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 These method param names are not right: 'definitiveRootServer'; what is 
 meant by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if it's 
 carrying root and meta?
 What is the difference between asking the assignment manager isCarryingRoot 
 and this variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed log splitting is configured, with this patch we will do 
 single-process splitting in the master under some conditions. It's not 
 explained in the code why we would do this. Why do we think master log 
 splitting is 'high priority' when it could very well be slower? Should we 
 only go this route if distributed splitting is not going on? Do we know 
 whether concurrent distributed log splitting and master splitting works?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different to the patch for 0.90. Should go into trunk first 
 with tests, then 0.92. Should it be in this issue? This issue is really hard 
 to follow now. Maybe this issue is for 0.90.x and new issue for more work on 
 this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5501) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-03-01 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5501:


Attachment: hbase-5501.patch

In hbase-5501.patch, we prevent processing of SSH before ROOT and META have 
been assigned, and we use the same onlineServers for splitLog and joinCluster 
during initialization.
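
As a hedged illustration of that guard (waitForRootAndMetaAssigned is an 
assumed helper name, not necessarily what the patch uses):
{code}
// Illustrative only: ServerShutdownHandler waits until ROOT and META have
// been assigned before processing a dead server, so a META region cannot
// be assigned while its logs are still unsplit.
@Override
public void process() throws IOException {
  this.services.getAssignmentManager().waitForRootAndMetaAssigned(); // assumed
  // ... existing log splitting and region reassignment follow ...
}
{code}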

 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5501
 URL: https://issues.apache.org/jira/browse/HBASE-5501
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-5501.patch


 In a live cluster, we do the following steps:
 1. kill the master;
 2. start the master; the master begins initializing;
 3. the master completes splitLog;
 4. kill the META server;
 5. the master starts assigning ROOT and META;
 6. now META region data can be lost, since we may assign the META region 
 before SSH finishes splitting the logs of the dead META server.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-03-01 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5270:


Attachment: hbase-5270v9.patch

 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5270
 URL: https://issues.apache.org/jira/browse/HBASE-5270
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Zhihong Yu
Assignee: chunhui shen
 Fix For: 0.92.1, 0.94.0

 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 
 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 
 5270-testcasev2.patch, hbase-5270.patch, hbase-5270v2.patch, 
 hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, 
 hbase-5270v7.patch, hbase-5270v8.patch, hbase-5270v9.patch, sampletest.txt


 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 These method param names are not right: 'definitiveRootServer'; what is 
 meant by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if it's 
 carrying root and meta?
 What is the difference between asking the assignment manager isCarryingRoot 
 and this variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed log splitting is configured, with this patch we will do 
 single-process splitting in the master under some conditions. It's not 
 explained in the code why we would do this. Why do we think master log 
 splitting is 'high priority' when it could very well be slower? Should we 
 only go this route if distributed splitting is not going on? Do we know 
 whether concurrent distributed log splitting and master splitting works?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different to the patch for 0.90. Should go into trunk first 
 with tests, then 0.92. Should it be in this issue? This issue is really hard 
 to follow now. Maybe this issue is for 0.90.x and new issue for more work on 
 this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5454) Refuse operations from Admin before master is initialized

2012-02-29 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5454:


Summary: Refuse operations from Admin before master is initialized  (was: 
Refuse operations from Admin befor master is initialized)

 Refuse operations from Admin before master is initialized
 -

 Key: HBASE-5454
 URL: https://issues.apache.org/jira/browse/HBASE-5454
 Project: HBase
  Issue Type: Improvement
Reporter: chunhui shen
 Attachments: hbase-5454.patch, hbase-5454v2.patch


 In our testing environment, while the master was initializing we found 
 conflicts between master#assignAllUserRegions and the EnableTable event: 
 assigning a region threw an exception, so the master aborted itself.
 We think we had better refuse operations from Admin, such as CreateTable, 
 EnableTable, etc., until initialization completes; it would reduce such 
 errors.
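
 A sketch of the kind of guard being proposed; the choice of 
 PleaseHoldException and the exact set of guarded methods are assumptions 
 here, not the committed change:
{code}
// Hypothetical guard at the top of each admin-facing HMaster method.
private void checkInitialized() throws PleaseHoldException {
  if (!isInitialized()) {
    throw new PleaseHoldException("Master is initializing");
  }
}

public void createTable(HTableDescriptor desc, byte[][] splitKeys)
    throws IOException {
  checkInitialized(); // refuse CreateTable, EnableTable, etc. until ready
  // ... existing createTable logic ...
}
{code}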

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-27 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5270:


Attachment: hbase-5270v7.patch

 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5270
 URL: https://issues.apache.org/jira/browse/HBASE-5270
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Zhihong Yu
Assignee: chunhui shen
 Fix For: 0.92.1, 0.94.0

 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 
 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 
 5270-testcasev2.patch, hbase-5270.patch, hbase-5270v2.patch, 
 hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, 
 hbase-5270v7.patch, sampletest.txt


 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 These method param names are not right: 'definitiveRootServer'; what is 
 meant by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if it's 
 carrying root and meta?
 What is the difference between asking the assignment manager isCarryingRoot 
 and this variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed log splitting is configured, with this patch we will do 
 single-process splitting in the master under some conditions. It's not 
 explained in the code why we would do this. Why do we think master log 
 splitting is 'high priority' when it could very well be slower? Should we 
 only go this route if distributed splitting is not going on? Do we know 
 whether concurrent distributed log splitting and master splitting works?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different to the patch for 0.90. Should go into trunk first 
 with tests, then 0.92. Should it be in this issue? This issue is really hard 
 to follow now. Maybe this issue is for 0.90.x and new issue for more work on 
 this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-27 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5270:


Attachment: hbase-5270v8.patch

 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5270
 URL: https://issues.apache.org/jira/browse/HBASE-5270
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Zhihong Yu
Assignee: chunhui shen
 Fix For: 0.92.1, 0.94.0

 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 
 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 
 5270-testcasev2.patch, hbase-5270.patch, hbase-5270v2.patch, 
 hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, 
 hbase-5270v7.patch, hbase-5270v8.patch, sampletest.txt


 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 These method param names are not right: 'definitiveRootServer'; what is 
 meant by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if it's 
 carrying root and meta?
 What is the difference between asking the assignment manager isCarryingRoot 
 and this variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed log splitting is configured, with this patch we will do 
 single-process splitting in the master under some conditions. It's not 
 explained in the code why we would do this. Why do we think master log 
 splitting is 'high priority' when it could very well be slower? Should we 
 only go this route if distributed splitting is not going on? Do we know 
 whether concurrent distributed log splitting and master splitting works?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different to the patch for 0.90. Should go into trunk first 
 with tests, then 0.92. Should it be in this issue? This issue is really hard 
 to follow now. Maybe this issue is for 0.90.x and new issue for more work on 
 this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-23 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5270:


Attachment: hbase-5270v5.patch

 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5270
 URL: https://issues.apache.org/jira/browse/HBASE-5270
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Zhihong Yu
Assignee: chunhui shen
 Fix For: 0.92.1, 0.94.0

 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 
 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 
 5270-testcasev2.patch, hbase-5270.patch, hbase-5270v2.patch, 
 hbase-5270v4.patch, hbase-5270v5.patch, sampletest.txt


 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 These method param names are not right: 'definitiveRootServer'; what is 
 meant by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if it's 
 carrying root and meta?
 What is the difference between asking the assignment manager isCarryingRoot 
 and this variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed log splitting is configured, with this patch we will do 
 single-process splitting in the master under some conditions. It's not 
 explained in the code why we would do this. Why do we think master log 
 splitting is 'high priority' when it could very well be slower? Should we 
 only go this route if distributed splitting is not going on? Do we know 
 whether concurrent distributed log splitting and master splitting works?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different to the patch for 0.90. Should go into trunk first 
 with tests, then 0.92. Should it be in this issue? This issue is really hard 
 to follow now. Maybe this issue is for 0.90.x and new issue for more work on 
 this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-23 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5270:


Attachment: hbase-5270v6.patch

I'm sorry for the ConcurrentHashSet mistake.
Thanks, Ted.

 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5270
 URL: https://issues.apache.org/jira/browse/HBASE-5270
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Zhihong Yu
Assignee: chunhui shen
 Fix For: 0.92.1, 0.94.0

 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 
 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 
 5270-testcasev2.patch, hbase-5270.patch, hbase-5270v2.patch, 
 hbase-5270v4.patch, hbase-5270v5.patch, hbase-5270v6.patch, sampletest.txt


 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 These method param names are not right: 'definitiveRootServer'; what is 
 meant by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if it's 
 carrying root and meta?
 What is the difference between asking the assignment manager isCarryingRoot 
 and this variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed log splitting is configured, with this patch we will do 
 single-process splitting in the master under some conditions. It's not 
 explained in the code why we would do this. Why do we think master log 
 splitting is 'high priority' when it could very well be slower? Should we 
 only go this route if distributed splitting is not going on? Do we know 
 whether concurrent distributed log splitting and master splitting works?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different to the patch for 0.90. Should go into trunk first 
 with tests, then 0.92. Should it be in this issue? This issue is really hard 
 to follow now. Maybe this issue is for 0.90.x and new issue for more work on 
 this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5454) Refuse operations from Admin befor master is initialized

2012-02-22 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5454:


Attachment: hbase-5454.patch

 Refuse operations from Admin befor master is initialized
 

 Key: HBASE-5454
 URL: https://issues.apache.org/jira/browse/HBASE-5454
 Project: HBase
  Issue Type: Improvement
Reporter: chunhui shen
 Attachments: hbase-5454.patch


 In our testing environment, while the master was initializing we found 
 conflicts between master#assignAllUserRegions and the EnableTable event: 
 assigning a region threw an exception, so the master aborted itself.
 We think we had better refuse operations from Admin, such as CreateTable, 
 EnableTable, etc., until initialization completes; it would reduce such 
 errors.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5422) StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins)

2012-02-22 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5422:


Attachment: hbase-5422v3.patch

Thanks for Ted's review.

 StartupBulkAssigner would cause a lot of timeout on RIT when assigning large 
 numbers of regions (timeout = 3 mins)
 --

 Key: HBASE-5422
 URL: https://issues.apache.org/jira/browse/HBASE-5422
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: chunhui shen
 Attachments: 5422-90.patch, 5422-90v2.patch, hbase-5422.patch, 
 hbase-5422v2.patch, hbase-5422v3.patch


 In our production environment we see a lot of RIT timeouts when the cluster 
 comes up; there are about 70,000 regions in the cluster (25 regionservers).
 First, consider the following log for region 33cf229845b1009aa8a3f7b0f85c9bd0:
 master's log
 2012-02-13 18:07:41,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Async create of unassigned node for 
 33cf229845b1009aa8a3f7b0f85c9bd0 with OFFLINE state 
 2012-02-13 18:07:42,560 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback:
  rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=OFFLINE, ts=1329127661409, 
 server=r03f11025.yh.aliyun.com,60020,1329127549907 
 2012-02-13 18:07:42,996 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback:
  rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=OFFLINE, ts=1329127661409 
 2012-02-13 18:10:48,072 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=PENDING_OPEN, ts=1329127662996
 2012-02-13 18:10:48,072 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_OPEN for too long, reassigning 
 region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 2012-02-13 18:11:16,744 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, 
 server=r03f11025.yh.aliyun.com,60020,1329127549907, 
 region=33cf229845b1009aa8a3f7b0f85c9bd0 
 2012-02-13 18:38:07,310 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 33cf229845b1009aa8a3f7b0f85c9bd0; deleting unassigned node 
 2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Deleting existing unassigned node for 
 33cf229845b1009aa8a3f7b0f85c9bd0 that is in expected state 
 RS_ZK_REGION_OPENED 
 2012-02-13 18:38:07,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Successfully deleted unassigned node for region 
 33cf229845b1009aa8a3f7b0f85c9bd0 in expected state RS_ZK_REGION_OPENED 
 2012-02-13 18:38:07,573 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. on 
 r03f11025.yh.aliyun.com,60020,1329127549907 
 2012-02-13 18:50:54,428 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan 
 was found (or we are ignoring an existing plan) for 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. so 
 generated a random one; 
 hri=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0., 
 src=, dest=r01b05043.yh.aliyun.com,60020,1329127549041; 29 (online=29, 
 exclude=null) available servers 
 2012-02-13 18:50:54,428 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. to 
 r01b05043.yh.aliyun.com,60020,1329127549041 
 2012-02-13 19:31:50,514 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=PENDING_OPEN, ts=1329132528086 
 2012-02-13 19:31:50,514 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_OPEN for too long, reassigning 
 region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 Regionserver's log
 2012-02-13 18:07:43,537 INFO 
 org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
 region: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 2012-02-13 18:11:16,560 DEBUG 
 org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing 
 open of item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 From the RS's log, we can see that more than 3 minutes pass between 
 receiving the openRegion request and starting to process it, causing the RIT 
 timeout in the master for the region.

[jira] [Updated] (HBASE-5422) StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins)

2012-02-22 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5422:


Attachment: 5422-90v3.patch

 StartupBulkAssigner would cause a lot of timeout on RIT when assigning large 
 numbers of regions (timeout = 3 mins)
 --

 Key: HBASE-5422
 URL: https://issues.apache.org/jira/browse/HBASE-5422
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: chunhui shen
 Attachments: 5422-90.patch, 5422-90v2.patch, 5422-90v3.patch, 
 hbase-5422.patch, hbase-5422v2.patch, hbase-5422v3.patch


 In our production environment we see a lot of RIT timeouts when the cluster 
 comes up; there are about 70,000 regions in the cluster (25 regionservers).
 First, consider the following log for region 33cf229845b1009aa8a3f7b0f85c9bd0:
 master's log
 2012-02-13 18:07:41,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Async create of unassigned node for 
 33cf229845b1009aa8a3f7b0f85c9bd0 with OFFLINE state 
 2012-02-13 18:07:42,560 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback:
  rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=OFFLINE, ts=1329127661409, 
 server=r03f11025.yh.aliyun.com,60020,1329127549907 
 2012-02-13 18:07:42,996 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback:
  rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=OFFLINE, ts=1329127661409 
 2012-02-13 18:10:48,072 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=PENDING_OPEN, ts=1329127662996
 2012-02-13 18:10:48,072 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_OPEN for too long, reassigning 
 region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 2012-02-13 18:11:16,744 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, 
 server=r03f11025.yh.aliyun.com,60020,1329127549907, 
 region=33cf229845b1009aa8a3f7b0f85c9bd0 
 2012-02-13 18:38:07,310 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 33cf229845b1009aa8a3f7b0f85c9bd0; deleting unassigned node 
 2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Deleting existing unassigned node for 
 33cf229845b1009aa8a3f7b0f85c9bd0 that is in expected state 
 RS_ZK_REGION_OPENED 
 2012-02-13 18:38:07,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Successfully deleted unassigned node for region 
 33cf229845b1009aa8a3f7b0f85c9bd0 in expected state RS_ZK_REGION_OPENED 
 2012-02-13 18:38:07,573 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. on 
 r03f11025.yh.aliyun.com,60020,1329127549907 
 2012-02-13 18:50:54,428 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan 
 was found (or we are ignoring an existing plan) for 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. so 
 generated a random one; 
 hri=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0., 
 src=, dest=r01b05043.yh.aliyun.com,60020,1329127549041; 29 (online=29, 
 exclude=null) available servers 
 2012-02-13 18:50:54,428 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. to 
 r01b05043.yh.aliyun.com,60020,1329127549041 
 2012-02-13 19:31:50,514 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=PENDING_OPEN, ts=1329132528086 
 2012-02-13 19:31:50,514 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_OPEN for too long, reassigning 
 region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 Regionserver's log
 2012-02-13 18:07:43,537 INFO 
 org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
 region: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 2012-02-13 18:11:16,560 DEBUG 
 org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing 
 open of item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 From the RS's log, we can see that more than 3 minutes pass between 
 receiving the openRegion request and starting to process it, causing the RIT 
 timeout in the master for the region.
 Let's see the code of 

[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-22 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5270:


Attachment: hbase-5270v4.patch

 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5270
 URL: https://issues.apache.org/jira/browse/HBASE-5270
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Zhihong Yu
 Fix For: 0.92.1, 0.94.0

 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 
 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 
 5270-testcasev2.patch, hbase-5270.patch, hbase-5270v2.patch, 
 hbase-5270v4.patch, sampletest.txt


 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 These method param names are not right: 'definitiveRootServer'; what is 
 meant by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if it's 
 carrying root and meta?
 What is the difference between asking the assignment manager isCarryingRoot 
 and this variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed log splitting is configured, with this patch we will do 
 single-process splitting in the master under some conditions. It's not 
 explained in the code why we would do this. Why do we think master log 
 splitting is 'high priority' when it could very well be slower? Should we 
 only go this route if distributed splitting is not going on? Do we know 
 whether concurrent distributed log splitting and master splitting works?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different to the patch for 0.90. Should go into trunk first 
 with tests, then 0.92. Should it be in this issue? This issue is really hard 
 to follow now. Maybe this issue is for 0.90.x and new issue for more work on 
 this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-21 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5270:


Attachment: 5270-90v3.patch

Incorporated Stack's comments in 5270-90v3.patch.

 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5270
 URL: https://issues.apache.org/jira/browse/HBASE-5270
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Zhihong Yu
 Fix For: 0.94.0, 0.92.1

 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 
 5270-90.patch, 5270-90v2.patch, 5270-90v3.patch, 5270-testcase.patch, 
 5270-testcasev2.patch, hbase-5270.patch, hbase-5270v2.patch, sampletest.txt


 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 These method param names are not right: 'definitiveRootServer'; what is 
 meant by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if it's 
 carrying root and meta?
 What is the difference between asking the assignment manager isCarryingRoot 
 and this variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed log splitting is configured, with this patch we will do 
 single-process splitting in the master under some conditions. It's not 
 explained in the code why we would do this. Why do we think master log 
 splitting is 'high priority' when it could very well be slower? Should we 
 only go this route if distributed splitting is not going on? Do we know 
 whether concurrent distributed log splitting and master splitting works?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different to the patch for 0.90. Should go into trunk first 
 with tests, then 0.92. Should it be in this issue? This issue is really hard 
 to follow now. Maybe this issue is for 0.90.x and new issue for more work on 
 this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5422) StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins)

2012-02-20 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5422:


Attachment: 5422-90v2.patch
hbase-5422v2.patch

Made an addPlan method that takes a Map of plans in patch v2.
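
A rough sketch of such a batched method, assuming the regionPlans map kept by 
AssignmentManager; the name addPlans and the locking shown are illustrative:
{code}
// Illustrative sketch: install all startup region plans under a single
// synchronization instead of one lock round-trip per region.
void addPlans(Map<String, RegionPlan> plans) {
  synchronized (this.regionPlans) {
    this.regionPlans.putAll(plans);
  }
}
{code}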

 StartupBulkAssigner would cause a lot of timeout on RIT when assigning large 
 numbers of regions (timeout = 3 mins)
 --

 Key: HBASE-5422
 URL: https://issues.apache.org/jira/browse/HBASE-5422
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: chunhui shen
 Attachments: 5422-90.patch, 5422-90v2.patch, hbase-5422.patch, 
 hbase-5422v2.patch


 In our production environment we see a lot of RIT timeouts when the cluster 
 comes up; there are about 70,000 regions in the cluster (25 regionservers).
 First, consider the following log for region 33cf229845b1009aa8a3f7b0f85c9bd0:
 master's log
 2012-02-13 18:07:41,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Async create of unassigned node for 
 33cf229845b1009aa8a3f7b0f85c9bd0 with OFFLINE state 
 2012-02-13 18:07:42,560 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback:
  rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=OFFLINE, ts=1329127661409, 
 server=r03f11025.yh.aliyun.com,60020,1329127549907 
 2012-02-13 18:07:42,996 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback:
  rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=OFFLINE, ts=1329127661409 
 2012-02-13 18:10:48,072 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=PENDING_OPEN, ts=1329127662996
 2012-02-13 18:10:48,072 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_OPEN for too long, reassigning 
 region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 2012-02-13 18:11:16,744 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, 
 server=r03f11025.yh.aliyun.com,60020,1329127549907, 
 region=33cf229845b1009aa8a3f7b0f85c9bd0 
 2012-02-13 18:38:07,310 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 33cf229845b1009aa8a3f7b0f85c9bd0; deleting unassigned node 
 2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Deleting existing unassigned node for 
 33cf229845b1009aa8a3f7b0f85c9bd0 that is in expected state 
 RS_ZK_REGION_OPENED 
 2012-02-13 18:38:07,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Successfully deleted unassigned node for region 
 33cf229845b1009aa8a3f7b0f85c9bd0 in expected state RS_ZK_REGION_OPENED 
 2012-02-13 18:38:07,573 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. on 
 r03f11025.yh.aliyun.com,60020,1329127549907 
 2012-02-13 18:50:54,428 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan 
 was found (or we are ignoring an existing plan) for 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. so 
 generated a random one; 
 hri=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0., 
 src=, dest=r01b05043.yh.aliyun.com,60020,1329127549041; 29 (online=29, 
 exclude=null) available servers 
 2012-02-13 18:50:54,428 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. to 
 r01b05043.yh.aliyun.com,60020,1329127549041 
 2012-02-13 19:31:50,514 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=PENDING_OPEN, ts=1329132528086 
 2012-02-13 19:31:50,514 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_OPEN for too long, reassigning 
 region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 Regionserver's log
 2012-02-13 18:07:43,537 INFO 
 org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
 region: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 2012-02-13 18:11:16,560 DEBUG 
 org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing 
 open of item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 From the RS's log, we can see that more than 3 minutes pass between 
 receiving the openRegion request and starting to process it, 

[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-16 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5270:


Attachment: 5270-testcasev2.patch
hbase-5270v2.patch

Optimized the testcase following Stack's sample.
hbase-5270v2 is a patch that fixes the issue for trunk, including a testcase.

 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5270
 URL: https://issues.apache.org/jira/browse/HBASE-5270
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Zhihong Yu
 Fix For: 0.94.0, 0.92.1

 Attachments: 5270-90-testcase.patch, 5270-90.patch, 
 5270-testcase.patch, 5270-testcasev2.patch, hbase-5270.patch, 
 hbase-5270v2.patch, sampletest.txt


 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 These method param names are not right: 'definitiveRootServer'; what is 
 meant by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if it's 
 carrying root and meta?
 What is the difference between asking the assignment manager isCarryingRoot 
 and this variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed log splitting is configured, with this patch we will do 
 single-process splitting in the master under some conditions. It's not 
 explained in the code why we would do this. Why do we think master log 
 splitting is 'high priority' when it could very well be slower? Should we 
 only go this route if distributed splitting is not going on? Do we know 
 whether concurrent distributed log splitting and master splitting works?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different to the patch for 0.90. Should go into trunk first 
 with tests, then 0.92. Should it be in this issue? This issue is really hard 
 to follow now. Maybe this issue is for 0.90.x and new issue for more work on 
 this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-16 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5270:


Attachment: 5270-90v2.patch
5270-90-testcasev2.patch

Testcase and patch for the 0.90 version

 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5270
 URL: https://issues.apache.org/jira/browse/HBASE-5270
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Zhihong Yu
 Fix For: 0.94.0, 0.92.1

 Attachments: 5270-90-testcase.patch, 5270-90-testcasev2.patch, 
 5270-90.patch, 5270-90v2.patch, 5270-testcase.patch, 5270-testcasev2.patch, 
 hbase-5270.patch, hbase-5270v2.patch, sampletest.txt


 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 These method param names are not right: 'definitiveRootServer'; what is 
 meant by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if it's 
 carrying root and meta?
 What is the difference between asking the assignment manager isCarryingRoot 
 and this variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed log splitting is configured, with this patch we will do 
 single-process splitting in the master under some conditions. It's not 
 explained in the code why we would do this. Why do we think master log 
 splitting is 'high priority' when it could very well be slower? Should we 
 only go this route if distributed splitting is not going on? Do we know 
 whether concurrent distributed log splitting and master splitting works?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different to the patch for 0.90. Should go into trunk first 
 with tests, then 0.92. Should it be in this issue? This issue is really hard 
 to follow now. Maybe this issue is for 0.90.x and new issue for more work on 
 this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5422) StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins)

2012-02-16 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5422:


Attachment: hbase-5422.patch

 StartupBulkAssigner would cause a lot of timeout on RIT when assigning large 
 numbers of regions (timeout = 3 mins)
 --

 Key: HBASE-5422
 URL: https://issues.apache.org/jira/browse/HBASE-5422
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: chunhui shen
 Attachments: hbase-5422.patch


 In our production environment we see a lot of RIT timeouts when the cluster 
 comes up; there are about 70,000 regions in the cluster (25 regionservers).
 First, consider the following log for region 33cf229845b1009aa8a3f7b0f85c9bd0:
 master's log
 2012-02-13 18:07:41,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Async create of unassigned node for 
 33cf229845b1009aa8a3f7b0f85c9bd0 with OFFLINE state 
 2012-02-13 18:07:42,560 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback:
  rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=OFFLINE, ts=1329127661409, 
 server=r03f11025.yh.aliyun.com,60020,1329127549907 
 2012-02-13 18:07:42,996 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback:
  rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=OFFLINE, ts=1329127661409 
 2012-02-13 18:10:48,072 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=PENDING_OPEN, ts=1329127662996
 2012-02-13 18:10:48,072 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_OPEN for too long, reassigning 
 region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 2012-02-13 18:11:16,744 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, 
 server=r03f11025.yh.aliyun.com,60020,1329127549907, 
 region=33cf229845b1009aa8a3f7b0f85c9bd0 
 2012-02-13 18:38:07,310 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 33cf229845b1009aa8a3f7b0f85c9bd0; deleting unassigned node 
 2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Deleting existing unassigned node for 
 33cf229845b1009aa8a3f7b0f85c9bd0 that is in expected state 
 RS_ZK_REGION_OPENED 
 2012-02-13 18:38:07,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Successfully deleted unassigned node for region 
 33cf229845b1009aa8a3f7b0f85c9bd0 in expected state RS_ZK_REGION_OPENED 
 2012-02-13 18:38:07,573 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. on 
 r03f11025.yh.aliyun.com,60020,1329127549907 
 2012-02-13 18:50:54,428 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan 
 was found (or we are ignoring an existing plan) for 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. so 
 generated a random one; 
 hri=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0., 
 src=, dest=r01b05043.yh.aliyun.com,60020,1329127549041; 29 (online=29, 
 exclude=null) available servers 
 2012-02-13 18:50:54,428 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. to 
 r01b05043.yh.aliyun.com,60020,1329127549041 
 2012-02-13 19:31:50,514 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=PENDING_OPEN, ts=1329132528086 
 2012-02-13 19:31:50,514 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_OPEN for too long, reassigning 
 region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 Regionserver's log
 2012-02-13 18:07:43,537 INFO 
 org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
 region: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 2012-02-13 18:11:16,560 DEBUG 
 org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing 
 open of item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 From the RS's log, we can see that more than 3 minutes pass between 
 receiving the openRegion request and starting to process it, causing the RIT 
 timeout in the master for the region.
 Looking at the code of StartupBulkAssigner, we can see that regionPlans are 
 not added when assigning regions, 

[jira] [Updated] (HBASE-5422) StartupBulkAssigner would cause a lot of timeout on RIT when assigning large numbers of regions (timeout = 3 mins)

2012-02-16 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5422:


Attachment: 5422-90.patch

 StartupBulkAssigner would cause a lot of timeout on RIT when assigning large 
 numbers of regions (timeout = 3 mins)
 --

 Key: HBASE-5422
 URL: https://issues.apache.org/jira/browse/HBASE-5422
 Project: HBase
  Issue Type: Bug
  Components: master
Reporter: chunhui shen
 Attachments: 5422-90.patch, hbase-5422.patch


 In our production environment we see a lot of RIT timeouts when the cluster 
 comes up; there are about 70,000 regions in the cluster (25 regionservers).
 First, consider the following log for region 33cf229845b1009aa8a3f7b0f85c9bd0:
 master's log
 2012-02-13 18:07:41,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Async create of unassigned node for 
 33cf229845b1009aa8a3f7b0f85c9bd0 with OFFLINE state 
 2012-02-13 18:07:42,560 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager$CreateUnassignedAsyncCallback:
  rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=OFFLINE, ts=1329127661409, 
 server=r03f11025.yh.aliyun.com,60020,1329127549907 
 2012-02-13 18:07:42,996 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager$ExistsUnassignedAsyncCallback:
  rs=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=OFFLINE, ts=1329127661409 
 2012-02-13 18:10:48,072 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=PENDING_OPEN, ts=1329127662996
 2012-02-13 18:10:48,072 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_OPEN for too long, reassigning 
 region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 2012-02-13 18:11:16,744 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_OPENED, 
 server=r03f11025.yh.aliyun.com,60020,1329127549907, 
 region=33cf229845b1009aa8a3f7b0f85c9bd0 
 2012-02-13 18:38:07,310 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED 
 event for 33cf229845b1009aa8a3f7b0f85c9bd0; deleting unassigned node 
 2012-02-13 18:38:07,310 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Deleting existing unassigned node for 
 33cf229845b1009aa8a3f7b0f85c9bd0 that is in expected state 
 RS_ZK_REGION_OPENED 
 2012-02-13 18:38:07,314 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x348f4a94723da5 Successfully deleted unassigned node for region 
 33cf229845b1009aa8a3f7b0f85c9bd0 in expected state RS_ZK_REGION_OPENED 
 2012-02-13 18:38:07,573 DEBUG 
 org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. on 
 r03f11025.yh.aliyun.com,60020,1329127549907 
 2012-02-13 18:50:54,428 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan 
 was found (or we are ignoring an existing plan) for 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. so 
 generated a random one; 
 hri=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0., 
 src=, dest=r01b05043.yh.aliyun.com,60020,1329127549041; 29 (online=29, 
 exclude=null) available servers 
 2012-02-13 18:50:54,428 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Assigning region 
 item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. to 
 r01b05043.yh.aliyun.com,60020,1329127549041 
 2012-02-13 19:31:50,514 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 state=PENDING_OPEN, ts=1329132528086 
 2012-02-13 19:31:50,514 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_OPEN for too long, reassigning 
 region=item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 Regionserver's log
 2012-02-13 18:07:43,537 INFO 
 org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open 
 region: item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 2012-02-13 18:11:16,560 DEBUG 
 org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing 
 open of item_20120208,\x009,1328794343859.33cf229845b1009aa8a3f7b0f85c9bd0. 
 Through the RS's log, we can see that more than 3 minutes pass from receiving 
 the openRegion request to starting to process it, causing a timeout on the RIT 
 in the master for the region.
 Looking at the code of StartupBulkAssigner, we can see that regionPlans are not 
 added when assigning
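
For illustration, a minimal sketch of the missing bookkeeping (hypothetical names; not the actual patch): record a RegionPlan for each region before sending the open request, so the TimeoutMonitor sees an intended destination instead of generating a random plan after the 3-minute timeout.

{code}
// Hypothetical sketch of the idea behind the fix: record a plan for each
// region before handing it to a bulk assigner, so that when the timeout
// monitor fires it can find the intended destination instead of treating
// the region as unplanned. Names below are illustrative, not the HBase API.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class BulkAssignSketch {
  static class RegionPlan {
    final String encodedRegionName;
    final String destinationServer;
    RegionPlan(String region, String server) {
      this.encodedRegionName = region;
      this.destinationServer = server;
    }
  }

  // Shared with the timeout monitor.
  final Map<String, RegionPlan> regionPlans = new ConcurrentHashMap<String, RegionPlan>();

  void bulkAssign(String server, List<String> regions) {
    for (String region : regions) {
      // The missing step in the report: without this put, the timeout
      // monitor sees no plan and reassigns the region after 3 minutes.
      regionPlans.put(region, new RegionPlan(region, server));
      sendOpenRegionRequest(server, region);
    }
  }

  void sendOpenRegionRequest(String server, String region) {
    // The RPC to the regionserver would go here.
  }
}
{code}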

[jira] [Updated] (HBASE-5423) Regionserver may block on waitOnAllRegionsToClose when aborting

2012-02-16 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5423:


Attachment: hbase-5423.patch

We needn't wait for failed-close regions to close.

 Regionserver may block on waitOnAllRegionsToClose when aborting
 ---

 Key: HBASE-5423
 URL: https://issues.apache.org/jira/browse/HBASE-5423
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-5423.patch


 If closeRegion throws any exception (it would typically be caused by the FS) 
 while the RS is aborting, 
 the RS will block forever on waitOnAllRegionsToClose().
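
For illustration, a minimal sketch of the fix idea (hypothetical names and signatures; not the actual patch): while aborting, stop waiting once every region still online is one whose close already failed.

{code}
// Hedged sketch of the fix idea: when aborting, the only regions that will
// never leave the online map are the ones whose close already failed, so
// stop waiting once nothing else remains. Illustrative names only.
import java.util.Map;
import java.util.Set;

class AbortWaitSketch {
  static void waitOnAllRegionsToClose(boolean aborting,
                                      Map<String, Object> onlineRegions,
                                      Set<String> regionsFailedClose)
      throws InterruptedException {
    while (!onlineRegions.isEmpty()) {
      if (aborting && regionsFailedClose.containsAll(onlineRegions.keySet())) {
        break; // only failed-close regions remain; do not block forever
      }
      Thread.sleep(200);
    }
  }
}
{code}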

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5423) Regionserver may block forever on waitOnAllRegionsToClose when aborting

2012-02-16 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5423:


Summary: Regionserver may block forever on waitOnAllRegionsToClose when 
aborting  (was: Regionserver may block on waitOnAllRegionsToClose when aborting)

 Regionserver may block forever on waitOnAllRegionsToClose when aborting
 ---

 Key: HBASE-5423
 URL: https://issues.apache.org/jira/browse/HBASE-5423
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-5423.patch


 If closeRegion throws any exception (it would typically be caused by the FS) 
 while the RS is aborting, 
 the RS will block forever on waitOnAllRegionsToClose().

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5270) Handle potential data loss due to concurrent processing of processFaileOver and ServerShutdownHandler

2012-02-15 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5270:


Attachment: 5270-90-testcase.patch
5270-90.patch
5270-testcase.patch
hbase-5270.patch

I have written a testcase for this issue that shows the existing problem, in 
5270-testcase.patch.

hbase-5270.patch is the combination of the testcase and HBASE-5179's latest 
patch.

5270-90.patch and 5270-90-testcase.patch are for the 0.90 version.

For now, the testcase may not cover all the situations that cause problems, and 
it runs slowly.

I will optimize the testcase later.




 Handle potential data loss due to concurrent processing of processFaileOver 
 and ServerShutdownHandler
 -

 Key: HBASE-5270
 URL: https://issues.apache.org/jira/browse/HBASE-5270
 Project: HBase
  Issue Type: Sub-task
  Components: master
Reporter: Zhihong Yu
 Fix For: 0.94.0, 0.92.1

 Attachments: 5270-90-testcase.patch, 5270-90.patch, 
 5270-testcase.patch, hbase-5270.patch


 This JIRA continues the effort from HBASE-5179. Starting with Stack's 
 comments about patches for 0.92 and TRUNK:
 Reviewing 0.92v17
 isDeadServerInProgress is a new public method in ServerManager but it does 
 not seem to be used anywhere.
 Does isDeadRootServerInProgress need to be public? Ditto for meta version.
 This method's param names are not right: 'definitiveRootServer'; what is meant 
 by definitive? Do they need this qualifier?
 Is there anything in place to stop us expiring a server twice if it's carrying 
 root and meta?
 What is the difference between asking the assignment manager isCarryingRoot 
 and this variable that is passed in? Should be doc'd at least. Ditto for meta.
 I think I've asked for this a few times - onlineServers needs to be 
 explained... either in javadoc or in comment. This is the param passed into 
 joinCluster. How does it arise? I think I know but am unsure. God love the 
 poor noob that comes awandering this code trying to make sense of it all.
 It looks like we get the list by trawling zk for regionserver znodes that 
 have not checked in. Don't we do this operation earlier in master setup? Are 
 we doing it again here?
 Though distributed log splitting is configured, with this patch we will do 
 single-process splitting in the master under some conditions. It's not 
 explained in the code why we would do this. Why do we think master log 
 splitting is 'high priority' when it could very well be slower? Should we 
 only go this route if distributed splitting is not going on? Do we know if 
 concurrent distributed log splitting and master splitting work together?
 Why would we have dead servers in progress here in master startup? Because a 
 servershutdownhandler fired?
 This patch is different from the patch for 0.90. It should go into trunk 
 first with tests, then 0.92. Should it be in this issue? This issue is really 
 hard to follow now. Maybe this issue is for 0.90.x and a new issue for more 
 work on this trunk patch?
 This patch needs to have the v18 differences applied.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-20 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: 5179-90v16.patch

@Zhihong
90-v16 adds some logic about when to waitUntilNoLogDir and passes Jinchao's test.
Please take a review.
Thanks.
 

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 
 5179-90v16.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 
 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 
 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 
 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, 
 hbase-5179v12.patch, hbase-5179v5.patch, hbase-5179v6.patch, 
 hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-20 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: hbase-5179v17.patch

hbase-5179v17.patch is for trunk, changing the method names of DeadServer to 
be the same as in 90v17.txt.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 
 5179-90v16.patch, 5179-90v17.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 
 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 
 5179-v3.txt, 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, 
 hbase-5179v12.patch, hbase-5179v17.patch, hbase-5179v5.patch, 
 hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-19 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: 5179-90v12.patch

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v12.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 
 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 
 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 
 5179-v4.txt, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v5.patch, 
 hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-19 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: hbase-5179v12.patch

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v12.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 
 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 
 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 
 5179-v4.txt, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, 
 hbase-5179v8.patch, hbase-5179v9.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-19 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: 5179-90v14.patch

90v13 has a mistake in waitUntilNoLogDir:
{code}
if (this.fileSystemManager.logDirExists(serverName) != null) {
  break;
}
{code}
should be changed to
{code}
if (this.fileSystemManager.logDirExists(serverName) == null) {
  break;
}
{code}

It also guards against an NPE around this check.
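
For context, a hedged sketch of the kind of loop the corrected check sits in (illustrative names; the null guard is an assumption about where the NPE could arise, not the actual HBase code):

{code}
// Hypothetical sketch: poll until the dead server's log dir is gone. The
// corrected exit condition is == null (dir removed once splitting finished);
// the extra null check on the manager is one plausible NPE guard.
class LogDirWaitSketch {
  interface FsManager { Object logDirExists(String serverName); }

  static void waitUntilNoLogDir(FsManager fileSystemManager, String serverName)
      throws InterruptedException {
    while (true) {
      if (fileSystemManager == null
          || fileSystemManager.logDirExists(serverName) == null) {
        break; // log dir removed (or manager unavailable): splitting is done
      }
      Thread.sleep(100);
    }
  }
}
{code}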

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v2.patch, 
 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 
 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 
 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, 
 hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, 
 hbase-5179v8.patch, hbase-5179v9.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-19 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: 5179-90v15.patch

@Jinchao
I find that 90v14 also has a problem.
I'm sorry.
Could you use v15?
Thanks. 

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 
 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 
 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 
 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, 
 hbase-5179v8.patch, hbase-5179v9.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-17 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: hbase-5179v10.patch

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 
 5179-90v8.patch, 5179-90v9.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v10.patch, hbase-5179v5.patch, 
 hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-16 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: 5179-90v8.patch

@Zhihong @Jinchao
Based on 90v7.patch, 90v8 changes waiting forever into waiting with a timeout, 
and adds a guarantee for the following case:
the RS is dead but its zk node still exists when the master starts, so it is 
not considered a known server and its regions would be assigned without 
splitting its log.
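
A minimal sketch of the waiting change (hypothetical names; not the actual patch):

{code}
// Hedged sketch of "waiting forever" vs "waiting with a timeout": the master
// waits for log splitting to finish, but gives up after a bounded interval
// instead of blocking indefinitely. Illustrative names only.
class SplitWaitSketch {
  interface SplitStatus { boolean inProgress(String serverName); }

  static boolean waitForLogSplitting(SplitStatus status, String serverName,
                                     long timeoutMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (status.inProgress(serverName)) {
      if (System.currentTimeMillis() >= deadline) {
        return false; // give up after the timeout instead of blocking forever
      }
      Thread.sleep(100);
    }
    return true; // log splitting finished; safe to assign the regions
  }
}
{code}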

Thanks for the comment.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 
 5179-90v8.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-16 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: hbase-5179v8.patch

hbase-5179v8.patch is for trunk;
it is used to prevent meta data loss.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 
 5179-90v8.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-16 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: 5179-90v9.patch

Addressed Zhihong's and Jinchao's comments.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 
 5179-90v8.patch, 5179-90v9.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, 
 hbase-5179v8.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-16 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: hbase-5179v9.patch

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 
 5179-90v8.patch, 5179-90v9.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, 
 hbase-5179v8.patch, hbase-5179v9.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: hbase-5179v6.patch

In the v6 patch, I add the isDeadServerBeingProcessed logic for trunk and place 
it in ServerManager.


 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: hbase-5179v7.patch

v6 has a conflict with trunk.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: 5179-90v5.patch

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: 5179-90v6.patch

@Zhihong @Jinchao
In the 90v6 patch,
I add the logic of expiring a server while the master is initializing, as is 
done in trunk.

I also add a guarantee that the log split is completed before assigning META, 
for the case mentioned by Jinchao.

What do you think? Please do check.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-v2.txt, 5179-v3.txt, 
 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, 
 hbase-5179v7.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: 5179-90v7.patch

Changed 90v6 a little; please see 90v7.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 
 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, 
 hbase-5179v6.patch, hbase-5179v7.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

2012-01-11 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: hbase-5179v5.patch

In patch v5, I add javadoc to explain getDeadServersBeingProcessed() and 
getDeadServers(), and also add some more documentation in DeadServer about 
deadServersBeingProcessed.

About Stack's comment that a server is either in inProgress or in the 
deadServers list:
I think a server could be in both the processingDeadServers list and the 
deadServers list. The deadServers list only stores one instance per 
regionserver, but the processingDeadServers list may store multiple instances 
for one regionserver with several startcodes, as sketched below.
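
A small sketch of the two collections as described (illustrative; not the actual DeadServer class):

{code}
// Hedged sketch of the semantics described above: deadServers holds at most
// one entry per host:port, while processingDeadServers may hold several
// entries for the same host, one per startcode of each incarnation whose
// ServerShutdownHandler is still running. Illustrative names only.
import java.util.HashSet;
import java.util.Set;

class DeadServerSketch {
  private final Set<String> deadServers = new HashSet<String>();           // host:port
  private final Set<String> processingDeadServers = new HashSet<String>(); // host:port,startcode

  void serverDied(String hostAndPort, long startcode) {
    deadServers.add(hostAndPort); // at most one entry per regionserver
    processingDeadServers.add(hostAndPort + "," + startcode);
  }

  void finishedProcessing(String hostAndPort, long startcode) {
    processingDeadServers.remove(hostAndPort + "," + startcode);
  }

  boolean isBeingProcessed(String hostAndPort) {
    for (String s : processingDeadServers) {
      if (s.startsWith(hostAndPort + ",")) {
        return true; // some incarnation of this host is still being handled
      }
    }
    return false;
  }
}
{code}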

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v5.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

2012-01-11 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: 5179-90v2.patch

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-v2.txt, 5179-v3.txt, 
 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5165) Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause deleted region to assign again

2012-01-10 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5165:


Attachment: hbase-5165v2.patch

Modified it per Ted's comment in 5165v2.patch (DeleteTableHandler doesn't set 
the table enabled until no ServerShutdownHandler is running).

Since fixupDaughters() will not be executed if the table is disabled, I think 
patch v2 fixes this issue, including HBASE-5155.

Waiting for a better approach; a sketch of the idea follows.
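
A minimal sketch of the ordering fix (hypothetical names such as areDeadServersInProgress and setEnabled; not the actual patch):

{code}
// Hedged sketch: only mark the table enabled again once no
// ServerShutdownHandler is in flight, so a racing handler cannot re-assign
// a region that the delete just removed from META and the FS.
class DeleteTableSketch {
  interface Servers { boolean areDeadServersInProgress(); }
  interface TableState { void setEnabled(String tableName); }

  static void finishDeleteTable(Servers serverManager, TableState zkTable,
                                String tableName) throws InterruptedException {
    while (serverManager.areDeadServersInProgress()) {
      Thread.sleep(100); // wait out every running ServerShutdownHandler
    }
    // Safe now: no handler can still be holding a stale list of this
    // table's regions.
    zkTable.setEnabled(tableName);
  }
}
{code}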


 Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
 cause deleted region to assign again
 --

 Key: HBASE-5165
 URL: https://issues.apache.org/jira/browse/HBASE-5165
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: chunhui shen
 Attachments: hbase-5165.patch, hbase-5165v2.patch


 Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
 cause the following situation:
 1. The table has already been disabled.
 2. ServerShutdownHandler is doing MetaReader.getServerUserRegions.
 3. While step 2 is processing or has just completed, DeleteTableHandler 
 starts to delete the region (removing it from META and deleting it from the FS).
 4. DeleteTableHandler sets the table enabled.
 5. ServerShutdownHandler starts to assign the region that was already deleted 
 by DeleteTableHandler.
 The result of the above operations is an invalid record in .META. that 
 can't be fixed by hbck.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

2012-01-10 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Affects Version/s: 0.90.2

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen

 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master was completing splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

2012-01-10 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Attachment: hbase-5179.patch

In the patch,
we ensure that servers in ProcessingDeadServers are not processed in the 
master's failover handling, as sketched below.
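
A minimal sketch of the guard (illustrative names; not the actual patch):

{code}
// Hedged sketch: during master failover, skip regions whose host server is
// still being processed by a ServerShutdownHandler, so they are only
// assigned after that handler has finished splitting the server's logs.
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

class FailoverAssignSketch {
  static void rebuildUserRegions(Map<String, String> regionToServer,
                                 Predicate<String> isServerBeingProcessed,
                                 Consumer<String> assign) {
    for (Map.Entry<String, String> e : regionToServer.entrySet()) {
      if (isServerBeingProcessed.test(e.getValue())) {
        // ServerShutdownHandler owns this server: it will assign the region
        // itself once log splitting completes; assigning now risks data loss.
        continue;
      }
      assign.accept(e.getKey());
    }
  }
}
{code}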

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-5179.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master was completing splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

2012-01-10 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5179:


Description: 
If the master's failover processing and ServerShutdownHandler's processing 
happen concurrently, the following case may appear:
1. The master completed splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a 
dead server.
4. The master starts to assign RegionserverA's regions because step 3 marked it 
as a dead server.

However, while step 4 (assigning regions) runs, ServerShutdownHandler may still 
be splitting the log; therefore data loss may occur.

  was:
If the master's failover processing and ServerShutdownHandler's processing 
happen concurrently, the following case may appear:
1. The master was completing splitLogAfterStartup().
2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
3. The master starts to rebuildUserRegions, and RegionserverA is considered a 
dead server.
4. The master starts to assign RegionserverA's regions because step 3 marked it 
as a dead server.

However, while step 4 (assigning regions) runs, ServerShutdownHandler may still 
be splitting the log; therefore data loss may occur.


 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-5179.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completed splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered 
 a dead server.
 4. The master starts to assign RegionserverA's regions because step 3 marked 
 it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may 
 still be splitting the log; therefore data loss may occur.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5152) Region is on service before completing initialized when doing rollback of split, it will affect read correctness

2012-01-09 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5152:


Summary: Region is on service before completing initialized when doing 
rollback of split, it will affect read correctness   (was: Region is on service 
before completing initialized when doing rollback of split, it will affect 
readcorrectness )

 Region is on service before completing initialized when doing rollback of 
 split, it will affect read correctness 
 -

 Key: HBASE-5152
 URL: https://issues.apache.org/jira/browse/HBASE-5152
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
 Attachments: hbase-5152.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5165) Concurrent processing of DeleteTableHandler and ServerShutdownHandler may cause deleted region to assign again

2012-01-09 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5165:


Attachment: hbase-5165.patch

 Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
 cause deleted region to assign again
 --

 Key: HBASE-5165
 URL: https://issues.apache.org/jira/browse/HBASE-5165
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: chunhui shen
 Attachments: hbase-5165.patch


 Concurrent processing of DeleteTableHandler and ServerShutdownHandler may 
 cause the following situation:
 1. The table has already been disabled.
 2. ServerShutdownHandler is doing MetaReader.getServerUserRegions.
 3. While step 2 is processing or has just completed, DeleteTableHandler 
 starts to delete the region (removing it from META and deleting it from the FS).
 4. DeleteTableHandler sets the table enabled.
 5. ServerShutdownHandler starts to assign the region that was already deleted 
 by DeleteTableHandler.
 The result of the above operations is an invalid record in .META. that 
 can't be fixed by hbck.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5121) MajorCompaction may affect scan's correctness

2012-01-05 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5121:


Attachment: hbase-5121-testcase.patch
hbase-5121v2.patch

 MajorCompaction may affect scan's correctness
 -

 Key: HBASE-5121
 URL: https://issues.apache.org/jira/browse/HBASE-5121
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-5121-testcase.patch, hbase-5121.patch, 
 hbase-5121v2.patch


 In our test, there are KeyValues of two families for one row.
 We found an infrequent problem in scan's next() when a 
 majorCompaction happens concurrently.
 In two consecutive client calls to scan.next():
 1.The first next() returns a result in which family A is null.
 2.The second next() returns a result in which family B is null.
 Both next() results have the same row.
 If there are more families, the scenario will be even stranger...
 We found the reason: storescanner.peek() changes after a 
 majorCompaction if there are delete-type KeyValues.
 This change means the PriorityQueue<KeyValueScanner> backing the RegionScanner's 
 heap is no longer guaranteed to be sorted.
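
 To see why a scanner whose peek() changes in place breaks the heap, here is a 
 self-contained Java illustration (plain java.util, not HBase code): a 
 PriorityQueue orders elements only at insertion time, so if the key an element 
 sorts on mutates while it sits in the queue, poll() can return a stale head.

 {code}
 import java.util.PriorityQueue;

 public class HeapInvariantDemo {
   static final class Scanner implements Comparable<Scanner> {
     long peekKey;
     Scanner(long k) { peekKey = k; }
     public int compareTo(Scanner o) { return Long.compare(peekKey, o.peekKey); }
   }

   public static void main(String[] args) {
     PriorityQueue<Scanner> heap = new PriorityQueue<Scanner>();
     Scanner a = new Scanner(1), b = new Scanner(2);
     heap.add(a);
     heap.add(b);
     a.peekKey = 3;                         // key changes underneath the heap
     System.out.println(heap.poll() == a);  // true: stale head still returned
   }
 }
 {code}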

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5121) MajorCompaction may affect scan's correctness

2012-01-03 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5121:


Attachment: hbase-5121.patch

 MajorCompaction may affect scan's correctness
 -

 Key: HBASE-5121
 URL: https://issues.apache.org/jira/browse/HBASE-5121
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: chunhui shen
 Attachments: hbase-5121.patch


 In our test, there are KeyValues of two families for one row.
 We found an infrequent problem in scan's next() when a 
 majorCompaction happens concurrently.
 In two consecutive client calls to scan.next():
 1.The first next() returns a result in which family A is null.
 2.The second next() returns a result in which family B is null.
 Both next() results have the same row.
 If there are more families, the scenario will be even stranger...
 We found the reason: storescanner.peek() changes after a 
 majorCompaction if there are delete-type KeyValues.
 This change means the PriorityQueue<KeyValueScanner> backing the RegionScanner's 
 heap is no longer guaranteed to be sorted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-5100) Rollback of split would cause closed region to opened

2011-12-27 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5100:


Attachment: hbase-5100.patch

 Rollback of split would cause closed region to opened 
 --

 Key: HBASE-5100
 URL: https://issues.apache.org/jira/browse/HBASE-5100
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-5100.patch


 If the master's close-region request to the regionserver and the region's 
 split transaction happen concurrently,
 a closed region may end up opened. 
 See the detailed code in SplitTransaction#createDaughters:
 {code}
 List<StoreFile> hstoreFilesToSplit = null;
 try {
   hstoreFilesToSplit = this.parent.close(false);
   if (hstoreFilesToSplit == null) {
     // The region was closed by a concurrent thread.  We can't continue
     // with the split, instead we must just abandon the split.  If we
     // reopen or split this could cause problems because the region has
     // probably already been moved to a different server, or is in the
     // process of moving to a different server.
     throw new IOException("Failed to close region: already closed by " +
       "another thread");
   }
 } finally {
   this.journal.add(JournalEntry.CLOSED_PARENT_REGION);
 }
 {code}
 When rolling back, the JournalEntry.CLOSED_PARENT_REGION entry causes 
 this.parent.initialize() to run.
 Although this region is not onlined in the regionserver, it may bring some 
 potential problems.
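
 The root cause is visible above: the journal entry is added in a finally 
 block, so it is recorded even when close(false) found the region already 
 closed by another thread. A hedged sketch of one way to avoid that (not 
 necessarily what hbase-5100.patch does): record the entry only when this 
 thread actually closed the parent.

 {code}
 // Hedged sketch: journal CLOSED_PARENT_REGION only on the success path, so
 // rollback never re-initializes a region this split did not close itself.
 List<StoreFile> hstoreFilesToSplit = this.parent.close(false);
 if (hstoreFilesToSplit == null) {
   // Closed by a concurrent thread; there is nothing for rollback to undo.
   throw new IOException("Failed to close region: already closed by " +
     "another thread");
 }
 this.journal.add(JournalEntry.CLOSED_PARENT_REGION);
 {code}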
 For example, in our environment, the closed parent region was rolled back 
 successfully, and then compaction and split started again.
 The parent region is f892dd6107b6b4130199582abc78e9c1
 master log
 {code}
 2011-12-26 00:24:42,693 INFO org.apache.hadoop.hbase.master.HMaster: balance 
 hri=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1.,
  src=dw87.kgb.sqa.cm4,60020,1324827866085, 
 dest=dw80.kgb.sqa.cm4,60020,1324827865780
 2011-12-26 00:24:42,693 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region 
 writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1.
  (offlining)
 2011-12-26 00:24:42,694 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 serverName=dw87.kgb.sqa.cm4,60020,1324827866085, load=(requests=0, regions=0, 
 usedHeap=0, maxHeap=0) for region 
 writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1.
 2011-12-26 00:24:42,699 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned 
 node: /hbase-tbfs/unassigned/f892dd6107b6b4130199582abc78e9c1 
 (region=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1.,
  server=dw87.kgb.sqa.cm4,60020,1324827866085, state=RS_ZK_REGION_CLOSING)
 2011-12-26 00:24:42,699 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_CLOSING, server=dw87.kgb.sqa.cm4,60020,1324827866085, 
 region=f892dd6107b6b4130199582abc78e9c1
 2011-12-26 00:24:45,348 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_CLOSED, server=dw87.kgb.sqa.cm4,60020,1324827866085, 
 region=f892dd6107b6b4130199582abc78e9c1
 2011-12-26 00:24:45,349 DEBUG 
 org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
 event for f892dd6107b6b4130199582abc78e9c1
 2011-12-26 00:24:45,349 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; 
 was=writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1.
  state=CLOSED, ts=1324830285347
 2011-12-26 00:24:45,349 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:6-0x13447f283f40e73 Creating (or updating) unassigned node for 
 f892dd6107b6b4130199582abc78e9c1 with OFFLINE state
 2011-12-26 00:24:45,354 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=M_ZK_REGION_OFFLINE, server=dw75.kgb.sqa.cm4:6, 
 region=f892dd6107b6b4130199582abc78e9c1
 2011-12-26 00:24:45,354 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for 
 writetest,8ZW417DZP93OU6SZ0QQMKTALTDP4883KW5AXSAFMQ952Y6J6VPPXEXRRPCWBR2PK7DQV3RKK28222JMOJSW3JJ8AB05MIREM1CL6,1324829936318.f892dd6107b6b4130199582abc78e9c1.
  destination server is + serverName=dw80.kgb.sqa.cm4,60020,1324827865780, 
 load=(requests=0, regions=1, 

[jira] [Updated] (HBASE-5020) MetaReader#fullScan doesn't stop scanning when visitor returns false in 0.90 version

2011-12-13 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-5020:


Attachment: hbase-5020.patch

 MetaReader#fullScan doesn't stop scanning when visitor returns false in 0.90 
 version
 

 Key: HBASE-5020
 URL: https://issues.apache.org/jira/browse/HBASE-5020
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
 Attachments: hbase-5020.patch


 In the current 0.90 code,
 {code}
 public static void fullScan(CatalogTracker catalogTracker,
     final Visitor visitor, final byte [] startrow)
 throws IOException {
   HRegionInterface metaServer =
       catalogTracker.waitForMetaServerConnectionDefault();
   Scan scan = new Scan();
   if (startrow != null) scan.setStartRow(startrow);
   scan.addFamily(HConstants.CATALOG_FAMILY);
   long scannerid = metaServer.openScanner(
       HRegionInfo.FIRST_META_REGIONINFO.getRegionName(), scan);
   try {
     Result data;
     while ((data = metaServer.next(scannerid)) != null) {
       if (!data.isEmpty()) visitor.visit(data);
     }
   } finally {
     metaServer.close(scannerid);
   }
   return;
 }
 {code}
 If visitor.visit(data) returns false, the scan does not stop;
 however, that contradicts the documented Visitor contract:
 {code}
 public interface Visitor {
   /**
    * Visit the catalog table row.
    * @param r A row from catalog table
    * @return True if we are to proceed scanning the table, else false if
    * we are to stop now.
    */
   public boolean visit(final Result r) throws IOException;
 }
 {code}
 I think this is an oversight; trunk does not have this hole.
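
 For reference, honoring the contract only takes breaking out of the loop when 
 the visitor returns false. A minimal sketch of the loop body (whether 
 hbase-5020.patch is written exactly this way is not shown here):

 {code}
 Result data;
 while ((data = metaServer.next(scannerid)) != null) {
   if (!data.isEmpty() && !visitor.visit(data)) {
     break;  // the visitor asked us to stop scanning
   }
 }
 {code}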

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4880) Region is on service before openRegionHandler completes, may cause data loss

2011-12-08 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4880:


Attachment: hbase-4880v4.patch

 Region is on service before openRegionHandler completes, may cause data loss
 

 Key: HBASE-4880
 URL: https://issues.apache.org/jira/browse/HBASE-4880
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0, 0.94.0
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 4880.txt, hbase-4880.patch, hbase-4880v2.patch, 
 hbase-4880v3.patch, hbase-4880v4.patch


 OpenRegionHandler in the regionserver proceeds in the following steps:
 {code}
 1.openregion() (after it, closed = false, closing = false)
 2.addToOnlineRegions(region)
 3.update the .META. table 
 4.update the ZK node state to RS_ZK_REGION_OPENED
 {code}
 We can see that the region is in service before step 4,
 which means a client could put data to this region after step 3.
 What happens if step 4 fails?
 OpenRegionHandler#cleanupFailedOpen will execute, which closes the 
 region, and the master assigns this region to another regionserver.
 If closing the region fails, the data put between step 3 and step 4 may be 
 lost, because the region has already been opened on another regionserver and 
 received new data. It may then not be recovered through replayRecoveredEdit(), 
 because the edit's LogSeqId is smaller than the current region SeqId.
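
 A hedged sketch of the safer ordering this report argues for (all method 
 names below stand in for the real handler steps and are not exact HBase 
 APIs): the region joins the online set only after the ZK transition succeeds, 
 so no write can land on a region whose open might still be rolled back.

 {code}
 // Hedged sketch, hypothetical identifiers throughout:
 HRegion region = openRegion();            // closed = false, closing = false
 updateMeta(region);                       // step 3
 if (!transitionZkNodeToOpened(region)) {  // do step 4 before serving...
   cleanupFailedOpen(region);              // safe: region never took writes
   return;
 }
 addToOnlineRegions(region);               // ...then put the region in service
 {code}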

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4988) MetaServer crash cause all splitting regionserver abort

2011-12-08 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4988:


Attachment: hbase-4988v1.patch

 MetaServer crash cause all splitting regionserver abort
 ---

 Key: HBASE-4988
 URL: https://issues.apache.org/jira/browse/HBASE-4988
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
 Attachments: hbase-4988v1.patch


 If the metaserver crashes now,
 all the splitting regionservers will abort themselves,
 because of this code:
 {code}
 this.journal.add(JournalEntry.PONR);
 MetaEditor.offlineParentInMeta(server.getCatalogTracker(),
   this.parent.getRegionInfo(), a.getRegionInfo(), 
   b.getRegionInfo());
 {code}
 If the journal contains the PONR entry, the split's rollback will abort the 
 regionserver itself.
 This is terrible in a heavy-write environment when the metaserver crashes.
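
 Why a PONR entry forces an abort, as a hedged sketch of the rollback shape 
 (the case labels match the journal entries quoted above; the rest is 
 illustrative, not the exact SplitTransaction code):

 {code}
 // Rollback walks the journal backwards and undoes each step. Once PONR has
 // been recorded, the META edit may be half-applied and cannot be undone, so
 // the only consistent option left is to abort the server.
 while (!journal.isEmpty()) {
   JournalEntry je = journal.removeLast();  // journal treated as a deque here
   switch (je) {
     case PONR:
       throw new RuntimeException(
           "Failed rollback past the point of no return; aborting");
     case CLOSED_PARENT_REGION:
       this.parent.initialize();  // reopen the parent closed earlier
       break;
     // ... undo the remaining journal entries ...
   }
 }
 {code}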

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4880) Region is on service before completing openRegionHanlder, may cause data loss

2011-12-06 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4880:


Attachment: hbase-4880v2.patch

 Region is on service before completing openRegionHanlder, may cause data loss
 -

 Key: HBASE-4880
 URL: https://issues.apache.org/jira/browse/HBASE-4880
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-4880.patch, hbase-4880v2.patch


 OpenRegionHandler in the regionserver proceeds in the following steps:
 {code}
 1.openregion() (after it, closed = false, closing = false)
 2.addToOnlineRegions(region)
 3.update the .META. table 
 4.update the ZK node state to RS_ZK_REGION_OPENED
 {code}
 We can see that the region is in service before step 4,
 which means a client could put data to this region after step 3.
 What happens if step 4 fails?
 OpenRegionHandler#cleanupFailedOpen will execute, which closes the 
 region, and the master assigns this region to another regionserver.
 If closing the region fails, the data put between step 3 and step 4 may be 
 lost, because the region has already been opened on another regionserver and 
 received new data. It may then not be recovered through replayRecoveredEdit(), 
 because the edit's LogSeqId is smaller than the current region SeqId.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4880) Region is on service before completing openRegionHanlder, may cause data loss

2011-12-06 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4880:


Attachment: hbase-4880v3.patch

 Region is on service before completing openRegionHanlder, may cause data loss
 -

 Key: HBASE-4880
 URL: https://issues.apache.org/jira/browse/HBASE-4880
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-4880.patch, hbase-4880v2.patch, hbase-4880v3.patch


 OpenRegionHandler in the regionserver proceeds in the following steps:
 {code}
 1.openregion() (after it, closed = false, closing = false)
 2.addToOnlineRegions(region)
 3.update the .META. table 
 4.update the ZK node state to RS_ZK_REGION_OPENED
 {code}
 We can see that the region is in service before step 4,
 which means a client could put data to this region after step 3.
 What happens if step 4 fails?
 OpenRegionHandler#cleanupFailedOpen will execute, which closes the 
 region, and the master assigns this region to another regionserver.
 If closing the region fails, the data put between step 3 and step 4 may be 
 lost, because the region has already been opened on another regionserver and 
 received new data. It may then not be recovered through replayRecoveredEdit(), 
 because the edit's LogSeqId is smaller than the current region SeqId.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4899) Region would be assigned twice easily with continually killing server and moving region in testing environment

2011-11-30 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4899:


Attachment: hbase-4899v2.patch

 Region would be assigned twice easily with continually  killing server and 
 moving region in testing environment
 ---

 Key: HBASE-4899
 URL: https://issues.apache.org/jira/browse/HBASE-4899
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: hbase-4899.patch, hbase-4899v2.patch


 Before assigning a region in ServerShutdownHandler#process, it checks 
 whether the region is in RIT;
 however, this check doesn't work as expected in the following case:
 1.move region A from server B to server C
 2.kill server B
 3.start server B immediately
 Let's see what happens in the code for the above case:
 {code}
 for step1:
 1.1 server B closes region A
 1.2 master setOffline for region 
 A (AssignmentManager#setOffline: this.regions.remove(regionInfo))
 1.3 server C starts to open region A (not completed)
 for step3:
 master ServerShutdownHandler#process() for server B
 {
 ..
 splitlog()
 ...
 List<RegionState> regionsInTransition =
   this.services.getAssignmentManager()
     .processServerShutdown(this.serverName);
 ...
 // Skip regions that were in transition unless CLOSING or PENDING_CLOSE
 ...
 assign region
 }
 {code}
 In fact, when 
 ServerShutdownHandler#process() runs this.services.getAssignmentManager().processServerShutdown(this.serverName),
 region A is in RIT (step 1.3 not completed), but the returned List<RegionState> 
 regionsInTransition doesn't contain it, because region A was removed from 
 AssignmentManager.regions by AssignmentManager#setOffline in step 1.2.
 Therefore, region A will be assigned twice.
 Actually, killing and restarting one server twice will also easily cause a 
 region to be assigned twice.
 Besides the above reason, there is another possibility: 
 when ServerShutdownHandler#process() executes MetaReader.getServerUserRegions, 
 a region that is in RIT at that moment is included.
 But by the time MetaReader.getServerUserRegions completes, the region has been 
 opened on another server and is no longer in RIT.
 In our testing environment, where balancing, moving and killing run 
 periodically, assigning a region twice happens often, and it is painful because 
 it affects other test cases.
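
 One way to narrow the first window, sketched with heavy hedging (the flow is 
 hypothetical; ZKAssign.getData existed in this era, but whether the patch 
 uses it is not shown here): treat a region as in transition if its unassigned 
 ZK node exists, even after AssignmentManager.regions has dropped it.

 {code}
 // Hedged sketch, not the committed fix:
 boolean inTransition = ritFromAssignmentManager.contains(hri)
     || ZKAssign.getData(watcher, hri.getEncodedName()) != null;
 if (!inTransition) {
   assignmentManager.assign(hri, true);  // nobody else is moving it
 }
 {code}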

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4899) Region would be assigned twice easily with continually killing server and moving region in testing environment

2011-11-30 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4899:


Attachment: hbase-4899v3.patch

 Region would be assigned twice easily with continually  killing server and 
 moving region in testing environment
 ---

 Key: HBASE-4899
 URL: https://issues.apache.org/jira/browse/HBASE-4899
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.1
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Attachments: hbase-4899.patch, hbase-4899v2.patch, hbase-4899v3.patch


 Before assigning a region in ServerShutdownHandler#process, it checks 
 whether the region is in RIT;
 however, this check doesn't work as expected in the following case:
 1.move region A from server B to server C
 2.kill server B
 3.start server B immediately
 Let's see what happens in the code for the above case:
 {code}
 for step1:
 1.1 server B closes region A
 1.2 master setOffline for region 
 A (AssignmentManager#setOffline: this.regions.remove(regionInfo))
 1.3 server C starts to open region A (not completed)
 for step3:
 master ServerShutdownHandler#process() for server B
 {
 ..
 splitlog()
 ...
 List<RegionState> regionsInTransition =
   this.services.getAssignmentManager()
     .processServerShutdown(this.serverName);
 ...
 // Skip regions that were in transition unless CLOSING or PENDING_CLOSE
 ...
 assign region
 }
 {code}
 In fact, when 
 ServerShutdownHandler#process() runs this.services.getAssignmentManager().processServerShutdown(this.serverName),
 region A is in RIT (step 1.3 not completed), but the returned List<RegionState> 
 regionsInTransition doesn't contain it, because region A was removed from 
 AssignmentManager.regions by AssignmentManager#setOffline in step 1.2.
 Therefore, region A will be assigned twice.
 Actually, killing and restarting one server twice will also easily cause a 
 region to be assigned twice.
 Besides the above reason, there is another possibility: 
 when ServerShutdownHandler#process() executes MetaReader.getServerUserRegions, 
 a region that is in RIT at that moment is included.
 But by the time MetaReader.getServerUserRegions completes, the region has been 
 opened on another server and is no longer in RIT.
 In our testing environment, where balancing, moving and killing run 
 periodically, assigning a region twice happens often, and it is painful because 
 it affects other test cases.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss

2011-11-27 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4862:


Attachment: hbase-4862v7fortrunk.patch
hbase-4862v7for0.90.patch

Based on patchV6, updated the javadoc of HLog#getSplitEditFilesSorted.

 Splitting hlog and opening region concurrently may cause data loss
 --

 Key: HBASE-4862
 URL: https://issues.apache.org/jira/browse/HBASE-4862
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4862-v6-90.txt, 4862-v6-trunk.patch, 4862.patch, 
 4862.txt, hbase-4862v1 for 0.90.diff, hbase-4862v1 for 0.90.diff, 
 hbase-4862v1 for trunk.diff, hbase-4862v1 for trunk.diff, 
 hbase-4862v2for0.90.diff, hbase-4862v2fortrunk.diff, 
 hbase-4862v3for0.90.diff, hbase-4862v3fortrunk.diff, 
 hbase-4862v5for0.90.diff, hbase-4862v5fortrunk.diff, 
 hbase-4862v7for0.90.patch, hbase-4862v7fortrunk.patch


 Case Description:
 1.The split-hlog thread creates a writer for the file region A/recovered.edits/123456 
 and is appending log entries.
 2.A regionserver is opening region A at the same time; in the process of 
 replayRecoveredEditsIfAny() it will delete the file region 
 A/recovered.edits/123456.
 3.The split-hlog thread catches the IO exception and stops parsing this log 
 file; if skipError = true, it adds the file to the corrupt logs. However, data 
 for other regions in this log file will be lost.
 4.Or, if skipError = false, it checks the filesystem. Of course, the 
 filesystem is fine, so it only prints an error log and continues assigning 
 regions. Therefore, data in other log files will also be lost!!
 The case may happen as follows:
 1.Move a region from server A to server B.
 2.Kill server A and server B.
 3.Restart server A and server B.
 We could prevent this exception by forbidding deletion of a recovered.edits 
 file that the split-hlog thread is still appending to.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4880) Region is on service before completing openRegionHanlder, may cause data loss

2011-11-27 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4880:


Attachment: hbase-4880.patch

Region isn't in service until openRegionHandler completes successfully.

 Region is on service before completing openRegionHanlder, may cause data loss
 -

 Key: HBASE-4880
 URL: https://issues.apache.org/jira/browse/HBASE-4880
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
 Attachments: hbase-4880.patch


 OpenRegionHandler in the regionserver proceeds in the following steps:
 {code}
 1.openregion() (after it, closed = false, closing = false)
 2.addToOnlineRegions(region)
 3.update the .META. table 
 4.update the ZK node state to RS_ZK_REGION_OPENED
 {code}
 We can see that the region is in service before step 4,
 which means a client could put data to this region after step 3.
 What happens if step 4 fails?
 OpenRegionHandler#cleanupFailedOpen will execute, which closes the 
 region, and the master assigns this region to another regionserver.
 If closing the region fails, the data put between step 3 and step 4 may be 
 lost, because the region has already been opened on another regionserver and 
 received new data. It may then not be recovered through replayRecoveredEdit(), 
 because the edit's LogSeqId is smaller than the current region SeqId.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4878) Master crash when spliting hlog may cause data loss

2011-11-27 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4878:


Attachment: hbase-4878v2.patch

@ramkrishna
Done in patchV2.
Thanks.

 Master crash when spliting hlog may cause data loss
 ---

 Key: HBASE-4878
 URL: https://issues.apache.org/jira/browse/HBASE-4878
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
 Attachments: hbase-4878.diff, hbase-4878v2.patch


 Let's look at the code of HlogSplitter#splitLog(final FileStatus[] logfiles):
 {code}
 private List<Path> splitLog(final FileStatus[] logfiles) throws IOException {
   try {
     for (FileStatus log : logfiles) {
       parseHLog(in, logPath, entryBuffers, fs, conf, skipErrors);
     }
     archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf);
   } finally {
     status.setStatus("Finishing writing output logs and closing down.");
     splits = outputSink.finishWritingAndClose();
   }
 }
 {code}
 If the master is killed after archiveLogs(srcDir, corruptedLogs, 
 processedLogs, oldLogDir, fs, conf) has finished 
 but before splits = outputSink.finishWritingAndClose() has finished,
 log data would be lost!
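
 A hedged sketch of the ordering that removes the window (reusing the 
 identifiers from the snippet above; this is the shape of a fix, not 
 necessarily hbase-4878.diff): finish and close the recovered-edits output 
 before the source hlogs are archived away.

 {code}
 // Hedged sketch: finish the output sink first, archive sources last, so a
 // crash can only leave extra source logs behind, never lose written edits.
 private List<Path> splitLog(final FileStatus[] logfiles) throws IOException {
   try {
     for (FileStatus log : logfiles) {
       parseHLog(in, logPath, entryBuffers, fs, conf, skipErrors);
     }
   } finally {
     status.setStatus("Finishing writing output logs and closing down.");
     splits = outputSink.finishWritingAndClose();
   }
   archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf);
   return splits;
 }
 {code}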

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss

2011-11-26 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4862:


Attachment: hbase-4862v5fortrunk.diff
hbase-4862v5for0.90.diff

Added a test case in patchV5.

 Splitting hlog and opening region concurrently may cause data loss
 --

 Key: HBASE-4862
 URL: https://issues.apache.org/jira/browse/HBASE-4862
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4862.patch, 4862.txt, hbase-4862v1 for 0.90.diff, 
 hbase-4862v1 for 0.90.diff, hbase-4862v1 for trunk.diff, hbase-4862v1 for 
 trunk.diff, hbase-4862v2for0.90.diff, hbase-4862v2fortrunk.diff, 
 hbase-4862v3for0.90.diff, hbase-4862v3fortrunk.diff, 
 hbase-4862v5for0.90.diff, hbase-4862v5fortrunk.diff


 Case Description:
 1.The split-hlog thread creates a writer for the file region A/recovered.edits/123456 
 and is appending log entries.
 2.A regionserver is opening region A at the same time; in the process of 
 replayRecoveredEditsIfAny() it will delete the file region 
 A/recovered.edits/123456.
 3.The split-hlog thread catches the IO exception and stops parsing this log 
 file; if skipError = true, it adds the file to the corrupt logs. However, data 
 for other regions in this log file will be lost.
 4.Or, if skipError = false, it checks the filesystem. Of course, the 
 filesystem is fine, so it only prints an error log and continues assigning 
 regions. Therefore, data in other log files will also be lost!!
 The case may happen as follows:
 1.Move a region from server A to server B.
 2.Kill server A and server B.
 3.Restart server A and server B.
 We could prevent this exception by forbidding deletion of a recovered.edits 
 file that the split-hlog thread is still appending to.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4878) Master crash when spliting hlog may cause data loss

2011-11-26 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4878:


Attachment: hbase-4878.diff

 Master crash when spliting hlog may cause data loss
 ---

 Key: HBASE-4878
 URL: https://issues.apache.org/jira/browse/HBASE-4878
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
 Attachments: hbase-4878.diff


 Let's look at the code of HlogSplitter#splitLog(final FileStatus[] logfiles):
 {code}
 private List<Path> splitLog(final FileStatus[] logfiles) throws IOException {
   try {
     for (FileStatus log : logfiles) {
       parseHLog(in, logPath, entryBuffers, fs, conf, skipErrors);
     }
     archiveLogs(srcDir, corruptedLogs, processedLogs, oldLogDir, fs, conf);
   } finally {
     status.setStatus("Finishing writing output logs and closing down.");
     splits = outputSink.finishWritingAndClose();
   }
 }
 {code}
 If the master is killed after archiveLogs(srcDir, corruptedLogs, 
 processedLogs, oldLogDir, fs, conf) has finished 
 but before splits = outputSink.finishWritingAndClose() has finished,
 log data would be lost!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss

2011-11-25 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4862:


Attachment: hbase-4862v1 for trunk.diff
hbase-4862v1 for 0.90.diff

Granted license to ASF for the attached patch.

 Splitting hlog and opening region concurrently may cause data loss
 --

 Key: HBASE-4862
 URL: https://issues.apache.org/jira/browse/HBASE-4862
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4862.patch, 4862.txt, hbase-4862v1 for 0.90.diff, 
 hbase-4862v1 for 0.90.diff, hbase-4862v1 for trunk.diff, hbase-4862v1 for 
 trunk.diff


 Case Description:
 1.The split-hlog thread creates a writer for the file region A/recovered.edits/123456 
 and is appending log entries.
 2.A regionserver is opening region A at the same time; in the process of 
 replayRecoveredEditsIfAny() it will delete the file region 
 A/recovered.edits/123456.
 3.The split-hlog thread catches the IO exception and stops parsing this log 
 file; if skipError = true, it adds the file to the corrupt logs. However, data 
 for other regions in this log file will be lost.
 4.Or, if skipError = false, it checks the filesystem. Of course, the 
 filesystem is fine, so it only prints an error log and continues assigning 
 regions. Therefore, data in other log files will also be lost!!
 The case may happen as follows:
 1.Move a region from server A to server B.
 2.Kill server A and server B.
 3.Restart server A and server B.
 We could prevent this exception by forbidding deletion of a recovered.edits 
 file that the split-hlog thread is still appending to.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss

2011-11-25 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4862:


Attachment: hbase-4862v2fortrunk.diff
hbase-4862v2for0.90.diff

@Ted
I have amended the patch.
Please check.

 Splitting hlog and opening region concurrently may cause data loss
 --

 Key: HBASE-4862
 URL: https://issues.apache.org/jira/browse/HBASE-4862
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4862.patch, 4862.txt, hbase-4862v1 for 0.90.diff, 
 hbase-4862v1 for 0.90.diff, hbase-4862v1 for trunk.diff, hbase-4862v1 for 
 trunk.diff, hbase-4862v2for0.90.diff, hbase-4862v2fortrunk.diff


 Case Description:
 1.The split-hlog thread creates a writer for the file region A/recovered.edits/123456 
 and is appending log entries.
 2.A regionserver is opening region A at the same time; in the process of 
 replayRecoveredEditsIfAny() it will delete the file region 
 A/recovered.edits/123456.
 3.The split-hlog thread catches the IO exception and stops parsing this log 
 file; if skipError = true, it adds the file to the corrupt logs. However, data 
 for other regions in this log file will be lost.
 4.Or, if skipError = false, it checks the filesystem. Of course, the 
 filesystem is fine, so it only prints an error log and continues assigning 
 regions. Therefore, data in other log files will also be lost!!
 The case may happen as follows:
 1.Move a region from server A to server B.
 2.Kill server A and server B.
 3.Restart server A and server B.
 We could prevent this exception by forbidding deletion of a recovered.edits 
 file that the split-hlog thread is still appending to.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss

2011-11-25 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4862:


Attachment: hbase-4862v3fortrunk.diff
hbase-4862v3for0.90.diff

 Splitting hlog and opening region concurrently may cause data loss
 --

 Key: HBASE-4862
 URL: https://issues.apache.org/jira/browse/HBASE-4862
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4862.patch, 4862.txt, hbase-4862v1 for 0.90.diff, 
 hbase-4862v1 for 0.90.diff, hbase-4862v1 for trunk.diff, hbase-4862v1 for 
 trunk.diff, hbase-4862v2for0.90.diff, hbase-4862v2fortrunk.diff, 
 hbase-4862v3for0.90.diff, hbase-4862v3fortrunk.diff


 Case Description:
 1.The split-hlog thread creates a writer for the file region A/recovered.edits/123456 
 and is appending log entries.
 2.A regionserver is opening region A at the same time; in the process of 
 replayRecoveredEditsIfAny() it will delete the file region 
 A/recovered.edits/123456.
 3.The split-hlog thread catches the IO exception and stops parsing this log 
 file; if skipError = true, it adds the file to the corrupt logs. However, data 
 for other regions in this log file will be lost.
 4.Or, if skipError = false, it checks the filesystem. Of course, the 
 filesystem is fine, so it only prints an error log and continues assigning 
 regions. Therefore, data in other log files will also be lost!!
 The case may happen as follows:
 1.Move a region from server A to server B.
 2.Kill server A and server B.
 3.Restart server A and server B.
 We could prevent this exception by forbidding deletion of a recovered.edits 
 file that the split-hlog thread is still appending to.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss

2011-11-24 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4862:


Attachment: hbase-4862v1 for 0.90.diff

 Splitting hlog and opening region concurrently may cause data loss
 --

 Key: HBASE-4862
 URL: https://issues.apache.org/jira/browse/HBASE-4862
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4862.patch, hbase-4862v1 for 0.90.diff, hbase-4862v1 for 
 trunk.diff


 Case Description:
 1.The split-hlog thread creates a writer for the file region A/recovered.edits/123456 
 and is appending log entries.
 2.A regionserver is opening region A at the same time; in the process of 
 replayRecoveredEditsIfAny() it will delete the file region 
 A/recovered.edits/123456.
 3.The split-hlog thread catches the IO exception and stops parsing this log 
 file; if skipError = true, it adds the file to the corrupt logs. However, data 
 for other regions in this log file will be lost.
 4.Or, if skipError = false, it checks the filesystem. Of course, the 
 filesystem is fine, so it only prints an error log and continues assigning 
 regions. Therefore, data in other log files will also be lost!!
 The case may happen as follows:
 1.Move a region from server A to server B.
 2.Kill server A and server B.
 3.Restart server A and server B.
 We could prevent this exception by forbidding deletion of a recovered.edits 
 file that the split-hlog thread is still appending to.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4862) Splitting hlog and opening region concurrently may cause data loss

2011-11-24 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4862:


Attachment: hbase-4862v1 for trunk.diff

 Splitting hlog and opening region concurrently may cause data loss
 --

 Key: HBASE-4862
 URL: https://issues.apache.org/jira/browse/HBASE-4862
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4862.patch, hbase-4862v1 for 0.90.diff, hbase-4862v1 for 
 trunk.diff


 Case Description:
 1.The split-hlog thread creates a writer for the file region A/recovered.edits/123456 
 and is appending log entries.
 2.A regionserver is opening region A at the same time; in the process of 
 replayRecoveredEditsIfAny() it will delete the file region 
 A/recovered.edits/123456.
 3.The split-hlog thread catches the IO exception and stops parsing this log 
 file; if skipError = true, it adds the file to the corrupt logs. However, data 
 for other regions in this log file will be lost.
 4.Or, if skipError = false, it checks the filesystem. Of course, the 
 filesystem is fine, so it only prints an error log and continues assigning 
 regions. Therefore, data in other log files will also be lost!!
 The case may happen as follows:
 1.Move a region from server A to server B.
 2.Kill server A and server B.
 3.Restart server A and server B.
 We could prevent this exception by forbidding deletion of a recovered.edits 
 file that the split-hlog thread is still appending to.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4862) Split hlog and open region currently happend may cause data loss

2011-11-23 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4862:


Attachment: 4862.patch

Split hlog: add a .tmp suffix to the file in the recovered.edits directory when 
creating it, and rename it to drop the suffix after close.

ReplayRecoveredEditsIfAny: skip files whose names end with .tmp.
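
A minimal, self-contained sketch of that create-under-.tmp-then-rename pattern 
on the Hadoop FileSystem API (publish and shouldReplay are illustrative names, 
not the patch's identifiers):

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecoveredEditsSketch {
  static final String TMP_SUFFIX = ".tmp";

  // Split side: write under a .tmp name; rename into place only after close,
  // so a concurrent reader can never see (or delete) a half-written file.
  static Path publish(FileSystem fs, Path editsDir, String seqId, byte[] payload)
      throws IOException {
    Path tmp = new Path(editsDir, seqId + TMP_SUFFIX);
    FSDataOutputStream out = fs.create(tmp);
    try {
      out.write(payload);  // stands in for appending the split log entries
    } finally {
      out.close();
    }
    Path done = new Path(editsDir, seqId);
    if (!fs.rename(tmp, done)) {
      throw new IOException("Failed to rename " + tmp + " to " + done);
    }
    return done;
  }

  // Replay side: ignore in-flight files entirely.
  static boolean shouldReplay(Path edits) {
    return !edits.getName().endsWith(TMP_SUFFIX);
  }
}
{code}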


 Split hlog and open region currently happend may cause data loss
 

 Key: HBASE-4862
 URL: https://issues.apache.org/jira/browse/HBASE-4862
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.2
Reporter: chunhui shen
 Attachments: 4862.patch


 Case Description:
 1.The split-hlog thread creates a writer for the file region A/recovered.edits/123456 
 and is appending log entries.
 2.A regionserver is opening region A at the same time; in the process of 
 replayRecoveredEditsIfAny() it will delete the file region 
 A/recovered.edits/123456.
 3.The split-hlog thread catches the IO exception and stops parsing this log 
 file; if skipError = true, it adds the file to the corrupt logs. However, data 
 for other regions in this log file will be lost.
 4.Or, if skipError = false, it checks the filesystem. Of course, the 
 filesystem is fine, so it only prints an error log and continues assigning 
 regions. Therefore, data in other log files will also be lost!!
 The case may happen as follows:
 1.Move a region from server A to server B.
 2.Kill server A and server B.
 3.Restart server A and server B.
 We could prevent this exception by forbidding deletion of a recovered.edits 
 file that the split-hlog thread is still appending to.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4862) Split hlog and open region concurrently happend may cause data loss

2011-11-23 Thread chunhui shen (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chunhui shen updated HBASE-4862:


Summary: Split hlog and open region concurrently happend may cause data 
loss  (was: Split hlog and open region currently happend may cause data loss)

 Split hlog and open region concurrently happend may cause data loss
 ---

 Key: HBASE-4862
 URL: https://issues.apache.org/jira/browse/HBASE-4862
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.2
Reporter: chunhui shen
 Attachments: 4862.patch


 Case Description:
 1.The split-hlog thread creates a writer for the file region A/recovered.edits/123456 
 and is appending log entries.
 2.A regionserver is opening region A at the same time; in the process of 
 replayRecoveredEditsIfAny() it will delete the file region 
 A/recovered.edits/123456.
 3.The split-hlog thread catches the IO exception and stops parsing this log 
 file; if skipError = true, it adds the file to the corrupt logs. However, data 
 for other regions in this log file will be lost.
 4.Or, if skipError = false, it checks the filesystem. Of course, the 
 filesystem is fine, so it only prints an error log and continues assigning 
 regions. Therefore, data in other log files will also be lost!!
 The case may happen as follows:
 1.Move a region from server A to server B.
 2.Kill server A and server B.
 3.Restart server A and server B.
 We could prevent this exception by forbidding deletion of a recovered.edits 
 file that the split-hlog thread is still appending to.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira