from:"gaojinchao \(Commented\) \(JIRA\)"

[jira] [Commented] (HBASE-5615) the master never do balance becauseof balance the parent region

2012-03-24 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237484#comment-13237484
 ] 

gaojinchao commented on HBASE-5615:
---

+1 

 the master never do balance becauseof  balance the parent region
 

 Key: HBASE-5615
 URL: https://issues.apache.org/jira/browse/HBASE-5615
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.7
Reporter: xufeng
Assignee: xufeng
Priority: Critical
 Attachments: HBASE-5615-90.patch, HBASE-5615.patch, 
 NoPatched-surefire-report-5615-90.html, Patched_surefire-report-5615-90.html


 The master never balances because, when it runs rebuildUserRegions(), it 
 adds the parent region to AssignmentManager#servers.
 If the balancer then tries to move the parent region, the parent stays in 
 RIT forever, so balancing is never executed again.
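
The problem described above can be pictured with a small sketch. This is
not the attached patch; RegionSketch, SplitParentFilterSketch and
balanceable are hypothetical names, and treating "offline and split" as
the mark of a parent region is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a region descriptor; not HBase's HRegionInfo.
final class RegionSketch {
    final String name;
    final boolean offline;
    final boolean split;

    RegionSketch(String name, boolean offline, boolean split) {
        this.name = name;
        this.offline = offline;
        this.split = split;
    }

    // Assumption: a split parent is a region that is both offline and split.
    boolean isSplitParent() {
        return offline && split;
    }
}

final class SplitParentFilterSketch {
    // Returns the names of the regions a balancer may be allowed to move.
    static List<String> balanceable(List<RegionSketch> regions) {
        List<String> result = new ArrayList<>();
        for (RegionSketch r : regions) {
            if (r.isSplitParent()) {
                continue; // never hand a split parent to the balancer
            }
            result.add(r.name);
        }
        return result;
    }

    public static void main(String[] args) {
        List<RegionSketch> regions = new ArrayList<>();
        regions.add(new RegionSketch("parent", true, true));
        regions.add(new RegionSketch("daughter-a", false, false));
        regions.add(new RegionSketch("daughter-b", false, false));
        System.out.println(balanceable(regions));
    }
}
```

If the parent never enters the set of movable regions, the balancer
cannot pick it, so it cannot get stuck in RIT for that reason.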

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5488) Fixed OfflineMetaRepair bug

2012-02-29 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13218976#comment-13218976
 ] 

gaojinchao commented on HBASE-5488:
---

Thanks for your review.

 Fixed OfflineMetaRepair bug 
 

 Key: HBASE-5488
 URL: https://issues.apache.org/jira/browse/HBASE-5488
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.90.7, 0.92.1

 Attachments: HBASE-5488-branch92.patch, HBASE-5488-trunk.patch, 
 HBASE-5488_branch90.txt


 I want to use the OfflineMetaRepair tool and found that nobody has fixed 
 this bug. I will make a patch.
  12/01/05 23:23:30 ERROR util.HBaseFsck: Bailed out due to:
  java.lang.IllegalArgumentException: Wrong FS: hdfs://us01-ciqps1-name01.carrieriq.com:9000/hbase/M2M-INTEGRATION-MM_TION-1325190318714/0003d2ede27668737e192d8430dbe5d0/.regioninfo, expected: file:///
  at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:352)
  at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
  at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:368)
  at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
  at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:126)
  at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
  at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398)
  at org.apache.hadoop.hbase.util.HBaseFsck.loadMetaEntry(HBaseFsck.java:256)
  at org.apache.hadoop.hbase.util.HBaseFsck.loadTableInfo(HBaseFsck.java:284)
  at org.apache.hadoop.hbase.util.HBaseFsck.rebuildMeta(HBaseFsck.java:402)
  at org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair.main(OfflineMetaRe
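
The message "Wrong FS: hdfs://..., expected: file:///" indicates that a
path with an hdfs:// scheme was handed to a local filesystem object. The
sketch below illustrates that kind of scheme check without any Hadoop
dependency. WrongFsSketch and checkScheme are hypothetical names, the host
name is made up, and this is not Hadoop's actual checkPath code.

```java
import java.net.URI;

final class WrongFsSketch {
    // Throws if the path names a scheme other than the filesystem's own.
    static void checkScheme(URI fsUri, URI path) {
        String expected = fsUri.getScheme();
        String actual = path.getScheme();
        if (actual != null && !actual.equalsIgnoreCase(expected)) {
            throw new IllegalArgumentException(
                "Wrong FS: " + path + ", expected: " + fsUri);
        }
    }

    public static void main(String[] args) {
        URI local = URI.create("file:///");
        URI regionInfo =
            URI.create("hdfs://namenode.example:9000/hbase/t1/r1/.regioninfo");
        try {
            checkScheme(local, regionInfo); // hdfs path against a local filesystem
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the check passes only when the filesystem object and the
path agree on the scheme, which is the mismatch the trace reports.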


[jira] [Commented] (HBASE-4925) Collect test cases for hadoop/hbase cluster

2012-02-15 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13209115#comment-13209115
 ] 

gaojinchao commented on HBASE-4925:
---

@Thomas

Is your framework available? We have automated part of the test cases, but I 
find the automation is not stable.

 Collect test cases for hadoop/hbase cluster
 ---

 Key: HBASE-4925
 URL: https://issues.apache.org/jira/browse/HBASE-4925
 Project: HBase
  Issue Type: Brainstorming
  Components: test
Reporter: Thomas Pan

 This entry is used to collect all the useful test cases to verify a 
 hadoop/hbase cluster. This is to follow up on yesterday's hack day at 
 Salesforce. Hopefully the information will be very useful to the whole 
 community.


[jira] [Commented] (HBASE-5200) AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the region assignment inconsistent

2012-02-13 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206834#comment-13206834
 ] 

gaojinchao commented on HBASE-5200:
---

+1 for 0.92 and trunk

 AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the 
 region assignment inconsistent
 -

 Key: HBASE-5200
 URL: https://issues.apache.org/jira/browse/HBASE-5200
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.5
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
 Fix For: 0.94.0, 0.90.7, 0.92.1

 Attachments: 5200-v2.txt, HBASE-5200.patch, HBASE-5200_1.patch, 
 TEST-org.apache.hadoop.hbase.master.TestRestartCluster.xml


 This is the scenario:
 Consider a case where the balancer is running and is trying to close regions 
 on an RS.
 Before the close completes, a master switch happens.
 On master switch, the set of nodes that are in RIT is collected; we first 
 get the data and start watching the node.
 After that, the node data is added into RIT.
 By this time (before the node is added to RIT), if the RS on which close was 
 called makes a transition, AM.handleRegion() misses the handling and reports 
 that the RIT state was null.
 {code}
 2012-01-13 10:50:46,358 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
 a66d281d231dfcaea97c270698b26b6f from server 
 HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and 
 not in expected PENDING_CLOSE or CLOSING states
 2012-01-13 10:50:46,358 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
 c12e53bfd48ddc5eec507d66821c4d23 from server 
 HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and 
 not in expected PENDING_CLOSE or CLOSING states
 2012-01-13 10:50:46,358 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
 59ae13de8c1eb325a0dd51f4902d2052 from server 
 HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and 
 not in expected PENDING_CLOSE or CLOSING states
 2012-01-13 10:50:46,359 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
 f45bc9614d7575f35244849af85aa078 from server 
 HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and 
 not in expected PENDING_CLOSE or CLOSING states
 2012-01-13 10:50:46,359 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
 cc3ecd7054fe6cd4a1159ed92fd62641 from server 
 HOST-192-168-47-204,20020,1326342744518 but region was in  the state null and 
 not in expected PENDING_CLOSE or CLOSING states
 2012-01-13 10:50:46,359 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
 3af40478a17fee96b4a192b22c90d5a2 from server 
 HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and 
 not in expected PENDING_CLOSE or CLOSING states
 2012-01-13 10:50:46,359 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
 e6096a8466e730463e10d3d61f809b92 from server 
 HOST-192-168-47-204,20020,1326342744518 but region was in  the state null and 
 not in expected PENDING_CLOSE or CLOSING states
 2012-01-13 10:50:46,359 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
 4806781a1a23066f7baed22b4d237e24 from server 
 HOST-192-168-47-204,20020,1326342744518 but region was in  the state null and 
 not in expected PENDING_CLOSE or CLOSING states
 2012-01-13 10:50:46,359 WARN 
 org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 
 d69e104131accaefe21dcc01fddc7629 from server 
 HOST-192-168-47-205,20020,1326363111288 but region was in  the state null and 
 not in expected PENDING_CLOSE or CLOSING states
 {code}
 In the branch, the CLOSING node is created by the RS, which leads to more 
 inconsistency.
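
The ordering problem in the scenario above can be pictured with a small,
single-threaded sketch. It is only an illustration under stated
assumptions: RitRaceSketch, addToRit and handleClosed are hypothetical
names and do not reproduce AssignmentManager.

```java
import java.util.HashMap;
import java.util.Map;

final class RitRaceSketch {
    // Hypothetical in-memory view of regions in transition: region -> state.
    private final Map<String, String> regionsInTransition = new HashMap<>();

    // Called when the new master has read a node and starts tracking it.
    void addToRit(String region, String state) {
        regionsInTransition.put(region, state);
    }

    // Called when a region server reports CLOSED for a region.
    // Returns true if the event was handled, false if it was dropped.
    boolean handleClosed(String region) {
        String state = regionsInTransition.get(region);
        if (!"PENDING_CLOSE".equals(state) && !"CLOSING".equals(state)) {
            // Corresponds to the warning in the log: the state is null
            // (or otherwise unexpected), so the event is missed.
            return false;
        }
        regionsInTransition.put(region, "CLOSED");
        return true;
    }

    public static void main(String[] args) {
        RitRaceSketch racy = new RitRaceSketch();
        // The watch fires before the node data was added to RIT.
        System.out.println("event before add handled: " + racy.handleClosed("r1"));
        racy.addToRit("r1", "CLOSING");

        RitRaceSketch ordered = new RitRaceSketch();
        // The node data is added first, so the event finds its state.
        ordered.addToRit("r1", "CLOSING");
        System.out.println("event after add handled: " + ordered.handleClosed("r1"));
    }
}
```

The sketch only shows why the order of "start watching" and "add to RIT"
matters; how the attached patches close the window is not shown here.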


[jira] [Commented] (HBASE-5200) AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the region assignment inconsistent

2012-02-11 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206118#comment-13206118
 ] 

gaojinchao commented on HBASE-5200:
---

It seems we need to consider issue HBASE-4739.


[jira] [Commented] (HBASE-5200) AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the region assignment inconsistent

2012-02-11 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206128#comment-13206128
 ] 

gaojinchao commented on HBASE-5200:
---

Other issues:
1. The comment says "If ROOT or .META. table is waiting for timeout...", but 
the code checks isMetaTable, which covers only the .META. table. It seems we 
should use isMetaRegion.
2. In branch 90, does getRegion only get regions from the .META. table? Is 
there a problem when the root region server crashes? Do we reassign the root 
region?



[jira] [Commented] (HBASE-5231) Backport HBASE-3373 (per-table load balancing) to 0.92

2012-01-23 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190945#comment-13190945
 ] 

gaojinchao commented on HBASE-5231:
---

I think we can do that. 
Regarding the log message "Done. Calculated a load balance in ", we can move 
it out of balanceCluster.
Could it move to the code below?

+  for (Map<ServerName, List<HRegionInfo>> assignments :
+      assignmentsByTable.values()) {
+    List<RegionPlan> partialPlans =
+        this.balancer.balanceCluster(assignments);
+    if (partialPlans != null) plans.addAll(partialPlans);
   }
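
A self-contained sketch of the suggestion: call the balancer once per
table and emit a single summary line after the loop instead of one line
per table. The Balancer interface, the String-based plans and the toy
balancer are assumptions made for illustration; they are not the HBase
LoadBalancer API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PerTableBalanceSketch {
    // Hypothetical balancer: takes server -> regions for one table and
    // returns move plans, or null when that table is already balanced.
    interface Balancer {
        List<String> balanceCluster(Map<String, List<String>> assignments);
    }

    // Collects the plans of every table; the caller logs once afterwards.
    static List<String> balanceAllTables(
            Map<String, Map<String, List<String>>> assignmentsByTable,
            Balancer balancer) {
        List<String> plans = new ArrayList<>();
        for (Map<String, List<String>> assignments : assignmentsByTable.values()) {
            List<String> partialPlans = balancer.balanceCluster(assignments);
            if (partialPlans != null) {
                plans.addAll(partialPlans);
            }
        }
        return plans;
    }

    public static void main(String[] args) {
        // Toy balancer: plans one move when server "a" leads "b" by more than one region.
        Balancer toy = assignments -> {
            List<String> onA = assignments.get("a");
            List<String> onB = assignments.get("b");
            if (onA.size() - onB.size() <= 1) {
                return null; // balanced enough, nothing to do
            }
            return Arrays.asList("move " + onA.get(0) + " from a to b");
        };

        Map<String, Map<String, List<String>>> byTable = new LinkedHashMap<>();
        Map<String, List<String>> t1 = new LinkedHashMap<>();
        t1.put("a", Arrays.asList("r1", "r2", "r3"));
        t1.put("b", new ArrayList<String>());
        byTable.put("t1", t1);

        long start = System.currentTimeMillis();
        List<String> plans = balanceAllTables(byTable, toy);
        long elapsed = System.currentTimeMillis() - start;
        // One summary line for the whole run, outside the per-table loop.
        System.out.println("Calculated " + plans.size() + " plan(s) for "
            + byTable.size() + " table(s) in " + elapsed + "ms");
    }
}
```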

 Backport HBASE-3373 (per-table load balancing) to 0.92
 --

 Key: HBASE-5231
 URL: https://issues.apache.org/jira/browse/HBASE-5231
 Project: HBase
  Issue Type: Improvement
Reporter: Zhihong Yu
 Fix For: 0.92.1

 Attachments: 5231.txt


 This JIRA backports per-table load balancing to 0.90


[jira] [Commented] (HBASE-5231) Backport HBASE-3373 (per-table load balancing) to 0.92

2012-01-22 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190899#comment-13190899
 ] 

gaojinchao commented on HBASE-5231:
---

There may be a small problem with logging. Because balanceCluster is called 
for every table, many log lines such as "Skipping load balancing because 
balanced cluster;" will be printed.


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-20 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190299#comment-13190299
 ] 

gaojinchao commented on HBASE-5179:
---

Do you want to add any new case?

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 
 5179-90v16.patch, 5179-90v17.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 
 5179-90v8.patch, 5179-90v9.patch, 5179-92v17.patch, 5179-v11-92.txt, 
 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, 
 hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, 
 hbase-5179v17.patch, hbase-5179v17.patch, hbase-5179v5.patch, 
 hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may occur:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler is processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts to assign the regions of RegionserverA because step 3 
 marked it as a dead server.
 However, during step 4 (assigning regions), ServerShutdownHandler may still 
 be splitting the log, so data loss may occur.
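
The steps above come down to one rule: regions of a dead server should
not be assigned while that server's log is still being split. A minimal
sketch of such a gate follows. LogSplitGateSketch and its methods are
hypothetical names, and this is not how the attached patches implement
the fix.

```java
import java.util.HashSet;
import java.util.Set;

final class LogSplitGateSketch {
    // Servers whose logs are currently being split (hypothetical bookkeeping).
    private final Set<String> splittingServers = new HashSet<>();

    // Called when a shutdown handler starts splitting a server's log.
    synchronized void startLogSplit(String server) {
        splittingServers.add(server);
    }

    // Called when the log split for that server has completed.
    synchronized void finishLogSplit(String server) {
        splittingServers.remove(server);
    }

    // Returns true only when no log split is in progress for the dead server.
    synchronized boolean canAssignRegionsOf(String deadServer) {
        return !splittingServers.contains(deadServer);
    }

    public static void main(String[] args) {
        LogSplitGateSketch gate = new LogSplitGateSketch();
        gate.startLogSplit("RegionserverA");
        System.out.println("assign during split: " + gate.canAssignRegionsOf("RegionserverA"));
        gate.finishLogSplit("RegionserverA");
        System.out.println("assign after split: " + gate.canAssignRegionsOf("RegionserverA"));
    }
}
```

In the scenario above, step 4 would consult such a gate and postpone the
assignment until the shutdown handler reports that the split is done.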


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-20 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190298#comment-13190298
 ] 

gaojinchao commented on HBASE-5179:
---

This is the V16 test result on branch90:

Number  Description                                                    Result
1       A new cluster startup                                          ok
2       Restart a cluster                                              ok
3       No region server crash                                         ok
4       After Meta region server registered, and then crashed          ok
5       After Meta/root region server registered, and then crashed     ok
6       After Hmaster crashed and Meta/root region server crashed,
        Hmaster and region server start at the same time               ok
7       After Hmaster crashed and Meta/root region server crashed,
        Hmaster starts                                                 ok



[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-19 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189140#comment-13189140
 ] 

gaojinchao commented on HBASE-5179:
---

@Chunhui
In 90v12, could the code below throw a NullPointerException if 
metaServerInfo (or rootServerInfo) is null?

  // The meta server may be being processed as a dead server. Before
  // assigning meta, we need to wait until its log is split.
  waitUntilNoLogDir(metaServerInfo.getServerName());
  if (!this.serverManager.isDeadMetaServerInProgress()) {
    this.assignmentManager.assignMeta();
  }
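
One way to avoid the possible null dereference is to guard the lookup
before waiting. The sketch below is only an illustration:
MetaAssignGuardSketch, ServerInfoSketch and serverToWaitFor are
hypothetical names and are not part of the patch.

```java
final class MetaAssignGuardSketch {
    // Hypothetical stand-in for the server info object in the quoted code.
    static final class ServerInfoSketch {
        private final String serverName;

        ServerInfoSketch(String serverName) {
            this.serverName = serverName;
        }

        String getServerName() {
            return serverName;
        }
    }

    // Returns the server whose log directory must be gone before meta is
    // assigned, or null when there is no previous meta server to wait for.
    static String serverToWaitFor(ServerInfoSketch metaServerInfo) {
        if (metaServerInfo == null) {
            return null; // nothing recorded: skip the wait instead of dereferencing null
        }
        return metaServerInfo.getServerName();
    }

    public static void main(String[] args) {
        System.out.println(serverToWaitFor(null));
        System.out.println(serverToWaitFor(new ServerInfoSketch("rs1,20020,1")));
    }
}
```

A caller would then invoke the wait only when the returned name is not
null.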

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v12.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 
 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 
 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 
 5179-v4.txt, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, 
 hbase-5179v8.patch, hbase-5179v9.patch



[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-19 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189148#comment-13189148
 ] 

gaojinchao commented on HBASE-5179:
---

@chunhui
I have only verified the branch90 version. I don't have a cluster for 
branch92 (that will take a few days; we are planning to set up a test 
cluster).


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-19 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189595#comment-13189595
 ] 

gaojinchao commented on HBASE-5179:
---

In my test case, I kill meta/root at the same time. When the master starts to 
assign the root/meta regions, it should already have finished splitting the 
hlogs. 

Today I will give a detailed test report on 90V14.


 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v2.patch, 
 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 
 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 
 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, 
 hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, 
 hbase-5179v8.patch, hbase-5179v9.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may occur:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler is processing it.
 3. The master starts to rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts to assign the regions of RegionserverA because it was 
 considered a dead server in step 3.
 However, while step 4 (assigning regions) is running, ServerShutdownHandler may 
 still be splitting the log. Therefore, it may cause data loss.


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-19 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189600#comment-13189600
 ] 

gaojinchao commented on HBASE-5179:
---

No problem. Before I start to verify the patch, I will review it.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 
 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 
 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 
 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, 
 hbase-5179v8.patch, hbase-5179v9.patch



[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-19 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189641#comment-13189641
 ] 

gaojinchao commented on HBASE-5179:
---

@chunhui
The first test case failed. We started a cluster, and with the patch the master 
split the HLog of a newly started region server.

2012-01-20 00:34:39,462 INFO org.mortbay.log: Started 
SelectChannelConnector@0.0.0.0:20010
2012-01-20 00:34:39,462 DEBUG org.apache.hadoop.hbase.master.HMaster: Started 
service threads
2012-01-20 00:34:40,158 INFO org.apache.hadoop.hbase.master.ServerManager: 
Registering server=C3S32,20020,1327037679721, regionCount=0, userLoad=false
2012-01-20 00:34:40,296 INFO org.apache.hadoop.hbase.master.ServerManager: 
Registering server=C3S33,20020,1327037679059, regionCount=0, userLoad=false
2012-01-20 00:34:40,488 INFO org.apache.hadoop.hbase.master.ServerManager: 
Registering server=C3S31,20020,1327037679673, regionCount=0, userLoad=false
2012-01-20 00:34:40,962 INFO org.apache.hadoop.hbase.master.ServerManager: 
Waiting on regionserver(s) count to settle; currently=3
2012-01-20 00:34:42,462 INFO org.apache.hadoop.hbase.master.ServerManager: 
Finished waiting for regionserver count to settle; count=3, sleptFor=3000
2012-01-20 00:34:42,463 INFO org.apache.hadoop.hbase.master.ServerManager: 
Exiting wait on regionserver(s) to checkin; count=3, stopped=false, count of 
regions out on cluster=0
2012-01-20 00:34:42,463 INFO org.apache.hadoop.hbase.master.HMaster: 
--sleep 60s-
2012-01-20 00:35:42,469 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder hdfs://C3S31:9000/hbase/.logs/C3S31,20020,1327037679673 belongs to 
an existing region server
2012-01-20 00:35:42,470 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder hdfs://C3S31:9000/hbase/.logs/C3S32,20020,1327037679721 belongs to 
an existing region server
2012-01-20 00:35:42,470 INFO org.apache.hadoop.hbase.master.MasterFileSystem: 
Log folder hdfs://C3S31:9000/hbase/.logs/C3S33,20020,1327037679059 belongs to 
an existing region server
2012-01-20 00:35:42,504 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: 
Failed verification of -ROOT-,,0 at address=C3S32:20020; 
org.apache.hadoop.hbase.NotServingRegionException: 
org.apache.hadoop.hbase.NotServingRegionException: Region is not online: 
-ROOT-,,0
2012-01-20 00:36:42,610 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled 
exception. Starting shutdown.
java.lang.RuntimeException: Timed out waiting to finish splitting log for 
C3S32,20020,1327037679721
at 
org.apache.hadoop.hbase.master.HMaster.waitUntilNoLogDir(HMaster.java:578)
at 
org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:478)
at 
org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:422)
at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
2012-01-20 00:36:42,613 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2012-01-20 00:36:42,613 DEBUG org.apache.hadoop.hbase.master.HMaster: Stopping 
service threads
2012-01-20 00:36:42,613 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server 
on 2

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 
 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 
 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 
 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, 
 hbase-5179v8.patch, hbase-5179v9.patch



[jira] [Commented] (HBASE-5231) Backport HBASE-3373 (per-table load balancing) to 0.92

2012-01-19 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189650#comment-13189650
 ] 

gaojinchao commented on HBASE-5231:
---

It seems like a small change. +1

 Backport HBASE-3373 (per-table load balancing) to 0.92
 --

 Key: HBASE-5231
 URL: https://issues.apache.org/jira/browse/HBASE-5231
 Project: HBase
  Issue Type: Improvement
Reporter: Zhihong Yu
 Attachments: 5231.txt


 This JIRA backports per-table load balancing to 0.90


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-18 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188938#comment-13188938
 ] 

gaojinchao commented on HBASE-5179:
---

+1, Good job! 

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 
 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 
 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v10.patch, hbase-5179v5.patch, 
 hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch



[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-18 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188939#comment-13188939
 ] 

gaojinchao commented on HBASE-5179:
---

@chunhui
Do you want me to test this patch in our cluster?

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 
 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 
 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v10.patch, hbase-5179v5.patch, 
 hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch



[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-16 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186770#comment-13186770
 ] 

gaojinchao commented on HBASE-5179:
---

@chunhui
There may be a problem. The shutdown handler thread pool has 3 threads by 
default. If more than 3 dead servers are being processed, we will wait 
forever.
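
The concern can be illustrated with a small sketch. It assumes (this is not taken from the patch) that every handler blocks until some later event, and that the event can only be produced by another task queued on the same fixed-size pool. The class name, method name, and timeout below are hypothetical and exist only for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolStarvationSketch {
    /**
     * Fills a fixed pool with poolSize blocking tasks, queues one extra task,
     * and reports whether the extra task got a thread within waitMillis.
     */
    static boolean queuedTaskRan(int poolSize, long waitMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CountDownLatch release = new CountDownLatch(1);  // what the handlers wait for
        CountDownLatch extraRan = new CountDownLatch(1); // counted down when the extra task runs
        try {
            for (int i = 0; i < poolSize; i++) {
                pool.submit(() -> {
                    try {
                        release.await();                 // handler blocks while holding its thread
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The only task that could release the handlers is stuck in the queue.
            pool.submit(() -> {
                extraRan.countDown();
                release.countDown();
            });
            return extraRan.await(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();                          // interrupt blocked handlers so the JVM can exit
        }
    }

    public static void main(String[] args) {
        System.out.println("extra task ran: " + queuedTaskRan(3, 200)); // prints "extra task ran: false"
    }
}
```

With 3 threads and 3 blocked handlers, the fourth task never starts. Without the timeout in the sketch, the wait would not end.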



 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 
 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, 
 hbase-5179v6.patch, hbase-5179v7.patch



[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-16 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186786#comment-13186786
 ] 

gaojinchao commented on HBASE-5179:
---

@chunhui
In the normal flow, METAServerShutdownHandler uses a different thread pool. 
Only in the init flow are there some cases where we can't distinguish the meta 
region server.


 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 
 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, 
 hbase-5179v6.patch, hbase-5179v7.patch



[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-16 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186928#comment-13186928
 ] 

gaojinchao commented on HBASE-5179:
---

In patch v7, can we replace the processing of the expired server with a call to 
public void splitLog(final String serverName)?

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 
 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, 
 hbase-5179v6.patch, hbase-5179v7.patch



[jira] [Commented] (HBASE-4191) hbase load balancer needs locality awareness

2012-01-16 Thread gaojinchao (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186941#comment-13186941
]

gaojinchao commented on HBASE-4191:
---

@Liyin
This is a good feature. How is it progressing now?

hbase load balancer needs locality awareness

Key: HBASE-4191
URL: https://issues.apache.org/jira/browse/HBASE-4191
Project: HBase
Issue Type: New Feature
Reporter: Ted Yu
Assignee: Liyin Tang

Previously, HBASE-4114 implements the metrics for HFile HDFS block locality,
which provides the HFile level locality information.
But in order to work with load balancer and region assignment, we need the
region level locality information.
Let's define the region locality information first, which is almost the same
as HFile locality index.
HRegion locality index (HRegion A, RegionServer B) =
(Total number of HDFS blocks that can be retrieved locally by the
RegionServer B for the HRegion A) / ( Total number of the HDFS blocks for the
Region A)
So the HRegion locality index tells us how much locality we can get if
the HMaster assigns the HRegion A to the RegionServer B.
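
As a minimal sketch of the definition above: the index is the fraction of a region's HDFS blocks that have a replica on the candidate region server's host. The class name, method name, and input shape (one set of replica hosts per block) are assumptions made for illustration. They are not HBase or HDFS APIs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LocalityIndexSketch {
    /**
     * blockHosts holds, for each HDFS block of the region, the hosts storing a replica.
     * Returns local blocks / total blocks for the given region server host.
     */
    static double localityIndex(List<Set<String>> blockHosts, String regionServerHost) {
        if (blockHosts.isEmpty()) {
            return 0.0; // assumption: a region with no blocks has index 0
        }
        int local = 0;
        for (Set<String> hosts : blockHosts) {
            if (hosts.contains(regionServerHost)) {
                local++;
            }
        }
        return (double) local / blockHosts.size();
    }

    public static void main(String[] args) {
        List<Set<String>> blocks = Arrays.asList(
            new HashSet<>(Arrays.asList("rsA", "rsB")),
            new HashSet<>(Arrays.asList("rsB", "rsC")),
            new HashSet<>(Arrays.asList("rsA", "rsC")),
            new HashSet<>(Arrays.asList("rsB", "rsD")));
        System.out.println(localityIndex(blocks, "rsB")); // prints 0.75
    }
}
```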
So there will be 2 steps involved to assign regions based on the locality.
1) At cluster startup, the master will scan HDFS to calculate the HRegion
locality index for each pair of HRegion and Region Server. Scanning the DFS is
pretty expensive, so we only need to do this once, at startup.
2) During the cluster run time, each region server will update the HRegion
locality index as metrics periodically as HBASE-4114 did. The Region Server
can expose them to the Master through ZK, meta table, or just RPC messages.
Based on the HRegion locality index, the assignment manager in the master
would have a global knowledge about the region locality distribution and can
run the MIN COST MAXIMUM FLOW solver to reach the global optimization.
Let's construct the graph first:
[Graph]
Imagine a bipartite graph in which the left side is the set of regions
and the right side is the set of region servers.
There is a source node which links itself to each node in the region set.
There is a sink node which is linked from each node in the region server set.
[Capacity]
The capacity between the source node and region nodes is 1.
And the capacity between the region nodes and region server nodes is also 1.
(The purpose is that each region can ONLY be assigned to one region server at a
time.)
The capacity between the region server nodes and the sink node is the average
number of regions that should be assigned to each region server.
(The purpose is to balance the load across region servers.)
[Cost]
The cost between each region and region server is the opposite of the locality
index: the higher the locality when region A is assigned to region server B,
the lower the cost.
The cost function could be more sophisticated when we take more metrics into
account.
So after running the min-cost max flow solver, the master could assign the
regions based on the global locality optimization.
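
The graph, capacities, and costs described above can be sketched as follows. This is an illustration only, not the proposed implementation: it uses a plain successive-shortest-path solver with Bellman-Ford, and it scales the cost to the integer `100 - round(100 * locality)` so costs stay non-negative. The class and method names and that cost scaling are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LocalityAssignmentSketch {
    // Residual graph: edges i and i^1 form a forward/reverse pair; each is {to, capacity, cost}.
    private final List<int[]> edges = new ArrayList<>();
    private final List<List<Integer>> adj = new ArrayList<>();

    private LocalityAssignmentSketch(int nodes) {
        for (int i = 0; i < nodes; i++) adj.add(new ArrayList<>());
    }

    private void addEdge(int from, int to, int cap, int cost) {
        adj.get(from).add(edges.size());
        edges.add(new int[] {to, cap, cost});
        adj.get(to).add(edges.size());
        edges.add(new int[] {from, 0, -cost});
    }

    /** Pushes one unit of flow along a cheapest path; returns false if the sink is unreachable. */
    private boolean augment(int source, int sink) {
        int n = adj.size();
        int[] dist = new int[n];
        int[] via = new int[n];
        Arrays.fill(dist, Integer.MAX_VALUE);
        dist[source] = 0;
        boolean changed = true;
        for (int round = 0; round < n && changed; round++) { // Bellman-Ford relaxation
            changed = false;
            for (int u = 0; u < n; u++) {
                if (dist[u] == Integer.MAX_VALUE) continue;
                for (int id : adj.get(u)) {
                    int[] e = edges.get(id);
                    if (e[1] > 0 && dist[u] + e[2] < dist[e[0]]) {
                        dist[e[0]] = dist[u] + e[2];
                        via[e[0]] = id;
                        changed = true;
                    }
                }
            }
        }
        if (dist[sink] == Integer.MAX_VALUE) return false;
        for (int v = sink; v != source; v = edges.get(via[v] ^ 1)[0]) {
            edges.get(via[v])[1]--;     // use one unit on the path edge
            edges.get(via[v] ^ 1)[1]++; // and allow it to be undone later
        }
        return true;
    }

    /** locality[r][s] is the locality index of region r on server s; returns the server chosen per region. */
    static int[] assign(double[][] locality, int perServerCapacity) {
        int regions = locality.length, servers = locality[0].length;
        int source = 0, sink = regions + servers + 1;
        LocalityAssignmentSketch g = new LocalityAssignmentSketch(sink + 1);
        for (int r = 0; r < regions; r++) {
            g.addEdge(source, 1 + r, 1, 0);                       // source -> region, capacity 1
            for (int s = 0; s < servers; s++) {
                int cost = 100 - (int) Math.round(100 * locality[r][s]);
                g.addEdge(1 + r, 1 + regions + s, 1, cost);       // region -> server, capacity 1
            }
        }
        for (int s = 0; s < servers; s++) {
            g.addEdge(1 + regions + s, sink, perServerCapacity, 0); // server -> sink
        }
        while (g.augment(source, sink)) { /* keep pushing flow */ }
        int[] result = new int[regions];
        Arrays.fill(result, -1);
        for (int r = 0; r < regions; r++) {
            for (int id : g.adj.get(1 + r)) {
                int[] e = g.edges.get(id);
                if ((id & 1) == 0 && e[1] == 0) result[r] = e[0] - 1 - regions; // saturated forward edge
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] locality = {{0.9, 0.8}, {0.9, 0.1}};
        System.out.println(Arrays.toString(assign(locality, 1))); // prints [1, 0]
    }
}
```

In the example, both regions prefer server 0, but only one fits. The solver gives server 0 to region 1, which would lose far more locality elsewhere, and sends region 0 to server 1.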
Also the master should share this global view with the secondary master in case
a master failover happens.
In addition, HBASE-4491 (Locality Checker) is a tool, based on the same
metrics, that proactively scans the DFS to calculate the global locality
information in the cluster. It will help us to verify data locality
information at run time.


[jira] [Commented] (HBASE-3724) Load balancer improvements

2012-01-16 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186959#comment-13186959
 ] 

gaojinchao commented on HBASE-3724:
---

I found that the balance algorithm in branch 0.92 does not work for our scenario. 
So I am using this issue to hang all balancer-related issues on. If someone 
wants to see them, 
it will be easy.


 Load balancer improvements
 --

 Key: HBASE-3724
 URL: https://issues.apache.org/jira/browse/HBASE-3724
 Project: HBase
  Issue Type: Umbrella
Reporter: stack

 Umbrella issue under which we hang all regions related to balancer


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-16 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187415#comment-13187415
 ] 

gaojinchao commented on HBASE-5179:
---

@chunhui
Does this code make sense to you ?

  if (!this.serverManager.isDeadMetaServerInProgress()) {
    if (metaServerInfo != null) {
      this.fileSystemManager.splitLog(metaServerInfo.getServerName());
    }
    this.assignmentManager.assignMeta();
    assigned++;
  }

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 
 5179-90v8.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch



[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-16 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187417#comment-13187417
 ] 

gaojinchao commented on HBASE-5179:
---

+1 on v8. We need some test cases to verify it in a real cluster.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 
 5179-90v8.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch



[jira] [Commented] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

2012-01-16 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187421#comment-13187421
 ] 

gaojinchao commented on HBASE-5202:
---

See https://issues.apache.org/jira/browse/HBASE-5179. It may resolve 
this issue.

 NPE during Master failover in master.AssignmentManager.regionOnline()
 -

 Key: HBASE-5202
 URL: https://issues.apache.org/jira/browse/HBASE-5202
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6
Reporter: Eugene Koontz
Assignee: Eugene Koontz
 Attachments: HBASE-5202.patch, testMasterFailoverWithSlowRS.txt


 The following NPE can occur during master failover:
 {code}
 2012-01-15 17:45:00,314 FATAL 
 [Master:1;ip-10-166-123-193.us-west-1.compute.internal:36708] 
 master.HMaster(944): Unhandled exception. Starting shutdown.
 java.lang.NullPointerException
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:724)
 at 
 org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
 at 
 org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:396)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:279)
 at java.lang.Thread.run(Thread.java:636)
 {code}
 This is caused by regionOnline() being passed a null serverInfo (its second 
 parameter). 
 The AssignmentManager's processFailover() method passes a null to 
 regionOnline() because the value that it passes, hsi, is set as:
 {code}
 hsi = 
 this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
 {code}
 and
  
 {code}
 hsi = 
 this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
 {code}
 getHServerInfo() is defined as:
 {code}
   public HServerInfo getHServerInfo(final HServerAddress hsa) {
     synchronized(this.onlineServers) {
       // TODO: This is primitive.  Do a better search.
       for (Map.Entry<String, HServerInfo> e: this.onlineServers.entrySet()) {
         if (e.getValue().getServerAddress().equals(hsa)) {
           return e.getValue();
         }
       }
     }
     return null;
   }
 {code}
 This will return null if the onlineServers map does not yet have a value 
 corresponding to the key supplied by the catalogTracker's getRootLocation() 
 or getMetaLocation(). 
 Since the catalogTracker uses zookeeper to establish the server locations of 
 {{-ROOT-}} and {{.META.}}, while the onlineServers map is populated as 
 these servers register with the master, there can be an inconsistency 
 between the catalogTracker and the onlineServers if either of these 
 regionservers is online with respect to zookeeper but hasn't yet registered 
 with the master (perhaps due to a high latency network between the master and 
 the regionserver).
 The attached testMasterFailoverWithSlowRS.txt patch can be used to modify 
 TestMasterFailover to cause this NPE. 
 The proposed fix (provided along with the above test in a separate 
 attachment) is for the master to use the new verifyMetaTablesAreUp() to wait 
 for both of the servers named by the catalog tracker's getRootLocation() and 
 getMetaLocation() to register with the master before the master can continue 
 with failover.
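
 The idea of waiting for registration can be sketched as a bounded polling loop. This is an illustration only, not the attached patch and not the real verifyMetaTablesAreUp(): the class, the map of plain strings standing in for onlineServers, and the method waitForServer are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class WaitForRegistrationSketch {
    // Stand-in for the master's onlineServers map: server address -> server name.
    static final Map<String, String> onlineServers = new ConcurrentHashMap<>();

    /**
     * Polls until the address is registered or timeoutMillis elapses.
     * Returns the registered server name, or null on timeout or interrupt.
     */
    static String waitForServer(String address, long timeoutMillis, long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            String server = onlineServers.get(address);
            if (server != null) {
                return server;
            }
            if (System.currentTimeMillis() >= deadline) {
                return null; // caller must still handle the "never registered" case
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }

    public static void main(String[] args) {
        // Simulate a region server that registers a little late.
        new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            onlineServers.put("rs1:60020", "rs1,60020,1");
        }).start();
        System.out.println(waitForServer("rs1:60020", 2000, 10));
    }
}
```

 The timeout keeps the master from blocking forever if the server never registers. In that case the caller still sees null and has to decide what to do.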


[jira] [Commented] (HBASE-3724) Load balancer improvements

2012-01-16 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187441#comment-13187441
 ] 

gaojinchao commented on HBASE-3724:
---

@zhihong


In my scenario:

1. One region server has more than 1,000 regions (HDFS disk capacity (12 TB) / 
3 (replication) / region size (2 GB)).

2. At any given moment, dozens of regions per region server are serving put 
operations.

I went through the balancer code and found that the current balance algorithm 
does not work for our scenarios, for example:
1) When one new machine is added to our cluster, all of the hot (actively 
written) regions may move to it.
2) When one RS restarts, all of the hot (actively written) regions may move to 
that machine.

 Load balancer improvements
 --

 Key: HBASE-3724
 URL: https://issues.apache.org/jira/browse/HBASE-3724
 Project: HBase
  Issue Type: Umbrella
Reporter: stack

 Umbrella issue under which we hang all regions related to balancer


[jira] [Commented] (HBASE-3724) Load balancer improvements

2012-01-16 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187456#comment-13187456
 ] 

gaojinchao commented on HBASE-3724:
---

For example:
we have a 10-node cluster with 1000 regions per node, 50 of which are hot.

Under the current algorithm, if we add a new machine to this cluster, will all 
50 hot regions be moved to the new machine?



 Load balancer improvements
 --

 Key: HBASE-3724
 URL: https://issues.apache.org/jira/browse/HBASE-3724
 Project: HBase
  Issue Type: Umbrella
Reporter: stack

 Umbrella issue under which we hang all regions related to balancer


[jira] [Commented] (HBASE-3724) Load balancer improvements

2012-01-16 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187464#comment-13187464
 ] 

gaojinchao commented on HBASE-3724:
---

Thanks, I am looking forward to seeing this part of the code.

Sorry, I didn't express myself clearly. 

In my scenario:
1. There is a 10-node cluster with 1000 regions per node.
2. We add a new machine to the cluster.
3. The balancer needs to move 1000 * 10 / 11 = 909 regions to the new machine. 
Each region server will move both 45 hot regions and 45 cold regions to the new 
one. In this case, will all hot regions move to this new machine?
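
The numbers in this scenario can be checked with a small sketch. The class and method names are hypothetical and this is not HBase code. It only reproduces the arithmetic, using 50 hot regions per node as in the earlier example, and it assumes the balancer equalizes region counts without looking at load.

```java
// Hypothetical helper, not HBase code: it only reproduces the arithmetic of
// the scenario and assumes the balancer equalizes region counts, ignoring load.
final class BalanceScenario {

    /** Average regions per server once newServers empty servers have joined. */
    static int targetPerServer(int servers, int regionsPerServer, int newServers) {
        return servers * regionsPerServer / (servers + newServers);
    }

    /** Regions each existing server has to give up to reach the target. */
    static int shedPerServer(int regionsPerServer, int target) {
        return regionsPerServer - target;
    }

    /** Hot regions arriving on the new server if each old server sends hotMoved of them. */
    static int hotOnNewServer(int servers, int hotMoved) {
        return servers * hotMoved;
    }

    public static void main(String[] args) {
        int servers = 10, regionsPerServer = 1000, hotPerServer = 50;
        int target = targetPerServer(servers, regionsPerServer, 1);
        int shed = shedPerServer(regionsPerServer, target);
        // Half hot, half cold per old server, as assumed in the comment.
        int hotMoved = Math.min(hotPerServer, shed / 2);
        System.out.println("target per server: " + target);
        System.out.println("shed per old server: " + shed);
        System.out.println("hot regions on new server: "
                + hotOnNewServer(servers, hotMoved) + " of " + servers * hotPerServer);
    }
}
```

With these inputs, each old server gives up 91 regions, and 450 of the cluster's 500 hot regions would end up on the new machine.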




 Load balancer improvements
 --

 Key: HBASE-3724
 URL: https://issues.apache.org/jira/browse/HBASE-3724
 Project: HBase
  Issue Type: Umbrella
Reporter: stack

 Umbrella issue under which we hang all regions related to balancer


[jira] [Commented] (HBASE-3724) Load balancer improvements

2012-01-16 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187488#comment-13187488
 ] 

gaojinchao commented on HBASE-3724:
---

Yes, you are right.
We can't increase the limit to close to 1000 regions. Another reason is that 
when there are too many hot regions, many small HFiles will be produced.
I have a solution for this case: I will develop a new balance algorithm for 
our scenario.


 Load balancer improvements
 --

 Key: HBASE-3724
 URL: https://issues.apache.org/jira/browse/HBASE-3724
 Project: HBase
  Issue Type: Umbrella
Reporter: stack

 Umbrella issue under which we hang all regions related to balancer


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186675#comment-13186675
 ] 

gaojinchao commented on HBASE-5179:
---

@chunhui
I agree with you. In the 0.90 branch, I want to add a flag that marks when SSH 
(ServerShutdownHandler) has finished splitting the HLog. 
If the HLogs of all dead servers have been split, data loss should be quite rare.
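
A minimal sketch of such a flag, assuming one boolean per dead server. The class and method names are hypothetical and are not taken from any HBase patch. The idea is that regions of a dead server become assignable only once its log split is marked as finished.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch, not HBase code: one "split finished" flag per dead server.
final class DeadServerSplitFlags {
    // server name -> TRUE once its HLog has been split
    private final ConcurrentMap<String, Boolean> splitDone = new ConcurrentHashMap<>();

    /** Records a server as dead; its log split has not finished yet. */
    void markDead(String server) {
        splitDone.putIfAbsent(server, Boolean.FALSE);
    }

    /** Called by the shutdown handling once the server's HLog is fully split. */
    void markSplitFinished(String server) {
        splitDone.put(server, Boolean.TRUE);
    }

    /** Regions of a dead server may be assigned only after its split has finished. */
    boolean canAssignRegionsOf(String server) {
        return Boolean.TRUE.equals(splitDone.get(server));
    }

    /** True when no known dead server is still waiting for its log split. */
    boolean allSplitsFinished() {
        return !splitDone.containsValue(Boolean.FALSE);
    }
}
```

In this sketch a failover routine would check canAssignRegionsOf() before assigning a dead server's regions, instead of assigning them as soon as the server is seen as dead.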


 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, 
 hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may occur:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts to assign the regions of RegionserverA because it is a 
 dead server according to step 3.
 However, during step 4 (assigning regions), ServerShutdownHandler may still be 
 splitting the log, so data loss may occur.


[jira] [Commented] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()

2012-01-15 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186681#comment-13186681
 ] 

gaojinchao commented on HBASE-5202:
---

The root cause of this issue is that some region servers register late.

When a region server that does not host META/ROOT registers after 
rebuildUserRegions has finished, the regions on it will be opened twice.


 NPE during Master failover in master.AssignmentManager.regionOnline()
 -

 Key: HBASE-5202
 URL: https://issues.apache.org/jira/browse/HBASE-5202
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6
Reporter: Eugene Koontz
Assignee: Eugene Koontz
 Attachments: HBASE-5202.patch, testMasterFailoverWithSlowRS.txt


 The following NPE can occur during master failover:
 {code}
 2012-01-15 17:45:00,314 FATAL [Master:1;ip-10-166-123-193.us-west-1.compute.internal:36708] master.HMaster(944): Unhandled exception. Starting shutdown.
 java.lang.NullPointerException
     at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:724)
     at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
     at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:396)
     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:279)
     at java.lang.Thread.run(Thread.java:636)
 {code}
 This is caused by regionOnline() being passed a null serverInfo (its second 
 parameter). 
 The AssignmentManager's processFailover() method is passing a null to 
 regionOnline() because the value that it is passing, hsi, is set as:
 {code}
 hsi = this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
 {code}
 and
 {code}
 hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
 {code}
 getHServerInfo() is defined as:
 {code}
   public HServerInfo getHServerInfo(final HServerAddress hsa) {
     synchronized(this.onlineServers) {
       // TODO: This is primitive.  Do a better search.
       for (Map.Entry<String, HServerInfo> e: this.onlineServers.entrySet()) {
         if (e.getValue().getServerAddress().equals(hsa)) {
           return e.getValue();
         }
       }
     }
     return null;
   }
 {code}
 This will return null if the onlineServers map does not yet have an entry 
 whose server address matches the address returned by the catalogTracker's 
 getRootLocation() or getMetaLocation(). 
 Since the catalogTracker uses ZooKeeper to establish the server locations of 
 {{-ROOT-}} and {{.META.}}, while the onlineServers map is populated as 
 these servers register with the master, there can be an inconsistency 
 between the catalogTracker and the onlineServers map if either of these 
 regionservers is online with respect to ZooKeeper but hasn't yet registered 
 with the master (perhaps due to a high-latency network between the master and 
 the regionserver).
 The attached testMasterFailoverWithSlowRS.txt patch can be used to modify 
 TestMasterFailover to cause this NPE. 
 The proposed fix (provided along with the above test in a separate 
 attachment) is for the master to use the new verifyMetaTablesAreUp() to wait 
 for both of the servers named by the catalog tracker's getRootLocation() and 
 getMetaLocation() to register with the master before the master can continue 
 with failover.
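
The waiting approach described above can be sketched as follows. This is a simplified, self-contained illustration under stated assumptions, not the attached patch: ServerRegistry, findByAddress() and waitForRegistration() are hypothetical names, and plain strings stand in for HServerInfo and HServerAddress.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the master's registry of online servers; not the HBase API.
final class ServerRegistry {
    // server name -> server address
    private final Map<String, String> onlineServers = new ConcurrentHashMap<>();

    /** Called when a region server registers with the master. */
    void register(String serverName, String address) {
        onlineServers.put(serverName, address);
    }

    /** Linear scan by address; returns null when no registered server matches. */
    String findByAddress(String address) {
        for (Map.Entry<String, String> e : onlineServers.entrySet()) {
            if (e.getValue().equals(address)) {
                return e.getKey();
            }
        }
        return null;
    }

    /** Polls until a server with this address registers; returns null on timeout or interrupt. */
    String waitForRegistration(String address, long timeoutMs, long pollMs) {
        long start = System.nanoTime();
        long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            String name = findByAddress(address);
            if (name != null) {
                return name;
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return null;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return null;
            }
        }
    }
}
```

A caller still has to handle the null returned on timeout explicitly; the point of the sketch is only that the lookup result is checked before it is passed on.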


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186689#comment-13186689
 ] 

gaojinchao commented on HBASE-5179:
---

@chunhui
Root is fine. 
But in the initialization phase, the Meta flag is always false, 
so this.serverManager.isDeadMetaServerInProgress is ineffective.



 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186695#comment-13186695
 ] 

gaojinchao commented on HBASE-5179:
---

Yes, it is always false in the initialization phase.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186701#comment-13186701
 ] 

gaojinchao commented on HBASE-5179:
---

@chunhui
I thought about what you said. I wanted to get it from the root table, but 
there are also some problems with that, for example:
1. Root and meta are on the same machine.
2. Root is down, and so on.

I haven't found a good way to do this.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186704#comment-13186704
 ] 

gaojinchao commented on HBASE-5179:
---

+1 on introducing trunk's logic

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186708#comment-13186708
 ] 

gaojinchao commented on HBASE-5179:
---

This logic is very complex. I thought about it for a few days and did not find 
a good way to solve all the problems.


 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186720#comment-13186720
 ] 

gaojinchao commented on HBASE-5179:
---

You can test this case:

1. Create some tables.

2. Restart HMaster. After it gets knownServers, wait 60s (I added some code in 
HMaster):

LOG.info("+++sleep 60+++");
Thread.sleep(60000);

3. Kill (kill -9) the meta region server when the HMaster log prints 
+++sleep 60+++.

4. The master gets the event that the meta region server is down. Before 
splitting the HLog, sleep 90s.

5. Check the tables after the master finishes initialization.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186725#comment-13186725
 ] 

gaojinchao commented on HBASE-5179:
---

I configured the ZooKeeper session timeout (zk.session) as 40s, so I designed 
this case to sleep for 60s.


 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-15 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186726#comment-13186726
 ] 

gaojinchao commented on HBASE-5179:
---

@chunhui
It is fine. After you finish, I will also test it in our cluster.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, 
 hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-14 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186142#comment-13186142
 ] 

gaojinchao commented on HBASE-5179:
---

Please resolve this issue first. HBASE-4748 may need a long time.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-14 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186155#comment-13186155
 ] 

gaojinchao commented on HBASE-5179:
---

@zhihong
Regarding 5179-90v2.patch: 
when dead servers are being processed, their logs would be split by 
ServerShutdownHandler.
This change may result in meta data loss. 
For example, without the patch:
1. Split the HLog of region servers that don't report themselves (e.g. in my 
comment @14/Jan/12 05:38, cases 1 and 2 will not report).

2. Assign the Meta/Root region (if the meta region server has some exception, 
there is no meta data loss because we have already split the HLog).




 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-14 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186157#comment-13186157
 ] 

gaojinchao commented on HBASE-5179:
---

@zhihong
I am sorry for submitting my comments late. I am testing patch v2. 

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-14 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186161#comment-13186161
 ] 

gaojinchao commented on HBASE-5179:
---

For now, this is only my guess.
I am writing some code to reproduce this scenario. If I find out more, 
I will let you know.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-14 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186208#comment-13186208
 ] 

gaojinchao commented on HBASE-5179:
---

I have reproduced this case.
To simulate a dead server that is still being processed:
1. Create some tables.
2. Restart HMaster; after it gets the region servers, wait 60s (I added some code in 
HMaster):
  int regionCount = this.serverManager.waitForRegionServers();
  LOG.info("sleep 60");
  Thread.sleep(60000);
3. Kill the region server that carries the meta region.
4. The master receives the event that the meta region server is down.
5. The master assigns the meta table.
6. SSH starts to split the HLog (some meta edits will be lost).
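
The ordering problem in the steps above can be sketched as follows. This is only a minimal illustration under my own assumptions, not HBase code and not the attached patch: DeadServerGate and its three methods are hypothetical names. The idea is that the regions of a dead server are assigned only after log splitting for that server has finished.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical gate: regions of a dead server may be assigned only after its log is split. */
final class DeadServerGate {
  // One latch per dead server whose log splitting is still in progress.
  private final ConcurrentHashMap<String, CountDownLatch> splitting =
      new ConcurrentHashMap<String, CountDownLatch>();

  /** Called when the shutdown handler starts splitting the server's log. */
  void markSplitStarted(String serverName) {
    splitting.putIfAbsent(serverName, new CountDownLatch(1));
  }

  /** Called when log splitting for the server has completed; releases any waiters. */
  void markSplitFinished(String serverName) {
    CountDownLatch latch = splitting.remove(serverName);
    if (latch != null) {
      latch.countDown();
    }
  }

  /** Returns true if assignment is safe, false if splitting is still running after the timeout. */
  boolean awaitSplitFinished(String serverName, long timeoutMs) {
    CountDownLatch latch = splitting.get(serverName);
    if (latch == null) {
      return true; // no split in progress for this server
    }
    try {
      return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

In this sketch the assignment path would call awaitSplitFinished before assigning, and retry or give up when it returns false.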

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, 
 hbase-5179v5.patch




[jira] [Commented] (HBASE-3933) Hmaster throw NullPointerException

2012-01-13 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186097#comment-13186097
 ] 

gaojinchao commented on HBASE-3933:
---

@Eugene 
Yes, the issue exists in branch 0.90. I work around it by increasing 
hbase.master.wait.on.regionservers.timeout.

 Hmaster throw NullPointerException
 --

 Key: HBASE-3933
 URL: https://issues.apache.org/jira/browse/HBASE-3933
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: gaojinchao
Assignee: Eugene Koontz
 Attachments: HBASE-3933.patch, Hmastersetup0.90


 NullPointerException while HMaster is starting.
 {code}
 java.lang.NullPointerException
   at java.util.TreeMap.getEntry(TreeMap.java:324)
   at java.util.TreeMap.get(TreeMap.java:255)
   at org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512)
   at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606)
   at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
   at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402)
   at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
 {code}


[jira] [Commented] (HBASE-3933) Hmaster throw NullPointerException

2012-01-13 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186098#comment-13186098
 ] 

gaojinchao commented on HBASE-3933:
---

@Eugene 
Your patch only deals with the region server that carries root/meta. If an 
ordinary region server registers late, the master will process it as a dead 
one, and some regions on that late server will be opened twice.

 Hmaster throw NullPointerException
 --

 Key: HBASE-3933
 URL: https://issues.apache.org/jira/browse/HBASE-3933
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.6
Reporter: gaojinchao
Assignee: Eugene Koontz
 Attachments: HBASE-3933.patch, Hmastersetup0.90




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-13 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186099#comment-13186099
 ] 

gaojinchao commented on HBASE-5179:
---

When the master starts, in the current flow the region servers fall into three groups:
1. Some have registered with HMaster through their heartbeat reports.
2. Some are dead servers being processed by SSH.
3. Some are waiting for their ZK session to expire (the default session timeout is 3 minutes).

Groups 1 and 2 are easy.
Group 3 is a little difficult.
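
A small sketch of the three groups, to make the discussion concrete. All names here (RsState, RsClassifier, the two sets) are hypothetical and only for illustration; the real master tracks this state in several places.

```java
import java.util.Set;

/** Hypothetical labels for the three groups of region servers seen at master startup. */
enum RsState { REGISTERED, DEAD_IN_PROGRESS, AWAITING_ZK_EXPIRY }

final class RsClassifier {
  /**
   * Classifies a server known from a previous master run.
   * registered: servers that have already reported in by heartbeat.
   * deadInProgress: servers currently handled by the shutdown handler.
   * Anything else is assumed to be waiting for its ZK session to expire.
   */
  static RsState classify(String server, Set<String> registered, Set<String> deadInProgress) {
    if (registered.contains(server)) {
      return RsState.REGISTERED;
    }
    if (deadInProgress.contains(server)) {
      return RsState.DEAD_IN_PROGRESS;
    }
    return RsState.AWAITING_ZK_EXPIRY;
  }
}
```

The third branch is the difficult one: nothing positive identifies such a server, it is only known by exclusion until its session expires.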


 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch




[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-13 Thread gaojinchao (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186100#comment-13186100
]

gaojinchao commented on HBASE-5179:
---

Another problem is that the shutdown handler flow cannot get the carrying-meta
flag while the master is starting.

See the comment below in the expired function.

// Was this server carrying meta? Can't ask CatalogTracker because it
// may have reset the meta location as null already (it may have already
// run into fact that meta is dead). I can ask assignment manager. It
// has an inmemory list of who has what. This list will be cleared as we
// process the dead server but should be find asking it now.
HServerAddress address = ct.getMetaLocation();
boolean carryingMeta =

Concurrent processing of processFaileOver and ServerShutdownHandler may cause
region to be assigned before log splitting is completed, causing data loss

Key: HBASE-5179
URL: https://issues.apache.org/jira/browse/HBASE-5179
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
Fix For: 0.92.0, 0.94.0, 0.90.6

Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch,
5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch


[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-13 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186116#comment-13186116
 ] 

gaojinchao commented on HBASE-5179:
---

I suggest that we don't. I need some time for a more detailed verification.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch




[jira] [Commented] (HBASE-5152) Region is on service before completing initialized when doing rollback of split, it will affect read correctness

2012-01-09 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182511#comment-13182511
 ] 

gaojinchao commented on HBASE-5152:
---

It seems the code should go above the if (coprocessorHost != null) check, 
since coprocessorHost.postOpen() means we have already opened the region.

 Region is on service before completing initialized when doing rollback of 
 split, it will affect read correctness 
 -

 Key: HBASE-5152
 URL: https://issues.apache.org/jira/browse/HBASE-5152
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
 Attachments: hbase-5152.patch





[jira] [Commented] (HBASE-4988) MetaServer crash cause all splitting regionserver abort

2012-01-09 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182547#comment-13182547
 ] 

gaojinchao commented on HBASE-4988:
---

+1

Good job!  


 MetaServer crash cause all splitting regionserver abort
 ---

 Key: HBASE-4988
 URL: https://issues.apache.org/jira/browse/HBASE-4988
 Project: HBase
  Issue Type: Bug
Reporter: chunhui shen
 Attachments: hbase-4988v1.patch


 If the meta server crashes now,
 all the region servers that are splitting will abort themselves,
 because of this code:
 {code}
 this.journal.add(JournalEntry.PONR);
 MetaEditor.offlineParentInMeta(server.getCatalogTracker(),
 this.parent.getRegionInfo(), a.getRegionInfo(), 
 b.getRegionInfo());
 {code}
 If the JournalEntry is PONR, the split's rollback will abort the region server.
 This is terrible in a heavy-write environment when the meta server crashes.
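
 The effect of journaling PONR (point of no return) before the meta edit can be sketched like this. It is a simplified, hypothetical model, not the SplitTransaction class: only the PONR entry name comes from the code above, and the other entry names and the rollback body are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of a split journal with a point of no return. */
final class SplitJournalSketch {
  enum JournalEntry { CREATE_SPLIT_DIR, CLOSED_PARENT_REGION, PONR }

  private final List<JournalEntry> journal = new ArrayList<JournalEntry>();

  /** Records a completed (or about-to-start) step of the split. */
  void record(JournalEntry entry) {
    journal.add(entry);
  }

  /**
   * Walks the journal in reverse order to undo the split.
   * Returns false if PONR was journaled: the split cannot be undone safely,
   * so the caller has no choice but to abort the region server.
   */
  boolean rollback() {
    for (int i = journal.size() - 1; i >= 0; i--) {
      if (journal.get(i) == JournalEntry.PONR) {
        return false;
      }
      // A real implementation would undo the journaled step here.
    }
    journal.clear();
    return true;
  }
}
```

 Because PONR is added before the meta edit is attempted, a meta server crash at that moment makes every in-flight split take the "return false" path.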


[jira] [Commented] (HBASE-5060) HBase client is blocked forever

2011-12-18 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172069#comment-13172069
 ] 

gaojinchao commented on HBASE-5060:
---

The test case passed.
My test code:
try {
  HBaseAdmin hbase = new HBaseAdmin(config);
  while (true) {
    try {
      // Poll tableExists() so that a blocked client thread becomes visible.
      if (hbase.tableExists(tableName)) {
        System.out.println("[FATAL] The usertable: " + tableName
            + " already exists");
      }
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        continue;
      }
    } catch (IOException e) {
      e.printStackTrace();
      continue;
    }
  }
} catch (IOException e) {
  // The closing part of the outer try was cut off in the original snippet.
  e.printStackTrace();
}
1. Run the test case.
2. Kill two ZK servers (out of three in total).
3. Start the killed servers again.



 HBase client is blocked forever
 ---

 Key: HBASE-5060
 URL: https://issues.apache.org/jira/browse/HBASE-5060
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Critical
 Fix For: 0.92.1, 0.90.6

 Attachments: HBASE-5060_Branch90trial.patch, HBASE-5060_trunk.patch


 The client had a temporary network failure. After it recovered, 
 I found that my client thread was blocked. 
 Looking at the stack and logs below, it appears that we use an invalid 
 CatalogTracker in function tableExists.
 Blocked stack:
 "WriteHbaseThread33" prio=10 tid=0x7f76bc27a800 nid=0x2540 in Object.wait() [0x7f76af4f3000]
    java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331)
   - locked <0x7f7a67817c98> (a java.util.concurrent.atomic.AtomicBoolean)
   at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
   at org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:427)
   at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:164)
   at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source)
   at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
   - locked <0x7f7a4c5dc578> (a com.huawei.hdi.hbase.HbaseReOper)
   at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
   at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)
 In ZooKeeperNodeTracker, we don't propagate the KeeperException to the higher level.
 So at the CatalogTracker level, we assume that ZooKeeperNodeTracker started 
 successfully and continue processing.
 [WriteHbaseThread33]2011-12-16 17:07:33,153[WARN ]  | 
 hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Unable to 
 get data of znode /hbase/root-region-server | 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:557)
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss for /hbase/root-region-server
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
  at 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
  at 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162)
  at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown 
 Source)
  at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
  at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
  at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)
 [WriteHbaseThread33]2011-12-16 17:07:33,361[ERROR]  | 
 hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Received 
 unexpected KeeperException, re-throwing exception | 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.keeperException(ZooKeeperWatcher.java:385)
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss for /hbase/root-region-server
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
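
 The difference between swallowing and propagating the failure in the tracker's start, as described above, can be sketched as follows. This is a hypothetical stand-in and not the HBase class: NodeTrackerSketch and ZnodeReader are invented names, and a plain IOException is used in place of KeeperException to keep the example self-contained.

```java
import java.io.IOException;

/** Hypothetical stand-in for a ZooKeeper-backed node tracker. */
final class NodeTrackerSketch {
  /** Hypothetical source of znode data that may fail on connection loss. */
  interface ZnodeReader {
    byte[] read() throws IOException;
  }

  private final ZnodeReader reader;
  private byte[] data;

  NodeTrackerSketch(ZnodeReader reader) {
    this.reader = reader;
  }

  /** Problematic variant: logs the failure and returns as if start had succeeded. */
  void startSwallowing() {
    try {
      data = reader.read();
    } catch (IOException e) {
      System.err.println("Unable to get data of znode: " + e.getMessage());
    }
  }

  /** Alternative variant: the caller sees the failure and can avoid waiting forever. */
  void startPropagating() throws IOException {
    data = reader.read();
  }

  /** True once znode data has been read successfully. */
  boolean hasData() {
    return data != null;
  }
}
```

 With the first variant, a caller that later waits for the data has no way to know that the initial read never happened.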

[jira] [Commented] (HBASE-4970) Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch)

2011-12-18 Thread gaojinchao (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171986#comment-13171986
]

gaojinchao commented on HBASE-4970:
---

No problem, Thanks for your work!

Allow better control of resource consumption in HTable (backport HBASE-4805
to 0.90 branch)
---

Key: HBASE-4970
URL: https://issues.apache.org/jira/browse/HBASE-4970
Project: HBase
Issue Type: Improvement
Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Trivial
Fix For: 0.90.6

Attachments: HBASE-4970_Branch90.patch,
HBASE-4970_Branch90_V1_trial.patch, HBASE-4970_Branch90_V2.patch,
HBASE-4970_Branch92_V2.patch, HBASE-4970_Trunk_V2.patch

In my cluster, I changed keepAliveTime from 60 s to 3600 s, and the growth of
RES slowed down.
Why does increasing the keepAliveTime of the HBase thread pool slow down the
problem [the RES increase]?
You can go through the source of sun.nio.ch.Util. Every thread holds 3
soft references to direct buffers (mustangsrc) for reuse. The code names the 3
soft references buffercache. If all cached buffers are occupied, or none is
suitable in size, and a new request comes in, a new direct buffer is allocated.
After the request is served, the bigger buffer replaces the smaller one in
buffercache, and the replaced buffer is released.
So I think we can add a parameter to change the keepAliveTime of the HTable
thread pool.
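
What such a parameter would control can be sketched with plain java.util.concurrent. This is only an illustration under my own assumptions: KeepAlivePoolSketch and newPool are hypothetical names, and how the value would be read from the HBase configuration is left out.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class KeepAlivePoolSketch {
  /** Builds a pool whose idle threads are kept for keepAliveSeconds before being reclaimed. */
  static ThreadPoolExecutor newPool(int maxThreads, long keepAliveSeconds) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        1, maxThreads, keepAliveSeconds, TimeUnit.SECONDS,
        new SynchronousQueue<Runnable>());
    // Let the core thread time out too, so an idle pool can shrink to zero threads.
    // This call requires keepAliveSeconds to be greater than zero.
    pool.allowCoreThreadTimeOut(true);
    return pool;
  }
}
```

For example, newPool(50, 3600) keeps idle threads for an hour, so threads (and whatever they cache per thread) are reused rather than recreated.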

[jira] [Commented] (HBASE-4970) Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch)

2011-12-16 Thread gaojinchao (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171381#comment-13171381
]

gaojinchao commented on HBASE-4970:
---

@Lars
Thanks for your review.

I would prefer to modify only the parameters.

Allow better control of resource consumption in HTable (backport HBASE-4805
to 0.90 branch)
---

Attachments: HBASE-4970_Branch90.patch,
HBASE-4970_Branch90_V1_trial.patch, HBASE-4970_Branch90_V2.patch,
HBASE-4970_Branch92_V2.patch, HBASE-4970_Trunk_V2.patch

[jira] [Commented] (HBASE-5009) Failure of creating split dir if it already exists prevents splits from happening further

2011-12-14 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13169266#comment-13169266
 ] 

gaojinchao commented on HBASE-5009:
---

Yes, that is the root cause. I think we should guarantee that all child threads 
are stopped.
At the same time, the splitdir is no longer useful, so we can also delete it. That seems harmless.


 Failure of creating split dir if it already exists prevents splits from 
 happening further
 -

 Key: HBASE-5009
 URL: https://issues.apache.org/jira/browse/HBASE-5009
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan

 The scenario is
 - The split of a region takes a long time
 - The deletion of the splitDir fails due to HDFS problems.
 - Subsequent splits also fail after that.
 {code}
 private static void createSplitDir(final FileSystem fs, final Path splitdir)
     throws IOException {
   if (fs.exists(splitdir)) throw new IOException("Splitdir already exits? " + splitdir);
   if (!fs.mkdirs(splitdir)) throw new IOException("Failed create of " + splitdir);
 }
 {code}
 Correct me if I am wrong. If this is an issue, can we change the behaviour of 
 throwing an exception?
 Please suggest.
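
 One possible change, sketched here with java.nio.file instead of the Hadoop FileSystem API so that the example is self-contained: delete a leftover splitdir before creating it again rather than throwing. SplitDirSketch is a hypothetical name and this is not the committed fix.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

final class SplitDirSketch {
  /** Creates splitdir; a leftover directory from an earlier failed split is deleted first. */
  static void createSplitDir(Path splitdir) throws IOException {
    if (Files.exists(splitdir)) {
      deleteRecursively(splitdir);
    }
    Files.createDirectories(splitdir);
  }

  /** Deletes files first, then each directory once it has been emptied. */
  private static void deleteRecursively(Path root) throws IOException {
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
        Files.delete(file);
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
        if (exc != null) {
          throw exc;
        }
        Files.delete(dir);
        return FileVisitResult.CONTINUE;
      }
    });
  }
}
```

 If the delete itself fails, the IOException still reaches the caller, so a persistent HDFS problem is not hidden.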


[jira] [Commented] (HBASE-5008) The clusters can't provide services because Region can't flush.

2011-12-11 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13167086#comment-13167086
 ] 

gaojinchao commented on HBASE-5008:
---

TestSplitTransactionOnCluster and TestSplitTransaction have passed.
The full test suite is still running; I will report the result tomorrow. 


 The clusters can't  provide services because Region can't flush.
 

 Key: HBASE-5008
 URL: https://issues.apache.org/jira/browse/HBASE-5008
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: gaojinchao
Priority: Blocker
 Fix For: 0.90.6

 Attachments: HBASE-5008_Branch90.patch


 Hbase version 0.90.4 + patches
 My analysis is as follows:
 //Started splitting region b24d8ccb852ff742f2a27d01b7f5853e and closed region.
 2011-12-10 17:32:48,653 INFO 
 org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of 
 region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.
 2011-12-10 17:32:49,759 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
 Closing 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 disabling compactions & flushes
 2011-12-10 17:32:49,759 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Running close preflush of 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.
 //A flush request was processed and skipped, but flushRequested had already been set to true
 2011-12-10 17:33:06,963 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
 Started memstore flush for 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e., 
 current region memstore size 12.6m
 2011-12-10 17:33:17,277 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
 Skipping flush on 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e. because 
 closing
 //The split of region b24d8ccb852ff742f2a27d01b7f5853e failed and was rolled back; 
 the flushRequested flag was still true, so all handlers were blocked 
 2011-12-10 17:34:01,293 INFO 
 org.apache.hadoop.hbase.regionserver.SplitTransaction: Cleaned up old failed 
 split transaction detritus: 
 hdfs://193.195.18.121:9000/hbase/Htable_UFDR_004/b24d8ccb852ff742f2a27d01b7f5853e/splits
 2011-12-10 17:34:01,294 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Onlined 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.; next 
 sequenceid=15494173
 2011-12-10 17:34:01,295 INFO 
 org.apache.hadoop.hbase.regionserver.CompactSplitThread: Successful rollback 
 of failed split of 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.
 2011-12-10 17:43:10,147 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 19 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 // All handlers had been blocked. The cluster could not provide service.
 2011-12-10 17:34:01,295 INFO 
 org.apache.hadoop.hbase.regionserver.CompactSplitThread: Successful rollback 
 of failed split of 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.
 2011-12-10 17:43:10,147 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 19 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 2011-12-10 17:43:10,192 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 34 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 2011-12-10 17:43:10,193 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 51 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 2011-12-10 17:43:10,196 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 85 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 2011-12-10 17:43:10,199 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 88 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 2011-12-10 17:43:10,202 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 44 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 2011-12-10 17:43:11,663 INFO
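
 The sequence in the analysis above can be modelled with a small sketch. It is a hypothetical simplification and not HRegion itself: FlushFlagSketch and the resetOnSkip switch are invented for illustration, with only the flushRequested flag name taken from the analysis. It shows how a flush skipped during close can leave the flag set, so that no later flush is ever queued after the rollback.

```java
/** Hypothetical, simplified model of a flush-request flag on a region. */
final class FlushFlagSketch {
  private boolean flushRequested;
  private boolean closing;
  // When true, a skipped flush clears the flag (one possible fix); when false, it stays set.
  private final boolean resetOnSkip;

  FlushFlagSketch(boolean resetOnSkip) {
    this.resetOnSkip = resetOnSkip;
  }

  /** Set while a split is closing the region; cleared again when the split is rolled back. */
  void setClosing(boolean closing) {
    this.closing = closing;
  }

  /** Queues a flush only if none is already marked as requested. Returns true if queued. */
  boolean requestFlush() {
    if (flushRequested) {
      return false;
    }
    flushRequested = true;
    return true;
  }

  /** Runs a queued flush. Returns true if the memstore was actually flushed. */
  boolean flush() {
    if (closing) {
      // The flush is skipped because the region is closing.
      if (resetOnSkip) {
        flushRequested = false;
      }
      return false;
    }
    flushRequested = false;
    return true;
  }
}
```

 With resetOnSkip set to false, requestFlush() keeps returning false after the rollback, which matches the handlers blocking once the memstore reaches the blocking size.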

[jira] [Commented] (HBASE-4970) Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch)

2011-12-08 Thread gaojinchao (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13165802#comment-13165802
]

gaojinchao commented on HBASE-4970:
---

I also want to add a parameter to change the keepAliveTime of the HTable thread
pool, so that clients have more options.

Allow better control of resource consumption in HTable (backport HBASE-4805
to 0.90 branch)
---

Attachments: HBASE-4970_Branch90.patch,
HBASE-4970_Branch90_V1_trial.patch

[jira] [Commented] (HBASE-4970) Add a parameter to change keepAliveTime of Htable thread pool.

2011-12-07 Thread gaojinchao (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13164276#comment-13164276
]

gaojinchao commented on HBASE-4970:
---

Sorry, I didn't see Lars's comment. I will try to backport HBASE-4805.

Add a parameter to change keepAliveTime of Htable thread pool.
---

Attachments: HBASE-4970_Branch90.patch

[jira] [Commented] (HBASE-4970) Add a parameter to change keepAliveTime of Htable thread pool.

2011-12-07 Thread gaojinchao (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13164364#comment-13164364
]

gaojinchao commented on HBASE-4970:
---

Addressed Lars's comment.

@Lars
Please review it first; I will test it on a real cluster tomorrow.

Add a parameter to change keepAliveTime of Htable thread pool.
---

Attachments: HBASE-4970_Branch90.patch,
HBASE-4970_Branch90_V1_trial.patch

[jira] [Commented] (HBASE-4970) Add a parameter to change keepAliveTime of Htable thread pool.

2011-12-06 Thread gaojinchao (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13164127#comment-13164127
]

gaojinchao commented on HBASE-4970:
---

OK, no problem.

Add a parameter to change keepAliveTime of Htable thread pool.
---

[jira] [Commented] (HBASE-4633) Potential memory leak in client RPC timeout mechanism

2011-12-04 Thread gaojinchao (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162538#comment-13162538
]

gaojinchao commented on HBASE-4633:
---

HBase version is 0.90.4 + patch.
Cluster size is 10.
One HBase client process runs 50 threads, so the maximum number of threads connected to the
RS is (50 * RS number).

I have noticed a memory leak problem in my HBase client.
RES has increased to 27g:
PID   USER PR NI VIRT  RES SHR  S %CPU %MEM TIME+     COMMAND
12676 root 20 0  30.8g 27g 5092 S 2    57.5 587:57.76 /opt/java/jre/bin/java -Djava.library.path=lib/.

But I am not sure whether the leak comes from the HBase client jar itself or from our
client code.

These are the JVM parameters:
-Xms15g -Xmn12g -Xmx15g -XX:PermSize=64m -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=65
-XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=1
-XX:+CMSParallelRemarkEnabled

Potential memory leak in client RPC timeout mechanism
-

Key: HBASE-4633
URL: https://issues.apache.org/jira/browse/HBASE-4633
Project: HBase
Issue Type: Bug
Components: client
Affects Versions: 0.90.3
Environment: HBase version: 0.90.3 + Patches , Hadoop version: CDH3u0
Reporter: Shrijeet Paliwal
Attachments: HBaseclientstack.png

Relevant Jiras: https://issues.apache.org/jira/browse/HBASE-2937,
https://issues.apache.org/jira/browse/HBASE-4003
We have been using the 'hbase.client.operation.timeout' knob
introduced in 2937 for quite some time now. It helps us enforce SLA.
We have two HBase clusters and two HBase client clusters. One of them
is much busier than the other.
We have seen a deterministic behavior of clients running in busy
cluster. Their (client's) memory footprint increases consistently
after they have been up for roughly 24 hours.
This memory footprint almost doubles from its usual value (usual case
== RPC timeout disabled). After much investigation nothing concrete
came out and we had to put in a hack
that keeps heap size under control even when RPC timeout is enabled. Also
note, the same behavior is not observed in the 'not so busy'
cluster.
The patch is here : https://gist.github.com/1288023

[jira] [Commented] (HBASE-4633) Potential memory leak in client RPC timeout mechanism

2011-12-04 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162596#comment-13162596
 ] 

gaojinchao commented on HBASE-4633:
---

I have tested this: memory does not increase when MaxDirectMemorySize is set to a 
moderate value.

In my cluster, a full GC is triggered roughly once an hour. See these logs:
10022.210: [Full GC (System) 10022.210: [Tenured: 577566K->257349K(1048576K), 
1.7515610 secs] 9651924K->257349K(14260672K), [Perm : 19161K->19161K(65536K)], 
1.7518350 secs] [Times: user=1.75 sys=0.00, real=1.75 secs]
...
13532.930: [GC 13532.931: [ParNew: 12801558K->981626K(13212096K), 0.1414370 
secs] 13111752K->1291828K(14260672K), 0.1416880 secs] [Times: user=1.90 
sys=0.01, real=0.14 secs]
13624.630: [Full GC (System) 13624.630: [Tenured: 310202K->175378K(1048576K), 
1.9529280 secs] 11581276K->175378K(14260672K), [Perm : 19225K->19225K(65536K)], 
1.9531660 secs] [Times: user=1.94 sys=0.00, real=1.96 secs]

 I monitored the memory. It is stable.

 7543 root  20   0 16.9g  15g 9892 S  1 33.0   1258:59 java
 7543 root  20   0 16.9g  15g 9892 S  0 33.0   1258:59 java
 7543 root  20   0 16.9g  15g 9892 S  1 33.0   1258:59 java
 7543 root  20   0 16.9g  15g 9892 S  0 33.0   1258:59 java
 7543 root  20   0 16.9g  15g 9892 S  1 33.0   1258:59 java
 7543 root  20   0 16.9g  15g 9892 S  1 33.0   1259:00 java
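
The heap limit (-Xmx) does not bound NIO direct buffers, which may be why capping them with -XX:MaxDirectMemorySize stabilised RES in this test. As a minimal sketch for checking this from inside a client JVM, the standard BufferPoolMXBean reports how much direct memory is currently held. The class name DirectMemoryProbe and the 8 MB allocation are only for illustration.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class DirectMemoryProbe {
    /** Returns the bytes held by the JVM's "direct" buffer pool, or -1 if it is not reported. */
    public static long directBytesUsed() {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        // Direct buffers live outside the Java heap, so -Xmx does not limit them.
        ByteBuffer buf = ByteBuffer.allocateDirect(8 * 1024 * 1024);
        long after = directBytesUsed();
        System.out.println("direct bytes before=" + before + " after=" + after
            + " capacity=" + buf.capacity());
    }
}
```

Logging this value periodically next to the RES column from top would show whether the growth is in direct buffers or elsewhere.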



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4773) HBaseAdmin leaks ZooKeeper connections

2011-11-25 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157352#comment-13157352
 ] 

gaojinchao commented on HBASE-4773:
---

In trunk, we should call deleteStaleConnection before throwing the exception, to 
clean up the stale connection data.


 HBaseAdmin leaks ZooKeeper connections
 --

 Key: HBASE-4773
 URL: https://issues.apache.org/jira/browse/HBASE-4773
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Priority: Critical
 Fix For: 0.90.5

 Attachments: 4773.patch


 When the master crashes, HBaseAdmin leaks ZooKeeper connections.
 I think we should close the ZK connection when MasterNotRunningException is thrown:
   public HBaseAdmin(Configuration c)
       throws MasterNotRunningException, ZooKeeperConnectionException {
     this.conf = HBaseConfiguration.create(c);
     this.connection = HConnectionManager.getConnection(this.conf);
     this.pause = this.conf.getLong("hbase.client.pause", 1000);
     this.numRetries = this.conf.getInt("hbase.client.retries.number", 10);
     this.retryLongerMultiplier =
         this.conf.getInt("hbase.client.retries.longer.multiplier", 10);
     // We should add this code to close the ZK connection.
     try {
       this.connection.getMaster();
     } catch (MasterNotRunningException e) {
       HConnectionManager.deleteConnection(conf, false);
       throw e;
     }
   }
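
The same idea, release the resource before rethrowing when a constructor fails, can be sketched without any HBase classes. Connection and AdminSketch below are hypothetical stand-ins, not the real HConnection or HBaseAdmin; the point is only that the caller never receives a reference it could close, so the constructor has to clean up.

```java
public class AdminSketch {
    /** Hypothetical stand-in for a shared connection that holds a ZooKeeper session. */
    public static class Connection implements AutoCloseable {
        private final boolean masterUp;
        private boolean closed;

        public Connection(boolean masterUp) {
            this.masterUp = masterUp;
        }

        public void checkMaster() {
            if (!masterUp) {
                throw new IllegalStateException("master not running");
            }
        }

        public boolean isClosed() {
            return closed;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    private final Connection connection;

    public AdminSketch(Connection connection) {
        this.connection = connection;
        try {
            connection.checkMaster();
        } catch (RuntimeException e) {
            // The constructor is about to fail, so nobody else can close this.
            connection.close();
            throw e;
        }
    }

    public static void main(String[] args) {
        Connection c = new Connection(false);
        try {
            new AdminSketch(c);
        } catch (IllegalStateException expected) {
            System.out.println("closed after failure: " + c.isClosed());
        }
    }
}
```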


[jira] [Commented] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails

2011-11-25 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157353#comment-13157353
 ] 

gaojinchao commented on HBASE-4868:
---

Thanks for the review. I will address all the comments.


 testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails
 -

 Key: HBASE-4868
 URL: https://issues.apache.org/jira/browse/HBASE-4868
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0

 Attachments: HBASE-4868_trial.patch


 See: 
 https://builds.apache.org/job/HBase-TRUNK-security/7/testReport/org.apache.hadoop.hbase.util.hbck/TestOfflineMetaRebuildBase/testMetaRebuild/
 Please review and let me know whether the approach makes sense. 
 If it does, I will check the other cases.


[jira] [Commented] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails

2011-11-25 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157400#comment-13157400
 ] 

gaojinchao commented on HBASE-4868:
---

@Test(timeout = 12)
  public void testMetaRebuild() throws Exception {

Doesn't this code already guarantee that? I don't understand where the timeout should be added.
Please point it out :)
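
For reference, JUnit 4's @Test(timeout = ...) takes a value in milliseconds and fails the test if it runs longer. The stdlib-only sketch below imitates that behaviour with a Future, so the effect can be seen without JUnit. runWithTimeout is a hypothetical helper, not part of any test framework.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutSketch {
    /** Runs the body on a separate thread; returns false if it exceeds timeoutMs. */
    public static boolean runWithTimeout(Callable<?> body, long timeoutMs) {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<?> result = exec.submit(body);
        try {
            result.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Interrupt the worker so a stuck body does not linger.
            result.cancel(true);
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            exec.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("fast body passed: " + runWithTimeout(() -> "done", 1000));
        System.out.println("slow body passed: " + runWithTimeout(() -> {
            Thread.sleep(2000);
            return null;
        }, 50));
    }
}
```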




-----Original Message-----
From: Ted Yu (Commented) (JIRA) [mailto:j...@apache.org] 
Sent: November 26, 2011 14:59
To: Gaojinchao
Subject: [jira] [Commented] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase 
occasionally fails


[ 
https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157395#comment-13157395
 ] 

Ted Yu commented on HBASE-4868:
---

HadoopQA is having problem - it didn't run test suite.

@Jinchao:
Please also add the timeout parameter as Jonathan suggested.






 testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails
 -

 Key: HBASE-4868
 URL: https://issues.apache.org/jira/browse/HBASE-4868
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.92.0
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0

 Attachments: HBASE-4868_trial.patch, HBASE-4868_trunkv2.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-22 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13155050#comment-13155050
 ] 

gaojinchao commented on HBASE-4739:
---

This patch is not compatible: I added M_ZK_REGION_CLOSING and deleted 
RS_ZK_REGION_CLOSING in EventHandler.java.

I have another question: can I delete the code block below in the function 
unassign(HRegionInfo region, boolean force)?

    } catch (NotServingRegionException nsre) {
      LOG.info("Server " + server + " returned " + nsre + " for " +
        region.getEncodedName());
      // Presume that master has stale data.  Presume remote side just split.
      // Presume that the split message when it comes in will fix up the master's
      // in memory cluster state.
    } catch (Throwable t)

I think we should rely on the RemoteException wrapper instead. 
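
To illustrate the question, here is a minimal sketch of why a direct catch clause may never fire when the server-side exception arrives wrapped, and how unwrapping recovers it. RemoteFailure and NotServingRegion are hypothetical stand-ins written for this example; they only imitate the general shape of an RPC wrapper exception and are not the Hadoop or HBase classes.

```java
public class UnwrapSketch {
    /** Hypothetical stand-in for the server-side exception type. */
    public static class NotServingRegion extends RuntimeException {
        public NotServingRegion(String msg) {
            super(msg);
        }
    }

    /** Hypothetical stand-in for an RPC wrapper that carries the remote class name. */
    public static class RemoteFailure extends RuntimeException {
        private final String className;

        public RemoteFailure(String className, String msg) {
            super(msg);
            this.className = className;
        }

        /** Rebuilds the original exception type when it is recognised. */
        public Throwable unwrap() {
            if (NotServingRegion.class.getName().equals(className)) {
                return new NotServingRegion(getMessage());
            }
            return this;
        }
    }

    public static String classify(Runnable rpc) {
        try {
            rpc.run();
            return "ok";
        } catch (NotServingRegion e) {
            // Only reached if the exception is thrown unwrapped.
            return "direct-nsre";
        } catch (RuntimeException e) {
            Throwable t = (e instanceof RemoteFailure) ? ((RemoteFailure) e).unwrap() : e;
            return (t instanceof NotServingRegion) ? "unwrapped-nsre" : "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(() -> {
            throw new RemoteFailure(NotServingRegion.class.getName(), "region moved");
        }));
    }
}
```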

 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4739_trial2.patch, 4739_trialV3.patch, 
 HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trail5.patch, 
 HBASE-4739_trial.patch, HBASE-4739_trial6.patch


 I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when 
 the master died it had just created the RIT znode for a region but didn't 
 tell the RS to close it yet.
 When the master restarted it saw the znode and started printing this:
 {quote}
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. 
 state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for 
 too long, this should eventually complete or the server will expire, doing 
 nothing
 {quote}
 It's never going to happen, and it's blocking balancing.
 I'm marking this as minor since I believe this situation is pretty rare 
 unless you hit other bugs while trying out stuff to root bugs out.


[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-22 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13155600#comment-13155600
 ] 

gaojinchao commented on HBASE-4739:
---

@Ted
It seems the 0.90.5 logic is OK. 
1. If RS_ZK_REGION_CLOSING has been created, the RS has received the RPC.
2. When the RIT times out, there are two cases. In the first, the RS is slow, and we 
don't need to send the RPC again.
   In the second, closing the region hit an exception; sending the RPC again can't solve the 
problem, because it may fail again.
So I think we don't need to fix anything.


 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4739_trial2.patch, 4739_trialV3.patch, 
 HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trail5.patch, 
 HBASE-4739_trial.patch, HBASE-4739_trial6.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-20 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13153969#comment-13153969
 ] 

gaojinchao commented on HBASE-4739:
---

HBASE-4739_trail5 makes a few changes. Please review; if it makes sense, I will 
verify it on a real cluster.


 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4739_trial2.patch, 4739_trialV3.patch, 
 HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trail5.patch, 
 HBASE-4739_trial.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-20 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154010#comment-13154010
 ] 

gaojinchao commented on HBASE-4739:
---

I think we don't need to handle RegionAlreadyInTransitionException. We 
only need to update the timestamp of the RIT, which we already do.
My reasons are:
1. The monitor timeout is 30 minutes, which is enough time to close a region.
2. If the RS throws RegionAlreadyInTransitionException, we need to 
update the timestamp of the RIT and wait for the next timeout.
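
The reasoning above can be sketched as a small model of the timeout bookkeeping. RitMonitorSketch and its method names are hypothetical and much simpler than the real AssignmentManager; only the 30-minute timeout comes from the comment.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RitMonitorSketch {
    // 30 minutes, as mentioned in the comment above.
    public static final long TIMEOUT_MS = 30L * 60L * 1000L;

    // Region name -> time the region entered (or last refreshed) its transition.
    private final Map<String, Long> ritTimestamps = new ConcurrentHashMap<>();

    public void startTransition(String region, long nowMs) {
        ritTimestamps.put(region, nowMs);
    }

    public boolean isTimedOut(String region, long nowMs) {
        Long ts = ritTimestamps.get(region);
        return ts != null && nowMs - ts > TIMEOUT_MS;
    }

    /** Called when the RS reports that it is already working on the region. */
    public void onAlreadyInTransition(String region, long nowMs) {
        // Refresh the timestamp and wait for the next timeout instead of resending.
        ritTimestamps.computeIfPresent(region, (name, old) -> nowMs);
    }

    public static void main(String[] args) {
        RitMonitorSketch monitor = new RitMonitorSketch();
        monitor.startTransition("region-a", 0L);
        long later = TIMEOUT_MS + 1;
        System.out.println("timed out: " + monitor.isTimedOut("region-a", later));
        monitor.onAlreadyInTransition("region-a", later);
        System.out.println("timed out after refresh: " + monitor.isTimedOut("region-a", later + 1));
    }
}
```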


 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4739_trial2.patch, 4739_trialV3.patch, 
 HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trail5.patch, 
 HBASE-4739_trial.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-17 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151923#comment-13151923
 ] 

gaojinchao commented on HBASE-4739:
---

Only one RPC is sent. I think the state machine is missing a state flag: the master doesn't 
distinguish between pending and closing. The code structure is not good, but this patch is 
not compatible.

Addressed the other comments.
 


 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4739_trial2.patch, HBASE-4739_Trunk.patch, 
 HBASE-4739_Trunk_V2.patch, HBASE-4739_trial.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-17 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13152021#comment-13152021
 ] 

gaojinchao commented on HBASE-4739:
---

I haven't finished the testing; I will continue tomorrow.

 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4739_trial2.patch, 4739_trialV3.patch, 
 HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trial.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-17 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13152585#comment-13152585
 ] 

gaojinchao commented on HBASE-4739:
---

@J-D
For 0.92, use HBASE-4739_Trunk_V2, which sends a CLOSING RPC from the timeout 
monitor (I will try to modify this patch).
For trunk, use patch 4739_trialV3.
HBase is used by thousands of people; if we have hit this once, it may well show up 
more often. So I think we need to solve this issue.

What do you say, J-D? 

I will do some more detailed testing of these patches and share my test cases.



 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4739_trial2.patch, 4739_trialV3.patch, 
 HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trial.patch



[jira] [Commented] (HBASE-4654) [replication] Add a check to make sure we don't replicate to ourselves

2011-11-16 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151143#comment-13151143
 ] 

gaojinchao commented on HBASE-4654:
---

Do we need to throw an exception in the addPeer API? 

 [replication] Add a check to make sure we don't replicate to ourselves
 --

 Key: HBASE-4654
 URL: https://issues.apache.org/jira/browse/HBASE-4654
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
 Fix For: 0.90.5

 Attachments: 4654-trunk.txt


 It's currently possible to add a peer for replication and point it to the 
 local cluster, which I believe could very well happen for those like us that 
 use only one ZK ensemble per DC so that only the root znode changes when you 
 want to set up replication intra-DC.
 I don't think comparing just the cluster ID would be enough because you would 
 normally use a different one for another cluster and nothing will block you 
 from pointing elsewhere.
 Comparing the ZK ensemble address doesn't work either when you have multiple 
 DNS entries that point at the same place.
 I think this could be resolved by looking up the master address in the 
 relevant znode as it should be exactly the same thing in the case where you 
 have the same cluster.


[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-16 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151712#comment-13151712
 ] 

gaojinchao commented on HBASE-4739:
---

trial2 addresses your comment. If it makes sense to you and Ram, I will test it.

 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4739_trial2.patch, HBASE-4739_Trunk.patch, 
 HBASE-4739_Trunk_V2.patch, HBASE-4739_trial.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-15 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150310#comment-13150310
 ] 

gaojinchao commented on HBASE-4739:
---

About the failed tests:
1. org.apache.hadoop.hbase.io.hfile.TestHFileBlock

  Look at this comment:
  public void testBlockHeapSize() {
    // We have seen multiple possible values for this estimate of the heap size
    // of a ByteBuffer, presumably depending on the JDK version.
    assertTrue(HFileBlock.BYTE_BUFFER_HEAP_SIZE == 64 ||
               HFileBlock.BYTE_BUFFER_HEAP_SIZE == 80);

But in https://issues.apache.org/jira/browse/HBASE-4768 we added this code:
  assertEquals(80, HFileBlock.BYTE_BUFFER_HEAP_SIZE);
  long byteBufferExpectedSize =
      ClassSize.align(ClassSize.estimateBase(buf.getClass(), true)
      + HFileBlock.HEADER_SIZE + size);

2. org.apache.hadoop.hbase.master.TestDistributedLogSplitting
This failed because we chose an RS with 0 regions.

// The log shows that the number of regions is 0.
2011-11-15 03:53:11,215 INFO  [Thread-2335] 
master.TestDistributedLogSplitting(211): #regions = 0
2011-11-15 03:53:11,215 DEBUG 
[RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer] 
wal.HLog$LogSyncer(1192): 
RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer 
interrupted while waiting for sync requests
2011-11-15 03:53:11,215 INFO  
[RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer] 
wal.HLog$LogSyncer(1194): 
RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer exiting
2011-11-15 03:53:11,215 DEBUG [Thread-2335] wal.HLog(967): closing hlog writer 
in 
hdfs://localhost:46229/user/jenkins/.logs/asf001.sp2.ygridcore.net,36721,1321329179789
2011-11-15 03:53:11,637 DEBUG [Thread-2335] master.SplitLogManager(233): 
Scheduling batch of logs to split


 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-15 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150325#comment-13150325
 ] 

gaojinchao commented on HBASE-4739:
---

I filed an issue for item 2: 
https://issues.apache.org/jira/browse/HBASE-4790

 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-14 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149528#comment-13149528
 ] 

gaojinchao commented on HBASE-4739:
---

---
 T E S T S
---

Running org.apache.hadoop.hbase.master.TestMasterFailover
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 99.82 sec

Results :

Tests run: 4, Failures: 0, Errors: 0, Skipped: 0


 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: HBASE-4739_Trunk.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-14 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149532#comment-13149532
 ] 

gaojinchao commented on HBASE-4739:
---

Please review this patch.


 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: HBASE-4739_Trunk.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-14 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150142#comment-13150142
 ] 

gaojinchao commented on HBASE-4739:
---

Why was the check for zk node existence in actOnTimeOut() at line 2531 removed?

That code snippet is needed in branch 0.90, because there the zk node 
RS_ZK_REGION_CLOSING is created by the RS.
In trunk the zk node RS_ZK_REGION_CLOSING is created by the master, so this 
condition should be impossible.

 


 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: HBASE-4739_Trunk.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-14 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150168#comment-13150168
 ] 

gaojinchao commented on HBASE-4739:
---

V2 addresses Ted's comment.

In my local environment the replication test cases passed. I will try to dig 
into this failure: 
https://builds.apache.org/job/PreCommit-HBASE-Build/243//testReport/org.apache.hadoop.hbase.replication/TestReplication/queueFailover/;

Running org.apache.hadoop.hbase.replication.TestReplicationSource
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.897 sec
Running org.apache.hadoop.hbase.replication.TestReplicationPeer
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 25.538 sec
Running 
org.apache.hadoop.hbase.replication.regionserver.TestReplicationSourceManager
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.089 sec
Running org.apache.hadoop.hbase.replication.regionserver.TestReplicationSink
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 22.178 sec
Running org.apache.hadoop.hbase.replication.TestReplication
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 145.709 sec
Running org.apache.hadoop.hbase.replication.TestMultiSlaveReplication
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 55.521 sec
Running org.apache.hadoop.hbase.replication.TestMasterReplication
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 99.819 sec
Running org.apache.hadoop.hbase.TestServerName


 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-14 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150267#comment-13150267
 ] 

gaojinchao commented on HBASE-4739:
---

@Ram
That is a good suggestion, but the timeout monitor needs 30 minutes. Should we 
wait that long?

 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-14 Thread gaojinchao (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150290#comment-13150290
]

gaojinchao commented on HBASE-4739:
---

In fact, I made another patch that is more complicated and is not compatible,
so I didn't commit it.
The steps in that patch:
1. HMaster creates a zk node (M_ZK_REGION_PENDING_CLOSE) and sets the RIT to
the pending-close state
2. HMaster sends the close RPC to the RS
3. The RS changes the zk node to RS_ZK_REGION_CLOSING
4. The master changes the RIT to closing

With these steps we can distinguish the states of the RIT: if the node is
M_ZK_REGION_PENDING_CLOSE we can send the RPC; if it is RS_ZK_REGION_CLOSING
we can add the region to RIT.
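
The decision this scheme would allow on master restart can be sketched as
follows. This is only an illustration of the steps above, not code from either
patch: the class, enum, and method names are hypothetical, and
M_ZK_REGION_PENDING_CLOSE is the state proposed in this comment, not something
confirmed to exist in HBase.

```java
// Hypothetical sketch of the proposal above; none of these names come from the patch.
final class CloseRecoverySketch {

  // States the region's znode could carry under the proposed scheme.
  enum ZkCloseState {
    M_ZK_REGION_PENDING_CLOSE, // proposed: written by the master before it sends the close RPC
    RS_ZK_REGION_CLOSING       // written by the RS once it has received the close RPC
  }

  // What a restarted master would do for a region found in each state.
  enum RecoveryAction {
    RESEND_CLOSE_RPC, // the RS may never have been told to close, so send the RPC
    ADD_TO_RIT        // the RS is already closing, so track the region in RIT
  }

  static RecoveryAction onMasterRestart(ZkCloseState state) {
    switch (state) {
      case M_ZK_REGION_PENDING_CLOSE:
        return RecoveryAction.RESEND_CLOSE_RPC;
      case RS_ZK_REGION_CLOSING:
        return RecoveryAction.ADD_TO_RIT;
      default:
        throw new IllegalArgumentException("unexpected state: " + state);
    }
  }

  public static void main(String[] args) {
    for (ZkCloseState s : ZkCloseState.values()) {
      System.out.println(s + " -> " + onMasterRestart(s));
    }
  }
}
```

The point of the extra state is that the znode alone then tells the restarted
master whether the close RPC may not have been sent yet.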

Master dying while going to close a region can leave it in transition forever
-

Key: HBASE-4739
URL: https://issues.apache.org/jira/browse/HBASE-4739
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
Fix For: 0.92.0, 0.94.0, 0.90.5

Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch


[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-10 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13148300#comment-13148300
 ] 

gaojinchao commented on HBASE-4739:
---

Yes, I am trying to reproduce this issue in trunk and make a patch.


 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5



[jira] [Commented] (HBASE-4681) StringIndexOutOfBoundsException parsing Hostname

2011-11-09 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13146878#comment-13146878
 ] 

gaojinchao commented on HBASE-4681:
---

I tested on a 3-node cluster and couldn't reproduce this issue. 


 StringIndexOutOfBoundsException parsing Hostname
 

 Key: HBASE-4681
 URL: https://issues.apache.org/jira/browse/HBASE-4681
 Project: HBase
  Issue Type: Bug
Reporter: stack
 Fix For: 0.92.0


 Starting a 0.92 on 0.90 data with 0.90 zk ensemble I got this:
 {code}
 2011-10-26 06:13:53,920 ERROR 
 org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/master 
 already exists and this is not a retry
 2011-10-26 06:13:53,927 FATAL org.apache.hadoop.hbase.master.HMaster: 
 Unhandled exception. Starting shutdown.
 java.lang.StringIndexOutOfBoundsException: String index out of range: -1
 at java.lang.String.substring(String.java:1937)
 at 
 org.apache.hadoop.hbase.ServerName.parseHostname(ServerName.java:81)
 at org.apache.hadoop.hbase.ServerName.<init>(ServerName.java:63)
 at 
 org.apache.hadoop.hbase.master.ActiveMasterManager.blockUntilBecomingActiveMaster(ActiveMasterManager.java:148)
 at 
 org.apache.hadoop.hbase.master.HMaster.becomeActiveMaster(HMaster.java:346)
 at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:301)
 at java.lang.Thread.run(Thread.java:662)
 2011-10-26 06:13:53,929 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
 2011-10-26 06:13:53,929 DEBUG org.apache.hadoop.hbase.master.HMaster: 
 Stopping service thre
 {code}
 I thought this had been fixed.  Dig in .
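
 The trace above (String.substring failing with index -1 inside
 parseHostname) is what one would see if the end index came from an indexOf
 call that did not find its separator. The sketch below reproduces that
 failure mode in isolation. It is not the ServerName source: the class and
 method names are hypothetical, the comma separator is assumed from the
 "host,port,startcode" names seen in the logs, and the "host:port" input is
 an assumed example of a value in another format.

```java
// Hypothetical illustration; this is not the ServerName source.
final class HostnameParseSketch {
  static final char SEPARATOR = ','; // assumed separator, as in "host,port,startcode"

  // Unguarded: indexOf returns -1 when the separator is missing,
  // and substring(0, -1) throws StringIndexOutOfBoundsException.
  static String parseUnguarded(String serverName) {
    return serverName.substring(0, serverName.indexOf(SEPARATOR));
  }

  // Guarded: reject input that is not in the expected format.
  static String parseGuarded(String serverName) {
    int idx = serverName.indexOf(SEPARATOR);
    if (idx <= 0) {
      throw new IllegalArgumentException(
          "Not in host,port,startcode format: " + serverName);
    }
    return serverName.substring(0, idx);
  }

  public static void main(String[] args) {
    System.out.println(parseGuarded("sv4r11s38,62003,1320195046948"));
    try {
      parseUnguarded("sv4r11s38:62003"); // assumed value without the separator
    } catch (StringIndexOutOfBoundsException e) {
      System.out.println("unguarded parse failed: " + e.getMessage());
    }
  }
}
```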


[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-09 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147458#comment-13147458
 ] 

gaojinchao commented on HBASE-4739:
---

I think this is an issue on the RS side: the RS was closing the region and 
threw an exception, since the RS has already created the closing zk node.
Closing the region again may not solve the problem. It would help to have the 
RS logs.
If you don't want to lose data, we should stop the RS and split the hlog to 
recover the data.



 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5



[jira] [Commented] (HBASE-4654) [replication] Add a check to make sure we don't replicate to ourselves

2011-11-09 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147469#comment-13147469
 ] 

gaojinchao commented on HBASE-4654:
---

@J-D
Can we fix this only in trunk or 0.92? 
We can use the ClusterId to judge whether it is the same cluster.

 [replication] Add a check to make sure we don't replicate to ourselves
 --

 Key: HBASE-4654
 URL: https://issues.apache.org/jira/browse/HBASE-4654
 Project: HBase
  Issue Type: Improvement
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
 Fix For: 0.90.5


 It's currently possible to add a peer for replication and point it to the 
 local cluster, which I believe could very well happen for those like us that 
 use only one ZK ensemble per DC so that only the root znode changes when you 
 want to set up replication intra-DC.
 I don't think comparing just the cluster ID would be enough because you would 
 normally use a different one for another cluster and nothing will block you 
 from pointing elsewhere.
 Comparing the ZK ensemble address doesn't work either when you have multiple 
 DNS entries that point at the same place.
 I think this could be resolved by looking up the master address in the 
 relevant znode as it should be exactly the same thing in the case where you 
 have the same cluster.
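
 The master-address comparison suggested above could look roughly like the
 sketch below. It assumes the two addresses have already been read from the
 local and peer znodes (that lookup is left out), and every name in it is
 hypothetical rather than taken from the replication code.

```java
// Hypothetical sketch; the znode lookups that produce the two strings are omitted.
final class SelfReplicationCheckSketch {

  // True when the peer's master address is exactly the local master address,
  // which is the case the issue wants to reject.
  static boolean isSelf(String localMasterAddress, String peerMasterAddress) {
    return localMasterAddress != null
        && localMasterAddress.equals(peerMasterAddress);
  }

  // Refuses to register a peer that points back at the local cluster.
  static void checkPeer(String localMasterAddress, String peerMasterAddress) {
    if (isSelf(localMasterAddress, peerMasterAddress)) {
      throw new IllegalArgumentException(
          "Peer points at the local cluster: " + peerMasterAddress);
    }
  }

  public static void main(String[] args) {
    // Example addresses are made up for the demonstration.
    checkPeer("master-a:60000", "master-b:60000"); // accepted
    try {
      checkPeer("master-a:60000", "master-a:60000");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```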


[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-09 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147515#comment-13147515
 ] 

gaojinchao commented on HBASE-4739:
---

In 0.90, I think this scenario cannot occur: the closing zk node is created 
only by the RS.
Look at this code:
  public void process() {
    int expectedVersion = FAILED;
    if (this.zk) {
      expectedVersion = setClosingState();
      if (expectedVersion == FAILED) return;
    }

In 0.92/trunk, the problem looks like it came from refactoring the code.
I think the flow should be:
1. The master creates a pending-close state flag.
2. The RS receives the close call and changes the zk node to the closing state.
3. When the master restarts, it should handle both states.



 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5



[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-09 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147543#comment-13147543
 ] 

gaojinchao commented on HBASE-4739:
---

It seems HBASE-3789 refactored the code

 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5



[jira] [Commented] (HBASE-4511) There is data loss when master failovers

2011-11-06 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144982#comment-13144982
 ] 

gaojinchao commented on HBASE-4511:
---

+1

 There is data loss when master failovers
 

 Key: HBASE-4511
 URL: https://issues.apache.org/jira/browse/HBASE-4511
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Assignee: stack
Priority: Minor
 Fix For: 0.92.0

 Attachments: 4511-v2.txt, 4511.txt, 
 org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt


 It goes like this:
 The master crashed, and at the same time the RS hosting meta was crashing but 
 had not exited.
 The master starts up again and finds all the living RSs. 
 The master's verification of meta fails, because this RS is crashing.
 The master reassigns meta, but it doesn't split the HLog. 
 So some meta data is lost.
 Here are the logs from a failover test case that failed. 
 //It says that we want to kill an RS
 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): 
 STOPPED: Killing for unit test
 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
 RS 192.168.2.102,54385,1317264874629 killed 
 //The RS didn't crash; the master still finds it registered in ZK. 
 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 master.HMaster(458): Registering server found up in zk: 
 192.168.2.102,54385,1317264874629
 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 master.ServerManager(232): Registering 
 server=192.168.2.102,54385,1317264874629
 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of 
 znode /hbase/unassigned/1028785192 because node does not exist (not an error)
 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) 
 of data from znode /hbase/root-region-server and set watcher; 
 192.168.2.102,54383,131726487...
 //Meta verification failed and the master reassigned meta, so all the regions 
 in meta are lost.
 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
 address=192.168.2.102,54385,1317264874629; 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
 192.168.2.102,54385,1317264874629 not running, aborting
 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(316): new .META. server: 
 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) 
 of data from znode /hbase/root-region-server and set watcher; 
 192.168.2.102,54383,131726487...
 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
 address=192.168.2.102,54385,1317264874629; 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
 192.168.2.102,54385,1317264874629 not running, aborting
 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(316): new .META. server: 
 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) 
 of data from znode /hbase/root-region-server and set watcher; 
 192.168.2.102,54383,131726487...
 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
 address=192.168.2.102,54385,1317264874629; 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
 192.168.2.102,54385,1317264874629 not running, aborting
 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(316): new .META. server: 
 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or 
 updating) unassigned node for 1028785192 with OFFLINE state
 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] 
 zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received
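
To illustrate why reassigning meta without splitting the HLog loses data, here is a toy model. It is not HBase code: the function `reopen_region` and its arguments are hypothetical, and it only assumes that edits not yet flushed live in the write-ahead log and must be replayed when the region is opened on another server.

```python
def reopen_region(flushed_rows, wal_edits, split_log_first):
    """Return the rows visible after a region is reopened on a new server.

    flushed_rows: dict of rows already persisted in store files.
    wal_edits: list of (key, value) edits that exist only in the log.
    split_log_first: whether the log is split and replayed before reopening.
    """
    rows = dict(flushed_rows)
    if split_log_first:
        # Replay the unflushed edits so the reopened region sees them.
        for key, value in wal_edits:
            rows[key] = value
    # Without replay, the unflushed edits are simply missing.
    return rows


flushed = {"region-a": "server-1"}
unflushed = [("region-b", "server-2")]

with_split = reopen_region(flushed, unflushed, split_log_first=True)
without_split = reopen_region(flushed, unflushed, split_log_first=False)
```

In this model, `without_split` is missing `region-b`, which corresponds to the "some meta data is lost" outcome described above.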

[jira] [Commented] (HBASE-4511) There is data loss when master failovers

2011-11-05 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144604#comment-13144604
 ] 

gaojinchao commented on HBASE-4511:
---

I still have a doubt: if the meta RS is the dying RS, how does its name get 
passed to metaLocation?

Look at these logs; they show that this.metaLocation is null.

2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
address=192.168.2.102,54385,1317264874629; 
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
catalog.CatalogTracker(316): new .META. server: 
192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of 
data from znode /hbase/root-region-server and set watcher; 
192.168.2.102,54383,131726487...
2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
address=192.168.2.102,54385,1317264874629; 
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
catalog.CatalogTracker(316): new .META. server: 
192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of 
data from znode /hbase/root-region-server and set watcher; 
192.168.2.102,54383,131726487...
2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
address=192.168.2.102,54385,1317264874629; 
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
catalog.CatalogTracker(316): new .META. server: 
192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) 
unassigned node for 1028785192 with OFFLINE state


 There is data loss when master failovers
 

 Key: HBASE-4511
 URL: https://issues.apache.org/jira/browse/HBASE-4511
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Minor
 Fix For: 0.92.0

 Attachments: 4511.txt, 
 org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt



[jira] [Commented] (HBASE-4749) TestMasterFailover case occasional fails

2011-11-04 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143990#comment-13143990
 ] 

gaojinchao commented on HBASE-4749:
---

This seems to be a bug in TRUNK.
In version 0.90, when we kill an RS and start a master at the same time, the 
master doesn't add the dying RS to the online set.
But in version 0.92 we do add the dying RS to the online set.
This produces a lot of unusual scenarios:
1. If root/meta is on a dying RS, we may lose data because the HLog isn't 
split. See https://issues.apache.org/jira/browse/HBASE-4511.
2. In the testMasterFailoverWithMockedRITOnDeadRS case, the mocked scenarios 
become invalid.

Look at these logs:

//We kill this RS (1320357166142).
2011-11-03 21:52:56,007 INFO  [Thread-986] master.TestMasterFailover(1011): 
Killing RS juno.apache.org,60001,1320357166142 

//The new master still picks up this RS (1320357166142) through its ZK node.
2011-11-03 21:52:57,356 INFO  [Master:0;juno.apache.org,51313,1320357176029] 
master.HMaster(464): Registering server found up in zk: 
juno.apache.org,60001,1320357166142
2011-11-03 21:52:57,357 INFO  [Master:0;juno.apache.org,51313,1320357176029] 
master.ServerManager(239): Registering 
server=juno.apache.org,60001,1320357166142


So I think we should wait until the killed RS has shut down before starting a 
new HMaster.
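
A minimal sketch of the "wait until the killed RS has shut down" idea, written as a generic polling helper. This is not the TestMasterFailover code: `wait_until` and `FakeRegionServer` are hypothetical names used only for illustration, and the fake server simply reports "stopped" after a fixed number of polls.

```python
import time


def wait_until(condition, timeout_s=30.0, interval_s=0.1):
    """Poll condition() until it returns True or timeout_s elapses.

    Returns True if the condition became true, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


class FakeRegionServer:
    """Stand-in for a region server that takes a few polls to stop."""

    def __init__(self, polls_until_stopped):
        self._remaining = polls_until_stopped

    def is_stopped(self):
        if self._remaining > 0:
            self._remaining -= 1
            return False
        return True


rs = FakeRegionServer(polls_until_stopped=3)
stopped = wait_until(rs.is_stopped, timeout_s=5.0, interval_s=0.0)
# Only after this returns True would the test start the new master.
```

The point of the design is that the test proceeds on an observed state (the server has stopped) instead of on a fixed sleep, so the new master cannot register a server that is still on its way down.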

 TestMasterFailover case occasional fails
 

 Key: HBASE-4749
 URL: https://issues.apache.org/jira/browse/HBASE-4749
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Minor
 Fix For: 0.92.0


 See these logs:
 https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/


[jira] [Commented] (HBASE-4511) There is data loss when master failovers

2011-11-04 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144523#comment-13144523
 ] 

gaojinchao commented on HBASE-4511:
---

The meta table is different from the root table. When the meta RS is dying, we 
should not use getMetaLocation.
I think we should get the server name (sn) from the root table. 

 There is data loss when master failovers
 

 Key: HBASE-4511
 URL: https://issues.apache.org/jira/browse/HBASE-4511
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Minor
 Fix For: 0.92.0

 Attachments: 
 org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt



[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails

2011-11-04 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144526#comment-13144526
 ] 

gaojinchao commented on HBASE-4749:
---

The logs contain this: Caused by: java.io.IOException: Too many open files


 TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails
 -

 Key: HBASE-4749
 URL: https://issues.apache.org/jira/browse/HBASE-4749
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Critical
 Fix For: 0.92.0

 Attachments: 4749.txt


 See these logs:
 https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/


[jira] [Commented] (HBASE-4511) There is data loss when master failovers

2011-11-04 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144598#comment-13144598
 ] 

gaojinchao commented on HBASE-4511:
---

@stack
Sorry, I was wrong. The patch makes sense.

 There is data loss when master failovers
 

 Key: HBASE-4511
 URL: https://issues.apache.org/jira/browse/HBASE-4511
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Minor
 Fix For: 0.92.0

 Attachments: 4511.txt, 
 org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt



[jira] [Commented] (HBASE-4577) Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB

2011-11-03 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142866#comment-13142866
 ] 

gaojinchao commented on HBASE-4577:
---

Yes, I think so, but I am still digging.
I am not familiar with MR, and this issue is not very important for the 0.92 
release, so please give me some time.



 Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB
 -

 Key: HBASE-4577
 URL: https://issues.apache.org/jira/browse/HBASE-4577
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0

 Attachments: HBASE-4577_trial_Trunk.patch, HBASE-4577_trunk.patch


 Minor issue while looking at the RS metrics:
 bq. numberOfStorefiles=8, storefileUncompressedSizeMB=2418, 
 storefileSizeMB=2420, compressionRatio=1.0008
 I guess there's a truncation somewhere when it's adding the numbers up.
 FWIW there's no compression on that table.
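
As an illustration of the truncation guess above, the sketch below shows how two totals over the same files can disagree when one path truncates each file size to whole MB before adding and the other adds bytes first. This is not the region server metrics code; the function names and the file sizes are made up for the example.

```python
MB = 1024 * 1024


def sum_then_truncate(sizes_in_bytes):
    """Add all sizes in bytes, then convert the total to whole MB."""
    return sum(sizes_in_bytes) // MB


def truncate_then_sum(sizes_in_bytes):
    """Convert each size to whole MB first, then add the results."""
    return sum(size // MB for size in sizes_in_bytes)


# Eight hypothetical store files of 302 MB plus 256 KB each.
sizes = [302 * MB + 256 * 1024] * 8

total_a = sum_then_truncate(sizes)   # fractions of a MB accumulate
total_b = truncate_then_sum(sizes)   # fractions are dropped per file
```

With these made-up sizes the two totals differ by 2 MB even though no compression is involved, which is the kind of gap reported in the metrics line.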


[jira] [Commented] (HBASE-4577) Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB

2011-11-03 Thread gaojinchao (Commented) (JIRA)

[
https://issues.apache.org/jira/browse/HBASE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143172#comment-13143172
]

gaojinchao commented on HBASE-4577:
---

I didn't find any exception in the logs. It seems the test case has a bug.
Processing steps:
1. Create a table with 5 regions.
2. Generate data based on those 5 regions.
3. Change the table to 15 regions.
4. Load the data into the new table.
In some cases, the data of 1 region may become the data of 11 regions. But the
parameter hbase.bulkload.retries.number is set to 10, so we only try 10 times
and some data can't be loaded into the regions.

If I am wrong, please correct me! Thanks.

I think we should change the table to 14 rather than 15 regions in this case.
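
The arithmetic behind the 14-versus-15 suggestion can be sketched with a simplified model. It is an assumption, not the actual LoadIncrementalHFiles logic: each attempt either loads a file that fits in one region or splits it into a bottom half (one region) and a top half (the rest). The function names are hypothetical, and `worst_case_span` assumes every other original region ends up with at least one new region.

```python
def worst_case_span(old_regions, new_regions):
    """Largest number of new regions one original region's data can cover."""
    return new_regions - (old_regions - 1)


def attempts_needed(span):
    """Attempts until a file covering `span` regions is fully loaded."""
    attempts = 0
    pending = [span]  # number of regions each pending file covers
    while pending:
        attempts += 1
        next_pending = []
        for covered in pending:
            if covered > 1:
                # Split at one region boundary: bottom fits, top may not.
                next_pending.extend([1, covered - 1])
            # covered == 1: the file fits in a region and is loaded now.
        pending = next_pending
    return attempts


RETRIES = 10  # value of hbase.bulkload.retries.number quoted above

span_15 = worst_case_span(5, 15)
span_14 = worst_case_span(5, 14)
fits_15 = attempts_needed(span_15) <= RETRIES
fits_14 = attempts_needed(span_14) <= RETRIES
```

Under this model the 15-region case needs one attempt more than the retry limit allows in the worst case, while the 14-region case just fits.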

Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB
-

Key: HBASE-4577
URL: https://issues.apache.org/jira/browse/HBASE-4577
Project: HBase
Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
Fix For: 0.92.0

Attachments: HBASE-4577_trial_Trunk.patch, HBASE-4577_trunk.patch

Minor issue while looking at the RS metrics:
bq. numberOfStorefiles=8, storefileUncompressedSizeMB=2418,
storefileSizeMB=2420, compressionRatio=1.0008
I guess there's a truncation somewhere when it's adding the numbers up.
FWIW there's no compression on that table.

[jira] [Commented] (HBASE-4577) Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB

2011-11-03 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143174#comment-13143174
 ] 

gaojinchao commented on HBASE-4577:
---

Look at these logs:
2011-11-02 04:23:42,761 ERROR [main] mapreduce.LoadIncrementalHFiles(214): 
Retry attempted 10 times without completing, bailing out
2011-11-02 04:23:42,761 ERROR [main] mapreduce.LoadIncrementalHFiles(240): 
-
Bulk load aborted with some files not yet loaded:
-
  
hdfs://localhost:45622/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/target/test-data/4660a8cd-aaa5-466a-8ff1-5b824cd4553e/testLocalMRIncrementalLoad/info-A/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/TestTable,27.bottom
  
hdfs://localhost:45622/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/target/test-data/4660a8cd-aaa5-466a-8ff1-5b824cd4553e/testLocalMRIncrementalLoad/info-A/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/TestTable,27.top
  
hdfs://localhost:45622/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/target/test-data/4660a8cd-aaa5-466a-8ff1-5b824cd4553e/testLocalMRIncrementalLoad/info-B/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/TestTable,28.bottom
  
hdfs://localhost:45622/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/target/test-data/4660a8cd-aaa5-466a-8ff1-5b824cd4553e/testLocalMRIncrementalLoad/info-B/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/TestTable,28.top

 Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB
 -

 Key: HBASE-4577
 URL: https://issues.apache.org/jira/browse/HBASE-4577
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0

 Attachments: HBASE-4577_trial_Trunk.patch, HBASE-4577_trunk.patch


 Minor issue while looking at the RS metrics:
 bq. numberOfStorefiles=8, storefileUncompressedSizeMB=2418, 
 storefileSizeMB=2420, compressionRatio=1.0008
 I guess there's a truncation somewhere when it's adding the numbers up.
 FWIW there's no compression on that table.


[jira] [Commented] (HBASE-4577) Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB

2011-11-02 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142008#comment-13142008
 ] 

gaojinchao commented on HBASE-4577:
---

The test failed, but it doesn't seem to be a problem with the patch.

 Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB
 -

 Key: HBASE-4577
 URL: https://issues.apache.org/jira/browse/HBASE-4577
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0

 Attachments: HBASE-4577_trial_Trunk.patch, HBASE-4577_trunk.patch


 Minor issue while looking at the RS metrics:
 bq. numberOfStorefiles=8, storefileUncompressedSizeMB=2418, 
 storefileSizeMB=2420, compressionRatio=1.0008
 I guess there's a truncation somewhere when it's adding the numbers up.
 FWIW there's no compression on that table.


[jira] [Commented] (HBASE-4577) Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB

2011-11-02 Thread gaojinchao (Commented) (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142012#comment-13142012
 ] 

gaojinchao commented on HBASE-4577:
---

My local test result:

Running org.apache.hadoop.hbase.TestMultiVersions
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 39.045 sec

Results :

Failed tests:   testHBaseFsck(org.apache.hadoop.hbase.util.TestHBaseFsck): 
expected:0 but was:1

Tests in error:
  
testMasterFailoverWithMockedRITOnDeadRS(org.apache.hadoop.hbase.master.TestMasterFailover):
 test timed out after 18 milliseconds
  
testEnableTableRoundRobinAssignment(org.apache.hadoop.hbase.client.TestAdmin): 
org.apache.hadoop.hbase.TableNotEnabledException: testEnableTableAssignment
  
testBadOriginalRootLocation(org.apache.hadoop.hbase.catalog.TestCatalogTrackerOnCluster):
 unknown host: example.org

Tests run: 1073, Failures: 1, Errors: 3, Skipped: 9



 Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB
 -

 Key: HBASE-4577
 URL: https://issues.apache.org/jira/browse/HBASE-4577
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0

 Attachments: HBASE-4577_trial_Trunk.patch, HBASE-4577_trunk.patch


 Minor issue while looking at the RS metrics:
 bq. numberOfStorefiles=8, storefileUncompressedSizeMB=2418, 
 storefileSizeMB=2420, compressionRatio=1.0008
 I guess there's a truncation somewhere when it's adding the numbers up.
 FWIW there's no compression on that table.

