[jira] [Commented] (HBASE-5615) the master never do balance becauseof balance the parent region
[ https://issues.apache.org/jira/browse/HBASE-5615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13237484#comment-13237484 ] gaojinchao commented on HBASE-5615: --- +1 the master never do balance becauseof balance the parent region Key: HBASE-5615 URL: https://issues.apache.org/jira/browse/HBASE-5615 Project: HBase Issue Type: Bug Affects Versions: 0.90.7 Reporter: xufeng Assignee: xufeng Priority: Critical Attachments: HBASE-5615-90.patch, HBASE-5615.patch, NoPatched-surefire-report-5615-90.html, Patched_surefire-report-5615-90.html the master never do balance becauseof when master do rebuildUserRegions(),it will add the parent region into AssignmentManager#servers, if balancer let the parent region to move,the parent will in RIT forever.thus balance will never be executed. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5488) Fixed OfflineMetaRepair bug
[ https://issues.apache.org/jira/browse/HBASE-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13218976#comment-13218976 ] gaojinchao commented on HBASE-5488: --- Thanks for your review. Fixed OfflineMetaRepair bug Key: HBASE-5488 URL: https://issues.apache.org/jira/browse/HBASE-5488 Project: HBase Issue Type: Bug Affects Versions: 0.90.6 Reporter: gaojinchao Assignee: gaojinchao Priority: Minor Fix For: 0.90.7, 0.92.1 Attachments: HBASE-5488-branch92.patch, HBASE-5488-trunk.patch, HBASE-5488_branch90.txt I want to use OfflineMetaRepair tools and found onbody fix this bugs. I will make a patch. 12/01/05 23:23:30 ERROR util.HBaseFsck: Bailed out due to: java.lang.IllegalArgumentException: Wrong FS: hdfs:// us01-ciqps1-name01.carrieriq.com:9000/hbase/M2M-INTEGRATION-MM_TION-13 25190318714/0003d2ede27668737e192d8430dbe5d0/.regioninfo, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:352) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:368) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:126) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398) at org.apache.hadoop.hbase.util.HBaseFsck.loadMetaEntry(HBaseFsck.java:256) at org.apache.hadoop.hbase.util.HBaseFsck.loadTableInfo(HBaseFsck.java:284) at org.apache.hadoop.hbase.util.HBaseFsck.rebuildMeta(HBaseFsck.java:402) at org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair.main(OfflineMetaRe -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4925) Collect test cases for hadoop/hbase cluster
[ https://issues.apache.org/jira/browse/HBASE-4925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13209115#comment-13209115 ] gaojinchao commented on HBASE-4925: --- @ Thomas Your framework is available? We finished part cases automation , but I find it is not stable. Collect test cases for hadoop/hbase cluster --- Key: HBASE-4925 URL: https://issues.apache.org/jira/browse/HBASE-4925 Project: HBase Issue Type: Brainstorming Components: test Reporter: Thomas Pan This entry is used to collect all the useful test cases to verify a hadoop/hbase cluster. This is to follow up on yesterday's hack day in Salesforce. Hopefully that the information would be very useful for the whole community. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5200) AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the region assignment inconsistent
[ https://issues.apache.org/jira/browse/HBASE-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206834#comment-13206834 ] gaojinchao commented on HBASE-5200: --- +1 for 0.92 and trunk AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the region assignment inconsistent - Key: HBASE-5200 URL: https://issues.apache.org/jira/browse/HBASE-5200 Project: HBase Issue Type: Bug Affects Versions: 0.90.5 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.94.0, 0.90.7, 0.92.1 Attachments: 5200-v2.txt, HBASE-5200.patch, HBASE-5200_1.patch, TEST-org.apache.hadoop.hbase.master.TestRestartCluster.xml This is the scenario Consider a case where the balancer is going on thus trying to close regions in a RS. Before we could close a master switch happens. On Master switch the set of nodes that are in RIT is collected and we first get Data and start watching the node After that the node data is added into RIT. Now by this time (before adding to RIT) if the RS to which close was called does a transition in AM.handleRegion() we miss the handling saying RIT state was null. {code} 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region a66d281d231dfcaea97c270698b26b6f from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region c12e53bfd48ddc5eec507d66821c4d23 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 59ae13de8c1eb325a0dd51f4902d2052 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region f45bc9614d7575f35244849af85aa078 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region cc3ecd7054fe6cd4a1159ed92fd62641 from server HOST-192-168-47-204,20020,1326342744518 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 3af40478a17fee96b4a192b22c90d5a2 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region e6096a8466e730463e10d3d61f809b92 from server HOST-192-168-47-204,20020,1326342744518 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 4806781a1a23066f7baed22b4d237e24 from server HOST-192-168-47-204,20020,1326342744518 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region d69e104131accaefe21dcc01fddc7629 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states {code} In branch the CLOSING node is created by RS thus leading to more inconsistency. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5200) AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the region assignment inconsistent
[ https://issues.apache.org/jira/browse/HBASE-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206118#comment-13206118 ] gaojinchao commented on HBASE-5200: --- It seems we need consider issue HBASE-4739 AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the region assignment inconsistent - Key: HBASE-5200 URL: https://issues.apache.org/jira/browse/HBASE-5200 Project: HBase Issue Type: Bug Affects Versions: 0.90.5 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.94.0, 0.90.7, 0.92.1 Attachments: 5200-v2.txt, HBASE-5200.patch, HBASE-5200_1.patch, TEST-org.apache.hadoop.hbase.master.TestRestartCluster.xml This is the scenario Consider a case where the balancer is going on thus trying to close regions in a RS. Before we could close a master switch happens. On Master switch the set of nodes that are in RIT is collected and we first get Data and start watching the node After that the node data is added into RIT. Now by this time (before adding to RIT) if the RS to which close was called does a transition in AM.handleRegion() we miss the handling saying RIT state was null. {code} 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region a66d281d231dfcaea97c270698b26b6f from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region c12e53bfd48ddc5eec507d66821c4d23 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 59ae13de8c1eb325a0dd51f4902d2052 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region f45bc9614d7575f35244849af85aa078 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region cc3ecd7054fe6cd4a1159ed92fd62641 from server HOST-192-168-47-204,20020,1326342744518 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 3af40478a17fee96b4a192b22c90d5a2 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region e6096a8466e730463e10d3d61f809b92 from server HOST-192-168-47-204,20020,1326342744518 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 4806781a1a23066f7baed22b4d237e24 from server HOST-192-168-47-204,20020,1326342744518 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region d69e104131accaefe21dcc01fddc7629 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states {code} In branch the CLOSING node is created by RS thus leading to more inconsistency. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5200) AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the region assignment inconsistent
[ https://issues.apache.org/jira/browse/HBASE-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206128#comment-13206128 ] gaojinchao commented on HBASE-5200: --- other issues , 1. comments said If ROOT or .META. table is waiting for timeout..., But the code isMetaTable is only Meta table . it seems we should use isMetaRegion. 2. In branch 90 getRegion only get region from meta table? It is any problem when root region server crashed? we reassign the root region? AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the region assignment inconsistent - Key: HBASE-5200 URL: https://issues.apache.org/jira/browse/HBASE-5200 Project: HBase Issue Type: Bug Affects Versions: 0.90.5 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.94.0, 0.90.7, 0.92.1 Attachments: 5200-v2.txt, HBASE-5200.patch, HBASE-5200_1.patch, TEST-org.apache.hadoop.hbase.master.TestRestartCluster.xml This is the scenario Consider a case where the balancer is going on thus trying to close regions in a RS. Before we could close a master switch happens. On Master switch the set of nodes that are in RIT is collected and we first get Data and start watching the node After that the node data is added into RIT. Now by this time (before adding to RIT) if the RS to which close was called does a transition in AM.handleRegion() we miss the handling saying RIT state was null. {code} 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region a66d281d231dfcaea97c270698b26b6f from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region c12e53bfd48ddc5eec507d66821c4d23 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,358 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 59ae13de8c1eb325a0dd51f4902d2052 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region f45bc9614d7575f35244849af85aa078 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region cc3ecd7054fe6cd4a1159ed92fd62641 from server HOST-192-168-47-204,20020,1326342744518 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 3af40478a17fee96b4a192b22c90d5a2 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region e6096a8466e730463e10d3d61f809b92 from server HOST-192-168-47-204,20020,1326342744518 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region 4806781a1a23066f7baed22b4d237e24 from server HOST-192-168-47-204,20020,1326342744518 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states 2012-01-13 10:50:46,359 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region d69e104131accaefe21dcc01fddc7629 from server HOST-192-168-47-205,20020,1326363111288 but region was in the state null and not in expected PENDING_CLOSE or CLOSING states {code} In branch the CLOSING node is created by RS thus leading to more inconsistency. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5231) Backport HBASE-3373 (per-table load balancing) to 0.92
[ https://issues.apache.org/jira/browse/HBASE-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190945#comment-13190945 ] gaojinchao commented on HBASE-5231: --- I think we can do. regarding to log Done. Calculated a load balance in , we can move out balanceCluster. move to below code ? + for (MapServerName, ListHRegionInfo assignments : assignmentsByTable.values()) { +ListRegionPlan partialPlans = this.balancer.balanceCluster(assignments); +if (partialPlans != null) plans.addAll(partialPlans); } Backport HBASE-3373 (per-table load balancing) to 0.92 -- Key: HBASE-5231 URL: https://issues.apache.org/jira/browse/HBASE-5231 Project: HBase Issue Type: Improvement Reporter: Zhihong Yu Fix For: 0.92.1 Attachments: 5231.txt This JIRA backports per-table load balancing to 0.90 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5231) Backport HBASE-3373 (per-table load balancing) to 0.92
[ https://issues.apache.org/jira/browse/HBASE-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190899#comment-13190899 ] gaojinchao commented on HBASE-5231: --- Maybe there is a little problem about log. every table calls balanceCluster ,Many logs as Skipping load balancing because balanced cluster; will be printed. Backport HBASE-3373 (per-table load balancing) to 0.92 -- Key: HBASE-5231 URL: https://issues.apache.org/jira/browse/HBASE-5231 Project: HBase Issue Type: Improvement Reporter: Zhihong Yu Fix For: 0.92.1 Attachments: 5231.txt This JIRA backports per-table load balancing to 0.90 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190299#comment-13190299 ] gaojinchao commented on HBASE-5179: --- Do you want to add any new case? Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 5179-90v16.patch, 5179-90v17.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-92v17.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v17.patch, hbase-5179v17.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13190298#comment-13190298 ] gaojinchao commented on HBASE-5179: --- This is V16 test result in branch90, number discriberesult 1 a new cluster startup ok 2 restart a cluster ok 3 No region serve crash ok 4 After Meta region server registered, and then crashed ok 5 After Meta/root region server registered, and then crashed ok 6 After Hmaster crashed and Meta/root region server crashed. Hmaster and region server start at same time.0k 7 After Hmaster crashed and Meta/root region server crashed. Hmaster startok Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 5179-90v16.patch, 5179-90v17.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-92v17.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v17.patch, hbase-5179v17.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189140#comment-13189140 ] gaojinchao commented on HBASE-5179: --- @Chunhui In 90v12, Maybe below code(metaServerInfo/rootServerInfo) has some nullexception? // MetaServer is may being processed as dead server. Before assign meta, // we need to wait until its log is splitted. waitUntilNoLogDir(metaServerInfo.getServerName()); if (!this.serverManager.isDeadMetaServerInProgress()) { this.assignmentManager.assignMeta(); } Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189148#comment-13189148 ] gaojinchao commented on HBASE-5179: --- @chunhui I only verify the branch90 version. I don't have the cluster for branch92(that needs a few day, we are planing to setup some test cluster.) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189595#comment-13189595 ] gaojinchao commented on HBASE-5179: --- In my test case, I kill meta/root at same time. when master start to assign root/meta region , it should finish split hlogs. Today I will give a detail test report about 90V14. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189600#comment-13189600 ] gaojinchao commented on HBASE-5179: --- no problem. before I start to verify the patch. I will review it. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189641#comment-13189641 ] gaojinchao commented on HBASE-5179: --- @chunhui The first test case failed, we start a cluster, the patch is split a new region server's Hlog. 2012-01-20 00:34:39,462 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:20010 2012-01-20 00:34:39,462 DEBUG org.apache.hadoop.hbase.master.HMaster: Started service threads 2012-01-20 00:34:40,158 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=C3S32,20020,1327037679721, regionCount=0, userLoad=false 2012-01-20 00:34:40,296 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=C3S33,20020,1327037679059, regionCount=0, userLoad=false 2012-01-20 00:34:40,488 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=C3S31,20020,1327037679673, regionCount=0, userLoad=false 2012-01-20 00:34:40,962 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) count to settle; currently=3 2012-01-20 00:34:42,462 INFO org.apache.hadoop.hbase.master.ServerManager: Finished waiting for regionserver count to settle; count=3, sleptFor=3000 2012-01-20 00:34:42,463 INFO org.apache.hadoop.hbase.master.ServerManager: Exiting wait on regionserver(s) to checkin; count=3, stopped=false, count of regions out on cluster=0 2012-01-20 00:34:42,463 INFO org.apache.hadoop.hbase.master.HMaster: --sleep 60s- 2012-01-20 00:35:42,469 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://C3S31:9000/hbase/.logs/C3S31,20020,1327037679673 belongs to an existing region server 2012-01-20 00:35:42,470 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://C3S31:9000/hbase/.logs/C3S32,20020,1327037679721 belongs to an existing region server 2012-01-20 00:35:42,470 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://C3S31:9000/hbase/.logs/C3S33,20020,1327037679059 belongs to an existing region server 2012-01-20 00:35:42,504 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of -ROOT-,,0 at address=C3S32:20020; org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region is not online: -ROOT-,,0 2012-01-20 00:36:42,610 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown. java.lang.RuntimeException: Timed out waiting to finish splitting log for C3S32,20020,1327037679721 at org.apache.hadoop.hbase.master.HMaster.waitUntilNoLogDir(HMaster.java:578) at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:478) at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:422) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283) 2012-01-20 00:36:42,613 INFO org.apache.hadoop.hbase.master.HMaster: Aborting 2012-01-20 00:36:42,613 DEBUG org.apache.hadoop.hbase.master.HMaster: Stopping service threads 2012-01-20 00:36:42,613 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 2 Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v13.txt, 5179-90v14.patch, 5179-90v15.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This
[jira] [Commented] (HBASE-5231) Backport HBASE-3373 (per-table load balancing) to 0.92
[ https://issues.apache.org/jira/browse/HBASE-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13189650#comment-13189650 ] gaojinchao commented on HBASE-5231: --- It seems a little change. +1 Backport HBASE-3373 (per-table load balancing) to 0.92 -- Key: HBASE-5231 URL: https://issues.apache.org/jira/browse/HBASE-5231 Project: HBase Issue Type: Improvement Reporter: Zhihong Yu Attachments: 5231.txt This JIRA backports per-table load balancing to 0.90 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188938#comment-13188938 ] gaojinchao commented on HBASE-5179: --- +1, Good job! Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188939#comment-13188939 ] gaojinchao commented on HBASE-5179: --- @chunhui Do you want me to test this patch in our cluster ? Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186770#comment-13186770 ] gaojinchao commented on HBASE-5179: --- @chunhui Maybe it has a problem. the number of shutdownhandler thread pool is 3(default), If there are more than 3 deadserver is processing. we will wait forever. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186786#comment-13186786 ] gaojinchao commented on HBASE-5179: --- @chunhui Regarding to a normal flow. METAServerShutdownHandler use different thread pool. only init flow, scome cases we can't distinguish meta region server. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186928#comment-13186928 ] gaojinchao commented on HBASE-5179: --- In patch v7, Can we replace process expired server to public void splitLog(final String serverName)? Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4191) hbase load balancer needs locality awareness
[ https://issues.apache.org/jira/browse/HBASE-4191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186941#comment-13186941 ] gaojinchao commented on HBASE-4191: --- @Liyin This is a good feature, How do you process now? hbase load balancer needs locality awareness Key: HBASE-4191 URL: https://issues.apache.org/jira/browse/HBASE-4191 Project: HBase Issue Type: New Feature Reporter: Ted Yu Assignee: Liyin Tang Previously, HBASE-4114 implements the metrics for HFile HDFS block locality, which provides the HFile level locality information. But in order to work with load balancer and region assignment, we need the region level locality information. Let's define the region locality information first, which is almost the same as HFile locality index. HRegion locality index (HRegion A, RegionServer B) = (Total number of HDFS blocks that can be retrieved locally by the RegionServer B for the HRegion A) / ( Total number of the HDFS blocks for the Region A) So the HRegion locality index tells us that how much locality we can get if the HMaster assign the HRegion A to the RegionServer B. So there will be 2 steps involved to assign regions based on the locality. 1) During the cluster start up time, the master will scan the hdfs to calculate the HRegion locality index for each pair of HRegion and Region Server. It is pretty expensive to scan the dfs. So we only needs to do this once during the start up time. 2) During the cluster run time, each region server will update the HRegion locality index as metrics periodically as HBASE-4114 did. The Region Server can expose them to the Master through ZK, meta table, or just RPC messages. Based on the HRegion locality index, the assignment manager in the master would have a global knowledge about the region locality distribution and can run the MIN COST MAXIMUM FLOW solver to reach the global optimization. Let's construct the graph first: [Graph] Imaging there is a bipartite graph and the left side is the set of regions and the right side is the set of region servers. There is a source node which links itself to each node in the region set. There is a sink node which is linked from each node in the region server set. [Capacity] The capacity between the source node and region nodes is 1. And the capacity between the region nodes and region server nodes is also 1. (The purpose is each region can ONLY be assigned to one region server at one time) The capacity between the region server nodes and sink node are the avg number of regions which should be assigned each region server. (The purpose is balance the load for each region server) [Cost] The cost between each region and region server is the opposite of locality index, which means the higher locality is, if region A is assigned to region server B, the lower cost it is. The cost function could be more sophisticated when we put more metrics into account. So after running the min-cost max flow solver, the master could assign the regions based on the global locality optimization. Also the master should share this global view to secondary master in case the master fail over happens. In addition, the HBASE-4491 (Locality Checker) is the tool, which is based on the same metrics, to proactively to scan dfs to calculate the global locality information in the cluster. It will help us to verify data locality information during the run time. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3724) Load balancer improvements
[ https://issues.apache.org/jira/browse/HBASE-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186959#comment-13186959 ] gaojinchao commented on HBASE-3724: --- I found the balance ago in branch92 is invalid for our scenario. So I use this issue to hang all issues related to balance. If someone want to see it, it will be easy. Load balancer improvements -- Key: HBASE-3724 URL: https://issues.apache.org/jira/browse/HBASE-3724 Project: HBase Issue Type: Umbrella Reporter: stack Umbrella issue under which we hang all regions related to balancer -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187415#comment-13187415 ] gaojinchao commented on HBASE-5179: --- @chunhui Does this code make sense to you ? if (!this.serverManager.isDeadMetaServerInProgress()) { if(metaServerInfo != null){ this.fileSystemManager.splitLog(metaServerInfo.getServerName()); } this.assignmentManager.assignMeta(); assigned++; } Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187417#comment-13187417 ] gaojinchao commented on HBASE-5179: --- +1 v8 , we need some test cases to verify in real cluster. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()
[ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187421#comment-13187421 ] gaojinchao commented on HBASE-5202: --- look https://issues.apache.org/jira/browse/HBASE-5179. Maybe it can resolve this issue. NPE during Master failover in master.AssignmentManager.regionOnline() - Key: HBASE-5202 URL: https://issues.apache.org/jira/browse/HBASE-5202 Project: HBase Issue Type: Bug Affects Versions: 0.90.6 Reporter: Eugene Koontz Assignee: Eugene Koontz Attachments: HBASE-5202.patch, testMasterFailoverWithSlowRS.txt The following NPE can occur during master failover: {code} 2012-01-15 17:45:00,314 FATAL [Master:1;ip-10-166-123-193.us-west-1.compute.internal:36708] master.HMaster(944): Unhandled exception. Starting shutdown. java.lang.NullPointerException at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:724) at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214) at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:396) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:279) at java.lang.Thread.run(Thread.java:636) {code} This is caused by regionOnline() being passed a null serverInfo (its second parameter). The AssignmentManager's processFailover() method is passing a null to regionOnline() because the value that regionOnline is passing, hsi, is set as: {code} hsi = this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation()); {code} and {code} hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation()); {code} getHServerInfo() is defined as: {code} public HServerInfo getHServerInfo(final HServerAddress hsa) { synchronized(this.onlineServers) { // TODO: This is primitive. Do a better search. for (Map.EntryString, HServerInfo e: this.onlineServers.entrySet()) { if (e.getValue().getServerAddress().equals(hsa)) { return e.getValue(); } } } return null; } {code} This will return null if the onlineServers map does not yet have a value corresponding to the key supplied by the catalogTracker's getRootLocation() or getMetaLocation(). Since the catalogTracker uses zookeeper to establish the server locations of {{-ROOT-}} and {{.META.}}, while the onlineServers map is set according to the these servers' registering with the master, there can be an inconsistency between the catalogTracker and the onlineServers if either of these regionservers is online with respect to zookeeper, but haven't yet registered with the master (perhaps due to a high latency network between the master and the regionserver). The attached testMasterFailoverWithSlowRS.txt patch can be used to modify TestMasterFailover to cause this NPE. The proposed fix (provided along with the above test in a separate attachment) is for the master to use the new verifyMetaTablesAreUp() to wait for both of the servers named by the catalog tracker's getRootLocation() and getMetaLocation() to register with the master before the master can continue with failover. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3724) Load balancer improvements
[ https://issues.apache.org/jira/browse/HBASE-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187441#comment-13187441 ] gaojinchao commented on HBASE-3724: --- @zhihong In my scenario: 1. One region server has more than 1,000 regions.(hdf hard disk capacity(12T) / 3 (replication) / region size = 2G). 2. One moment, Dozens of Regions per region server are working for put operation. I went through the balance code , I found current balnace algorithm is invalid for our scenarios. below scenarios: 1)When adds one new machine to our cluster, Maybe all of hot regions(is working) will move to this one. 2)When one RS restarts, Maybe all of hot regions(is working) will move to this machine. Load balancer improvements -- Key: HBASE-3724 URL: https://issues.apache.org/jira/browse/HBASE-3724 Project: HBase Issue Type: Umbrella Reporter: stack Umbrella issue under which we hang all regions related to balancer -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3724) Load balancer improvements
[ https://issues.apache.org/jira/browse/HBASE-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187456#comment-13187456 ] gaojinchao commented on HBASE-3724: --- for example: we have 10 nodes cluster. 1000 regions per node and 50 regions are hot. per the current algorithm. If we add a new machine to this cluster. all the 50 hot regions will be moved to the new machine? Load balancer improvements -- Key: HBASE-3724 URL: https://issues.apache.org/jira/browse/HBASE-3724 Project: HBase Issue Type: Umbrella Reporter: stack Umbrella issue under which we hang all regions related to balancer -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3724) Load balancer improvements
[ https://issues.apache.org/jira/browse/HBASE-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187464#comment-13187464 ] gaojinchao commented on HBASE-3724: --- Thanks, I am looking forward to seeing this part of the code. Sorry,I didn't express clearly. In my scenario: 1.There is 10 nodes cluster. 1000 regions per node. 2.We add a new machine to the cluster. 3.Balance needs move 1000* 10/11= 909 regions to the new machine. each region server will move both 45 hot regions and 45 cold regions to the new one. in this case, all hot regions will move to this new one? Load balancer improvements -- Key: HBASE-3724 URL: https://issues.apache.org/jira/browse/HBASE-3724 Project: HBase Issue Type: Umbrella Reporter: stack Umbrella issue under which we hang all regions related to balancer -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3724) Load balancer improvements
[ https://issues.apache.org/jira/browse/HBASE-3724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13187488#comment-13187488 ] gaojinchao commented on HBASE-3724: --- Yes, you are right. we can't increase the limit to close to 1000 regions. because another reason is when hot regions are too many, It will produce many small Hfile. I have a solution to deal with this case. I will delevop a new balance algorithm for our scenario. Load balancer improvements -- Key: HBASE-3724 URL: https://issues.apache.org/jira/browse/HBASE-3724 Project: HBase Issue Type: Umbrella Reporter: stack Umbrella issue under which we hang all regions related to balancer -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186675#comment-13186675 ] gaojinchao commented on HBASE-5179: --- @chunhui I agree with you. In branch 90, I want to add a flag that marks SSH finish split Hlog. If all of Dead servers had split Hlog, Loss data should be quite rare. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5202) NPE during Master failover in master.AssignmentManager.regionOnline()
[ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186681#comment-13186681 ] gaojinchao commented on HBASE-5202: --- The root reason of this issue is some region server register lately. when one region server without META/ROOT registers atfer rebuildUserRegions finished. The regions in this one will be opened twice. NPE during Master failover in master.AssignmentManager.regionOnline() - Key: HBASE-5202 URL: https://issues.apache.org/jira/browse/HBASE-5202 Project: HBase Issue Type: Bug Affects Versions: 0.90.6 Reporter: Eugene Koontz Assignee: Eugene Koontz Attachments: HBASE-5202.patch, testMasterFailoverWithSlowRS.txt The following NPE can occur during master failover: {code} 2012-01-15 17:45:00,314 FATAL [Master:1;ip-10-166-123-193.us-west-1.compute.internal:36708] master.HMaster(944): Unhandled exception. Starting shutdown. java.lang.NullPointerException at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:724) at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214) at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:396) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:279) at java.lang.Thread.run(Thread.java:636) {code} This is caused by regionOnline() being passed a null serverInfo (its second parameter). The AssignmentManager's processFailover() method is passing a null to regionOnline() because the value that regionOnline is passing, hsi, is set as: {code} hsi = this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation()); {code} and {code} hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation()); {code} getHServerInfo() is defined as: {code} public HServerInfo getHServerInfo(final HServerAddress hsa) { synchronized(this.onlineServers) { // TODO: This is primitive. Do a better search. for (Map.EntryString, HServerInfo e: this.onlineServers.entrySet()) { if (e.getValue().getServerAddress().equals(hsa)) { return e.getValue(); } } } return null; } {code} This will return null if the onlineServers map does not yet have a value corresponding to the key supplied by the catalogTracker's getRootLocation() or getMetaLocation(). Since the catalogTracker uses zookeeper to establish the server locations of {{-ROOT-}} and {{.META.}}, while the onlineServers map is set according to the these servers' registering with the master, there can be an inconsistency between the catalogTracker and the onlineServers if either of these regionservers is online with respect to zookeeper, but haven't yet registered with the master (perhaps due to a high latency network between the master and the regionserver). The attached testMasterFailoverWithSlowRS.txt patch can be used to modify TestMasterFailover to cause this NPE. The proposed fix (provided along with the above test in a separate attachment) is for the master to use the new verifyMetaTablesAreUp() to wait for both of the servers named by the catalog tracker's getRootLocation() and getMetaLocation() to register with the master before the master can continue with failover. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186689#comment-13186689 ] gaojinchao commented on HBASE-5179: --- @chunhui Root is fine. But in the initialization phase, Meta flag is always false. so this.serverManager.isDeadMetaServerInProgress is invalid. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186695#comment-13186695 ] gaojinchao commented on HBASE-5179: --- Yes, it always false in the inializtion phase. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186701#comment-13186701 ] gaojinchao commented on HBASE-5179: --- @chunhui I thought that you have said, I want to get from root table , thare is also some problem. for example: 1. root and meta is same machine. 2. root is down and so on. I don't find a good way to do this. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186704#comment-13186704 ] gaojinchao commented on HBASE-5179: --- +1 on introduceing Trunk's logic Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186708#comment-13186708 ] gaojinchao commented on HBASE-5179: --- This logic is very complex. I thought a few days and did not find a good way to slove all problems. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186720#comment-13186720 ] gaojinchao commented on HBASE-5179: --- You can test this case: 1.create some table 2.restart hmaster. after geting knownServers wait 60s(added some code in hmaster) LOG.info(+++sleep 60+++ ); Thread.sleep(6); 3.kill(kill -9) the meta region server when Hmaster log print +++sleep 60+++ 4.Master gets the event of Meta region server is down. Before split Hlog, sleep 90s 5. check the table after master finish init Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186725#comment-13186725 ] gaojinchao commented on HBASE-5179: --- I configure the zk.session is 40s. So I design this case uses sleeping 60s Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186726#comment-13186726 ] gaojinchao commented on HBASE-5179: --- @chunhui It is fine. after you finhish, I will also test in our cluster. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186142#comment-13186142 ] gaojinchao commented on HBASE-5179: --- Please resolve this issue firstly. Maybe HBase-4748 need a long time. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186155#comment-13186155 ] gaojinchao commented on HBASE-5179: --- @zhihong Regarding to 5179-90v2.patch, when dead servers are processing, their logs would be split by ServerShutdownHandler. Maybe this change is result in meta data loss. for example: without patch: 1.split the Hlog with regionsever don't report itself(eg My comments@14/Jan/12 05:38 1,2 case will don't report). 2. assign the Meta/Root region(If meta region server has some exceptions, because we have split the Hlog, so there is no meta data loss) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186157#comment-13186157 ] gaojinchao commented on HBASE-5179: --- @zhihong I am sorry to submit my comments lately. I am testing patch v2. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186161#comment-13186161 ] gaojinchao commented on HBASE-5179: --- Current,This is my guess. I am developing some code to produce this scenario. If I have further information , I will inform you Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186208#comment-13186208 ] gaojinchao commented on HBASE-5179: --- I had produced this case: In order to simulate some dead server is being prcoessed: 1.create some table 2.restart hmaster, after geting region server , wait 60s(I added some code in hmaster) int regionCount = this.serverManager.waitForRegionServers(); LOG.info(sleep 60 ); Thread.sleep(6); 3.kill the meta region server 4.Master gets the envents of Meta region server is down. 5.assign the meta table. 6. SSH start to split the Hlog(some meta will lose) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3933) Hmaster throw NullPointerException
[ https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186097#comment-13186097 ] gaojinchao commented on HBASE-3933: --- @Eugene Yes the issue exits in branch90. I avoid this by increasing hbase.master.wait.on.regionservers.timeout Hmaster throw NullPointerException -- Key: HBASE-3933 URL: https://issues.apache.org/jira/browse/HBASE-3933 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6 Reporter: gaojinchao Assignee: Eugene Koontz Attachments: HBASE-3933.patch, Hmastersetup0.90 NullPointerException while hmaster starting. {code} java.lang.NullPointerException at java.util.TreeMap.getEntry(TreeMap.java:324) at java.util.TreeMap.get(TreeMap.java:255) at org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512) at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606) at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214) at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283) {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3933) Hmaster throw NullPointerException
[ https://issues.apache.org/jira/browse/HBASE-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186098#comment-13186098 ] gaojinchao commented on HBASE-3933: --- @Eugene In your patches, You only deale with the root/meta regionserver. If a normal regionserver registers laterly. Master will process it as a dead one. Some regions in the later one will be opened twice. Hmaster throw NullPointerException -- Key: HBASE-3933 URL: https://issues.apache.org/jira/browse/HBASE-3933 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6 Reporter: gaojinchao Assignee: Eugene Koontz Attachments: HBASE-3933.patch, Hmastersetup0.90 NullPointerException while hmaster starting. {code} java.lang.NullPointerException at java.util.TreeMap.getEntry(TreeMap.java:324) at java.util.TreeMap.get(TreeMap.java:255) at org.apache.hadoop.hbase.master.AssignmentManager.addToServers(AssignmentManager.java:1512) at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:606) at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214) at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:402) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283) {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186099#comment-13186099 ] gaojinchao commented on HBASE-5179: --- When master starts, As current flow, RS has three part: 1.Some has registered in Hmaster by heatbeat report 2.Some is dead server being processed by ssh 3.Some is waiting zk is expired(default session timeout is 3 minutes) 1,2 is easy. 3 is little diffult. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186100#comment-13186100 ] gaojinchao commented on HBASE-5179: --- Other problem is shutdownhandler flow can't get the meta flag when master starts. See below comment in expired function. // Was this server carrying meta? Can't ask CatalogTracker because it // may have reset the meta location as null already (it may have already // run into fact that meta is dead). I can ask assignment manager. It // has an inmemory list of who has what. This list will be cleared as we // process the dead server but should be find asking it now. HServerAddress address = ct.getMetaLocation(); boolean carryingMeta = Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186116#comment-13186116 ] gaojinchao commented on HBASE-5179: --- I suggest that we don't. I need some time to a more detailed verification. Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch If master's processing its failover and ServerShutdownHandler's processing happen concurrently, it may appear following case. 1.master completed splitLogAfterStartup() 2.RegionserverA restarts, and ServerShutdownHandler is processing. 3.master starts to rebuildUserRegions, and RegionserverA is considered as dead server. 4.master starts to assign regions of RegionserverA because it is a dead server by step3. However, when doing step4(assigning region), ServerShutdownHandler may be doing split log, Therefore, it may cause data loss. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5152) Region is on service before completing initialized when doing rollback of split, it will affect read correctness
[ https://issues.apache.org/jira/browse/HBASE-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182511#comment-13182511 ] gaojinchao commented on HBASE-5152: --- It seems the code should be above if (coprocessorHost != null. coprocessorHost.postOpen() means we have opened the region. Region is on service before completing initialized when doing rollback of split, it will affect read correctness - Key: HBASE-5152 URL: https://issues.apache.org/jira/browse/HBASE-5152 Project: HBase Issue Type: Bug Reporter: chunhui shen Attachments: hbase-5152.patch -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4988) MetaServer crash cause all splitting regionserver abort
[ https://issues.apache.org/jira/browse/HBASE-4988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13182547#comment-13182547 ] gaojinchao commented on HBASE-4988: --- +1 Good job! MetaServer crash cause all splitting regionserver abort --- Key: HBASE-4988 URL: https://issues.apache.org/jira/browse/HBASE-4988 Project: HBase Issue Type: Bug Reporter: chunhui shen Attachments: hbase-4988v1.patch If metaserver crash now, All the splitting regionserver will abort theirself. Becasue the code {code} this.journal.add(JournalEntry.PONR); MetaEditor.offlineParentInMeta(server.getCatalogTracker(), this.parent.getRegionInfo(), a.getRegionInfo(), b.getRegionInfo()); {code} If the JournalEntry is PONR, split's roll back will abort itselef. It is terrible in huge putting environment when metaserver crash -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5060) HBase client is blocked forever
[ https://issues.apache.org/jira/browse/HBASE-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13172069#comment-13172069 ] gaojinchao commented on HBASE-5060: --- Test case passed: My test code: try { HBaseAdmin hbase = new HBaseAdmin(config); while (true) { try { if (hbase.tableExists(tableName)) { System.out.println([FATAL] The usertable: + tableName + is already existed); } try { Thread.sleep(50); } catch (InterruptedException e) { continue; } }catch(IOException e){ e.printStackTrace(); continue; } } 1. run test case 2. kill two zk servers(total three zk servers) 3. start the killed server again HBase client is blocked forever --- Key: HBASE-5060 URL: https://issues.apache.org/jira/browse/HBASE-5060 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Critical Fix For: 0.92.1, 0.90.6 Attachments: HBASE-5060_Branch90trial.patch, HBASE-5060_trunk.patch Since the client had a temporary network failure, After it recovered. I found my client thread was blocked. Looks below stack and logs, It said that we use a invalid CatalogTracker in function tableExists. Block stack: WriteHbaseThread33 prio=10 tid=0x7f76bc27a800 nid=0x2540 in Object.wait() [0x7f76af4f3000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331) - locked 0x7f7a67817c98 (a java.util.concurrent.atomic.AtomicBoolean) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366) at org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:427) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:164) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) - locked 0x7f7a4c5dc578 (a com.huawei.hdi.hbase.HbaseReOper) at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source) at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source) In ZooKeeperNodeTracker, We don't throw the KeeperException to high level. So in CatalogTracker level, We think ZooKeeperNodeTracker start success and continue to process . [WriteHbaseThread33]2011-12-16 17:07:33,153[WARN ] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Unable to get data of znode /hbase/root-region-server | org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:557) org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549) at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73) at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136) at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source) at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source) [WriteHbaseThread33]2011-12-16 17:07:33,361[ERROR] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Received unexpected KeeperException, re-throwing exception | org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.keeperException(ZooKeeperWatcher.java:385) org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
[jira] [Commented] (HBASE-4970) Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch)
[ https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171986#comment-13171986 ] gaojinchao commented on HBASE-4970: --- No problem, Thanks for your work! Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch) --- Key: HBASE-4970 URL: https://issues.apache.org/jira/browse/HBASE-4970 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Trivial Fix For: 0.90.6 Attachments: HBASE-4970_Branch90.patch, HBASE-4970_Branch90_V1_trial.patch, HBASE-4970_Branch90_V2.patch, HBASE-4970_Branch92_V2.patch, HBASE-4970_Trunk_V2.patch In my cluster, I changed keepAliveTime from 60 s to 3600 s. Increasing RES is slowed down. Why increasing keepAliveTime of HBase thread pool is slowing down our problem occurance [RES value increase]? You can go through the source of sun.nio.ch.Util. Every thread hold 3 softreference of direct buffer(mustangsrc) for reusage. The code names the 3 softreferences buffercache. If the buffer was all occupied or none was suitable in size, and new request comes, new direct buffer is allocated. After the service, the bigger one replaces the smaller one in buffercache. The replaced buffer is released. So I think we can add a parameter to change keepAliveTime of Htable thread pool. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4970) Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch)
[ https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13171381#comment-13171381 ] gaojinchao commented on HBASE-4970: --- @Lars Thanks for your review. I tend to only modify the parameters. Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch) --- Key: HBASE-4970 URL: https://issues.apache.org/jira/browse/HBASE-4970 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Trivial Fix For: 0.90.6 Attachments: HBASE-4970_Branch90.patch, HBASE-4970_Branch90_V1_trial.patch, HBASE-4970_Branch90_V2.patch, HBASE-4970_Branch92_V2.patch, HBASE-4970_Trunk_V2.patch In my cluster, I changed keepAliveTime from 60 s to 3600 s. Increasing RES is slowed down. Why increasing keepAliveTime of HBase thread pool is slowing down our problem occurance [RES value increase]? You can go through the source of sun.nio.ch.Util. Every thread hold 3 softreference of direct buffer(mustangsrc) for reusage. The code names the 3 softreferences buffercache. If the buffer was all occupied or none was suitable in size, and new request comes, new direct buffer is allocated. After the service, the bigger one replaces the smaller one in buffercache. The replaced buffer is released. So I think we can add a parameter to change keepAliveTime of Htable thread pool. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5009) Failure of creating split dir if it already exists prevents splits from happening further
[ https://issues.apache.org/jira/browse/HBASE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13169266#comment-13169266 ] gaojinchao commented on HBASE-5009: --- Yes , that is th root reason, I think we should guarantee all children threads is stoped. At same time, splitdir is not useful, we alse delete it. It seems no harm Failure of creating split dir if it already exists prevents splits from happening further - Key: HBASE-5009 URL: https://issues.apache.org/jira/browse/HBASE-5009 Project: HBase Issue Type: Bug Affects Versions: 0.90.6 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan The scenario is - The split of a region takes a long time - The deletion of the splitDir fails due to HDFS problems. - Subsequent splits also fail after that. {code} private static void createSplitDir(final FileSystem fs, final Path splitdir) throws IOException { if (fs.exists(splitdir)) throw new IOException(Splitdir already exits? + splitdir); if (!fs.mkdirs(splitdir)) throw new IOException(Failed create of + splitdir); } {code} Correct me if am wrong? If it is an issue can we change the behaviour of throwing exception? Pls suggest. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-5008) The clusters can't provide services because Region can't flush.
[ https://issues.apache.org/jira/browse/HBASE-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13167086#comment-13167086 ] gaojinchao commented on HBASE-5008: --- TestSplitTransactionOnCluster and TestSplitTransaction have passed. All test cases are running and will give a result tomorrow. The clusters can't provide services because Region can't flush. Key: HBASE-5008 URL: https://issues.apache.org/jira/browse/HBASE-5008 Project: HBase Issue Type: Bug Components: regionserver Reporter: gaojinchao Priority: Blocker Fix For: 0.90.6 Attachments: HBASE-5008_Branch90.patch Hbase version 0.90.4 + patches My analysis is as follows: //Started splitting region b24d8ccb852ff742f2a27d01b7f5853e and closed region. 2011-12-10 17:32:48,653 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e. 2011-12-10 17:32:49,759 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: disabling compactions flushes 2011-12-10 17:32:49,759 INFO org.apache.hadoop.hbase.regionserver.HRegion: Running close preflush of Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e. //Processed a flush request and skipped , But flushRequested had set to true 2011-12-10 17:33:06,963 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e., current region memstore size 12.6m 2011-12-10 17:33:17,277 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Skipping flush on Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e. because closing //split region b24d8ccb852ff742f2a27d01b7f5853 failed and rolled back, flushRequested flag was true, So all handle was blocked 2011-12-10 17:34:01,293 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Cleaned up old failed split transaction detritus: hdfs://193.195.18.121:9000/hbase/Htable_UFDR_004/b24d8ccb852ff742f2a27d01b7f5853e/splits 2011-12-10 17:34:01,294 INFO org.apache.hadoop.hbase.regionserver.HRegion: Onlined Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.; next sequenceid=15494173 2011-12-10 17:34:01,295 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Successful rollback of failed split of Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e. 2011-12-10 17:43:10,147 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 19 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is = than blocking 384.0m size // All handles had been blocked. The clusters could not provide services 2011-12-10 17:34:01,295 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Successful rollback of failed split of Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e. 2011-12-10 17:43:10,147 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 19 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is = than blocking 384.0m size 2011-12-10 17:43:10,192 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 34 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is = than blocking 384.0m size 2011-12-10 17:43:10,193 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 51 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is = than blocking 384.0m size 2011-12-10 17:43:10,196 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 85 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is = than blocking 384.0m size 2011-12-10 17:43:10,199 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 88 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is = than blocking 384.0m size 2011-12-10 17:43:10,202 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 44 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is = than blocking 384.0m size 2011-12-10 17:43:11,663 INFO
[jira] [Commented] (HBASE-4970) Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch)
[ https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13165802#comment-13165802 ] gaojinchao commented on HBASE-4970: --- I also want to add a parameter to change keepAliveTime of Htable thread pool. so that clients can have more option Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch) --- Key: HBASE-4970 URL: https://issues.apache.org/jira/browse/HBASE-4970 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Trivial Fix For: 0.90.5 Attachments: HBASE-4970_Branch90.patch, HBASE-4970_Branch90_V1_trial.patch In my cluster, I changed keepAliveTime from 60 s to 3600 s. Increasing RES is slowed down. Why increasing keepAliveTime of HBase thread pool is slowing down our problem occurance [RES value increase]? You can go through the source of sun.nio.ch.Util. Every thread hold 3 softreference of direct buffer(mustangsrc) for reusage. The code names the 3 softreferences buffercache. If the buffer was all occupied or none was suitable in size, and new request comes, new direct buffer is allocated. After the service, the bigger one replaces the smaller one in buffercache. The replaced buffer is released. So I think we can add a parameter to change keepAliveTime of Htable thread pool. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4970) Add a parameter to change keepAliveTime of Htable thread pool.
[ https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13164276#comment-13164276 ] gaojinchao commented on HBASE-4970: --- Sorry, I didn't see the Lars's comment. I will try to backport HBASE-4805. Add a parameter to change keepAliveTime of Htable thread pool. --- Key: HBASE-4970 URL: https://issues.apache.org/jira/browse/HBASE-4970 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Trivial Fix For: 0.90.5 Attachments: HBASE-4970_Branch90.patch In my cluster, I changed keepAliveTime from 60 s to 3600 s. Increasing RES is slowed down. Why increasing keepAliveTime of HBase thread pool is slowing down our problem occurance [RES value increase]? You can go through the source of sun.nio.ch.Util. Every thread hold 3 softreference of direct buffer(mustangsrc) for reusage. The code names the 3 softreferences buffercache. If the buffer was all occupied or none was suitable in size, and new request comes, new direct buffer is allocated. After the service, the bigger one replaces the smaller one in buffercache. The replaced buffer is released. So I think we can add a parameter to change keepAliveTime of Htable thread pool. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4970) Add a parameter to change keepAliveTime of Htable thread pool.
[ https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13164364#comment-13164364 ] gaojinchao commented on HBASE-4970: --- Fixed Lars's comment. @Lars Please review firstly, I will test it in real cluster tomorrow. Add a parameter to change keepAliveTime of Htable thread pool. --- Key: HBASE-4970 URL: https://issues.apache.org/jira/browse/HBASE-4970 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Trivial Fix For: 0.90.5 Attachments: HBASE-4970_Branch90.patch, HBASE-4970_Branch90_V1_trial.patch In my cluster, I changed keepAliveTime from 60 s to 3600 s. Increasing RES is slowed down. Why increasing keepAliveTime of HBase thread pool is slowing down our problem occurance [RES value increase]? You can go through the source of sun.nio.ch.Util. Every thread hold 3 softreference of direct buffer(mustangsrc) for reusage. The code names the 3 softreferences buffercache. If the buffer was all occupied or none was suitable in size, and new request comes, new direct buffer is allocated. After the service, the bigger one replaces the smaller one in buffercache. The replaced buffer is released. So I think we can add a parameter to change keepAliveTime of Htable thread pool. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4970) Add a parameter to change keepAliveTime of Htable thread pool.
[ https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13164127#comment-13164127 ] gaojinchao commented on HBASE-4970: --- ok, No problem. Add a parameter to change keepAliveTime of Htable thread pool. --- Key: HBASE-4970 URL: https://issues.apache.org/jira/browse/HBASE-4970 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Trivial Fix For: 0.90.5 In my cluster, I changed keepAliveTime from 60 s to 3600 s. Increasing RES is slowed down. Why increasing keepAliveTime of HBase thread pool is slowing down our problem occurance [RES value increase]? You can go through the source of sun.nio.ch.Util. Every thread hold 3 softreference of direct buffer(mustangsrc) for reusage. The code names the 3 softreferences buffercache. If the buffer was all occupied or none was suitable in size, and new request comes, new direct buffer is allocated. After the service, the bigger one replaces the smaller one in buffercache. The replaced buffer is released. So I think we can add a parameter to change keepAliveTime of Htable thread pool. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4633) Potential memory leak in client RPC timeout mechanism
[ https://issues.apache.org/jira/browse/HBASE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162538#comment-13162538 ] gaojinchao commented on HBASE-4633: --- Hbase version is 0.90.4 + patch. Cluseter number is 10 One HBase client process includes 50 threads, So the max threads connect to the RS is (50 * RS number). I have noticed some memory leak problems in my HBase client. RES has increased to 27g PID USER PR NI VIRT RES SHR S %CPU %MEMTIME+ COMMAND 12676 root 20 0 30.8g 27g 5092 S2 57.5 587:57.76 /opt/java/jre/bin/java -Djava.library.path=lib/. But I am not sure the leak comes from HBase Client jar itself or just our client code. This is some parameters of jvm. :-Xms15g -Xmn12g -Xmx15g -XX:PermSize=64m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=65 -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=1 -XX:+CMSParallelRemarkEnabled Potential memory leak in client RPC timeout mechanism - Key: HBASE-4633 URL: https://issues.apache.org/jira/browse/HBASE-4633 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.3 Environment: HBase version: 0.90.3 + Patches , Hadoop version: CDH3u0 Reporter: Shrijeet Paliwal Attachments: HBaseclientstack.png Relevant Jiras: https://issues.apache.org/jira/browse/HBASE-2937, https://issues.apache.org/jira/browse/HBASE-4003 We have been using the 'hbase.client.operation.timeout' knob introduced in 2937 for quite some time now. It helps us enforce SLA. We have two HBase clusters and two HBase client clusters. One of them is much busier than the other. We have seen a deterministic behavior of clients running in busy cluster. Their (client's) memory footprint increases consistently after they have been up for roughly 24 hours. This memory footprint almost doubles from its usual value (usual case == RPC timeout disabled). After much investigation nothing concrete came out and we had to put a hack which keep heap size in control even when RPC timeout is enabled. Also note , the same behavior is not observed in 'not so busy cluster. The patch is here : https://gist.github.com/1288023 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4633) Potential memory leak in client RPC timeout mechanism
[ https://issues.apache.org/jira/browse/HBASE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13162596#comment-13162596 ] gaojinchao commented on HBASE-4633: --- I have tested, Memory does not increase when specified MaxDirectMemorySize with moderate value. In my cluster, nearly one hours , trigger a full GC. look this logs: 10022.210: [Full GC (System) 10022.210: [Tenured: 577566K-257349K(1048576K), 1.7515610 secs] 9651924K-257349K(14260672K), [Perm : 19161K-19161K(65536K)], 1.7518350 secs] [Times: user=1.75 sys=0.00, real=1.75 secs] . . 13532.930: [GC 13532.931: [ParNew: 12801558K-981626K(13212096K), 0.1414370 secs] 13111752K-1291828K(14260672K), 0.1416880 secs] [Times: user=1.90 sys=0.01, real=0.14 secs] 13624.630: [Full GC (System) 13624.630: [Tenured: 310202K-175378K(1048576K), 1.9529280 secs] 11581276K-175378K(14260672K), [Perm : 19225K-19225K(65536K)], 1.9531660 secs] [Times: user=1.94 sys=0.00, real=1.96 secs] I monitored the memory. It is stable. 7543 root 20 0 16.9g 15g 9892 S1 33.0 1258:59 java 7543 root 20 0 16.9g 15g 9892 S0 33.0 1258:59 java 7543 root 20 0 16.9g 15g 9892 S1 33.0 1258:59 java 7543 root 20 0 16.9g 15g 9892 S0 33.0 1258:59 java 7543 root 20 0 16.9g 15g 9892 S1 33.0 1258:59 java 7543 root 20 0 16.9g 15g 9892 S1 33.0 1259:00 java Potential memory leak in client RPC timeout mechanism - Key: HBASE-4633 URL: https://issues.apache.org/jira/browse/HBASE-4633 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.3 Environment: HBase version: 0.90.3 + Patches , Hadoop version: CDH3u0 Reporter: Shrijeet Paliwal Attachments: HBaseclientstack.png Relevant Jiras: https://issues.apache.org/jira/browse/HBASE-2937, https://issues.apache.org/jira/browse/HBASE-4003 We have been using the 'hbase.client.operation.timeout' knob introduced in 2937 for quite some time now. It helps us enforce SLA. We have two HBase clusters and two HBase client clusters. One of them is much busier than the other. We have seen a deterministic behavior of clients running in busy cluster. Their (client's) memory footprint increases consistently after they have been up for roughly 24 hours. This memory footprint almost doubles from its usual value (usual case == RPC timeout disabled). After much investigation nothing concrete came out and we had to put a hack which keep heap size in control even when RPC timeout is enabled. Also note , the same behavior is not observed in 'not so busy cluster. The patch is here : https://gist.github.com/1288023 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4773) HBaseAdmin leaks ZooKeeper connections
[ https://issues.apache.org/jira/browse/HBASE-4773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157352#comment-13157352 ] gaojinchao commented on HBASE-4773: --- In TRUNK, before throwing exception, we should call deleteStaleConnection to clean the dirty data HBaseAdmin leaks ZooKeeper connections -- Key: HBASE-4773 URL: https://issues.apache.org/jira/browse/HBASE-4773 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Priority: Critical Fix For: 0.90.5 Attachments: 4773.patch When master crashs, HBaseAdmin will leaks ZooKeeper connections I think we should close the zk connetion when throw MasterNotRunningException public HBaseAdmin(Configuration c) throws MasterNotRunningException, ZooKeeperConnectionException { this.conf = HBaseConfiguration.create(c); this.connection = HConnectionManager.getConnection(this.conf); this.pause = this.conf.getLong(hbase.client.pause, 1000); this.numRetries = this.conf.getInt(hbase.client.retries.number, 10); this.retryLongerMultiplier = this.conf.getInt(hbase.client.retries.longer.multiplier, 10); //we should add this code and close the zk connection try{ this.connection.getMaster(); }catch(MasterNotRunningException e){ HConnectionManager.deleteConnection(conf, false); throw e; } } -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails
[ https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157353#comment-13157353 ] gaojinchao commented on HBASE-4868: --- Thanks reveiw, I will fix all the comments testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails - Key: HBASE-4868 URL: https://issues.apache.org/jira/browse/HBASE-4868 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.92.0 Reporter: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0 Attachments: HBASE-4868_trial.patch looks: https://builds.apache.org/job/HBase-TRUNK-security/7/testReport/org.apache.hadoop.hbase.util.hbck/TestOfflineMetaRebuildBase/testMetaRebuild/ Please review, see whether the method makes sense? If it makes sense, I will check other cases? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails
[ https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157400#comment-13157400 ] gaojinchao commented on HBASE-4868: --- @Test(timeout = 12) public void testMetaRebuild() throws Exception { This code can't guarantee ?. I don't understand where should add? Please remind me :) -邮件原件- 发件人: Ted Yu (Commented) (JIRA) [mailto:j...@apache.org] 发送时间: 2011年11月26日 14:59 收件人: Gaojinchao 主题: [jira] [Commented] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails [ https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13157395#comment-13157395 ] Ted Yu commented on HBASE-4868: --- HadoopQA is having problem - it didn't run test suite. @Jinchao: Please also add the timeout parameter as Jonathan suggested. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails - Key: HBASE-4868 URL: https://issues.apache.org/jira/browse/HBASE-4868 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.92.0 Reporter: gaojinchao Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0 Attachments: HBASE-4868_trial.patch, HBASE-4868_trunkv2.patch looks: https://builds.apache.org/job/HBase-TRUNK-security/7/testReport/org.apache.hadoop.hbase.util.hbck/TestOfflineMetaRebuildBase/testMetaRebuild/ Please review, see whether the method makes sense? If it makes sense, I will check other cases? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13155050#comment-13155050 ] gaojinchao commented on HBASE-4739: --- This patch is not compatible, I added M_ZK_REGION_CLOSING and delete RS_ZK_REGION_CLOSING in EventHandler.java. I have another question, Can I delete below code block in function unassign(HRegionInfo region, boolean force) ? } catch (NotServingRegionException nsre) { LOG.info(Server + server + returned + nsre + for + region.getEncodedName()); // Presume that master has stale data. Presume remote side just split. // Presume that the split message when it comes in will fix up the master's // in memory cluster state. }catch (Throwable t) I think we should use the wrap of RemoteException. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4739_trial2.patch, 4739_trialV3.patch, HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trail5.patch, HBASE-4739_trial.patch, HBASE-4739_trial6.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13155600#comment-13155600 ] gaojinchao commented on HBASE-4739: --- @Ted It seems 0.90.5 logic is ok. 1. if RS_ZK_REGION_CLOSING is created, It says that RS has received the RPC 2. When RIT is timeout, There is two case, one RS is slow, in this case we don't need send RPC again. another case, closing the region has exception, we send rpc can't solve the problem, it may also fail. So I think we don't need fix anything. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4739_trial2.patch, 4739_trialV3.patch, HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trail5.patch, HBASE-4739_trial.patch, HBASE-4739_trial6.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13153969#comment-13153969 ] gaojinchao commented on HBASE-4739: --- HBASE-4739_trail5 made a few changes, Please review, if it makes sense, I will verify in a real cluster. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4739_trial2.patch, 4739_trialV3.patch, HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trail5.patch, HBASE-4739_trial.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13154010#comment-13154010 ] gaojinchao commented on HBASE-4739: --- I think we don't need handle RegionAlreadyInTransitionException exception. We only need update the timestamp of RIT,we have done. my reason is : 1. The moniter timeout is 30 minutes, There are enough time to close a region. 2. if the RS throws RegionAlreadyInTransitionException exception, we need update the timestamp of RIT and wait next timeout. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4739_trial2.patch, 4739_trialV3.patch, HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trail5.patch, HBASE-4739_trial.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151923#comment-13151923 ] gaojinchao commented on HBASE-4739: --- Only sends a RPC, I think state machine lose a state flag. Master doesn't distinguish pending or closing. Code structure is not well. But this patch is not compatible. Fixed other comments Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4739_trial2.patch, HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trial.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13152021#comment-13152021 ] gaojinchao commented on HBASE-4739: --- Not completed the test ,tomorrow continue! Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4739_trial2.patch, 4739_trialV3.patch, HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trial.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13152585#comment-13152585 ] gaojinchao commented on HBASE-4739: --- @J-D In 0.92 version, uses HBASE-4739_Trunk_V2 in timeout monitor for sending a CLOSING rpc.(I try to modify this patch) In trunk, uses patch 4739_trialV3. Hbase thousands of people in the use of, If we once, may appear more. So I think we need slove this isse. What do you say J-D? I will do some more detailed testing about these patches and give my test cases. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4739_trial2.patch, 4739_trialV3.patch, HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trial.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4654) [replication] Add a check to make sure we don't replicate to ourselves
[ https://issues.apache.org/jira/browse/HBASE-4654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151143#comment-13151143 ] gaojinchao commented on HBASE-4654: --- Do we need throw exceptin in api addPeer? [replication] Add a check to make sure we don't replicate to ourselves -- Key: HBASE-4654 URL: https://issues.apache.org/jira/browse/HBASE-4654 Project: HBase Issue Type: Improvement Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Fix For: 0.90.5 Attachments: 4654-trunk.txt It's currently possible to add a peer for replication and point it to the local cluster, which I believe could very well happen for those like us that use only one ZK ensemble per DC so that only the root znode changes when you want to set up replication intra-DC. I don't think comparing just the cluster ID would be enough because you would normally use a different one for another cluster and nothing will block you from pointing elsewhere. Comparing the ZK ensemble address doesn't work either when you have multiple DNS entries that point at the same place. I think this could be resolved by looking up the master address in the relevant znode as it should be exactly the same thing in the case where you have the same cluster. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13151712#comment-13151712 ] gaojinchao commented on HBASE-4739: --- trail2 fixed your comment. if you and Ram make sense, I will test it. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4739_trial2.patch, HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trial.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150310#comment-13150310 ] gaojinchao commented on HBASE-4739: --- about test failed: 1.org.apache.hadoop.hbase.io.hfile.TestHFileBlock Look this comments public void testBlockHeapSize() { // We have seen multiple possible values for this estimate of the heap size // of a ByteBuffer, presumably depending on the JDK version. assertTrue(HFileBlock.BYTE_BUFFER_HEAP_SIZE == 64 || HFileBlock.BYTE_BUFFER_HEAP_SIZE == 80); But in https://issues.apache.org/jira/browse/HBASE-4768 We add some code snippets: assertEquals(80, HFileBlock.BYTE_BUFFER_HEAP_SIZE); long byteBufferExpectedSize = ClassSize.align(ClassSize.estimateBase(buf.getClass(), true) + HFileBlock.HEADER_SIZE + size); 2.org.apache.hadoop.hbase.master.TestDistributedLogSplitting Because we choose a rs with 0 regions. // it said that regions is 0. 2011-11-15 03:53:11,215 INFO [Thread-2335] master.TestDistributedLogSplitting(211): #regions = 0 2011-11-15 03:53:11,215 DEBUG [RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer] wal.HLog$LogSyncer(1192): RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer interrupted while waiting for sync requests 2011-11-15 03:53:11,215 INFO [RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer] wal.HLog$LogSyncer(1194): RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer exiting 2011-11-15 03:53:11,215 DEBUG [Thread-2335] wal.HLog(967): closing hlog writer in hdfs://localhost:46229/user/jenkins/.logs/asf001.sp2.ygridcore.net,36721,1321329179789 2011-11-15 03:53:11,637 DEBUG [Thread-2335] master.SplitLogManager(233): Scheduling batch of logs to split Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150325#comment-13150325 ] gaojinchao commented on HBASE-4739: --- I made a issue for 2: https://issues.apache.org/jira/browse/HBASE-4790 Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149528#comment-13149528 ] gaojinchao commented on HBASE-4739: --- --- T E S T S --- --- T E S T S --- Running org.apache.hadoop.hbase.master.TestMasterFailover Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 99.82 sec Results : Tests run: 4, Failures: 0, Errors: 0, Skipped: 0 Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: HBASE-4739_Trunk.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13149532#comment-13149532 ] gaojinchao commented on HBASE-4739: --- Please review this patch. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: HBASE-4739_Trunk.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150142#comment-13150142 ] gaojinchao commented on HBASE-4739: --- Why was the check for zk node existence in actOnTimeOut() at line 2531 removed ? This code snippets is used in branch 0.90, because the zk node RS_ZK_REGION_CLOSING is created by RS, But in trunk the zk node RS_ZK_REGION_CLOSING is created by master. So this conditin should be impossible. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: HBASE-4739_Trunk.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150168#comment-13150168 ] gaojinchao commented on HBASE-4739: --- V2 fixs Ted's comment. In my local, replication test case passed. I will try to dig this issue https://builds.apache.org/job/PreCommit-HBASE-Build/243//testReport/org.apache.hadoop.hbase.replication/TestReplication/queueFailover/; Running org.apache.hadoop.hbase.replication.TestReplicationSource Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.897 sec Running org.apache.hadoop.hbase.replication.TestReplicationPeer Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 25.538 sec Running org.apache.hadoop.hbase.replication.regionserver.TestReplicationSourceManager Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.089 sec Running org.apache.hadoop.hbase.replication.regionserver.TestReplicationSink Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 22.178 sec Running org.apache.hadoop.hbase.replication.TestReplication Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 145.709 sec Running org.apache.hadoop.hbase.replication.TestMultiSlaveReplication Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 55.521 sec Running org.apache.hadoop.hbase.replication.TestMasterReplication Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 99.819 sec Running org.apache.hadoop.hbase.TestServerName Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150267#comment-13150267 ] gaojinchao commented on HBASE-4739: --- @Ram That is a good suggestion, But timeout monitor needs 30 minutes. should we wait so long time ? Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13150290#comment-13150290 ] gaojinchao commented on HBASE-4739: --- In fact, I made another patch that is more complicated and is not compatible, so I didn't commit. another patch step: 1. Hmaster creater a zk node(M_ZK_REGION_PENDING_CLOSE) and set RIT to pendingclose state 2. send rpc to RS 3. RS change zk node to RS_ZK_REGION_CLOSING 4. Master changes RIT to closing if above steps, We can distinguish the state of RIT. if M_ZK_REGION_PENDING_CLOSE we can send a rpc. RS_ZK_REGION_CLOSING we can add to RIT. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13148300#comment-13148300 ] gaojinchao commented on HBASE-4739: --- Yes, I am trying to reproduce this issue in trunk and make a patch. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4681) StringIndexOutOfBoundsException parsing Hostname
[ https://issues.apache.org/jira/browse/HBASE-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13146878#comment-13146878 ] gaojinchao commented on HBASE-4681: --- I tested in 3 nodes cluseter and couldn't reproduce this issue. StringIndexOutOfBoundsException parsing Hostname Key: HBASE-4681 URL: https://issues.apache.org/jira/browse/HBASE-4681 Project: HBase Issue Type: Bug Reporter: stack Fix For: 0.92.0 Starting a 0.92 on 0.90 data with 0.90 zk ensemble I got this: {code} 2011-10-26 06:13:53,920 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/master already exists and this is not a retry 2011-10-26 06:13:53,927 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown. java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.lang.String.substring(String.java:1937) at org.apache.hadoop.hbase.ServerName.parseHostname(ServerName.java:81) at org.apache.hadoop.hbase.ServerName.init(ServerName.java:63) at org.apache.hadoop.hbase.master.ActiveMasterManager.blockUntilBecomingActiveMaster(ActiveMasterManager.java:148) at org.apache.hadoop.hbase.master.HMaster.becomeActiveMaster(HMaster.java:346) at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:301) at java.lang.Thread.run(Thread.java:662) 2011-10-26 06:13:53,929 INFO org.apache.hadoop.hbase.master.HMaster: Aborting 2011-10-26 06:13:53,929 DEBUG org.apache.hadoop.hbase.master.HMaster: Stopping service thre {code} I thought this had been fixed. Dig in . -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147458#comment-13147458 ] gaojinchao commented on HBASE-4739: --- I think it's issue about RS. when RS is closing region and throws exception because RS has created a closing zk node. we close again, may not solve the problem. If we have any Rs's logs, it is better. If you don't want to lose data, we should close the RS and split the hlog to recover data. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4654) [replication] Add a check to make sure we don't replicate to ourselves
[ https://issues.apache.org/jira/browse/HBASE-4654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147469#comment-13147469 ] gaojinchao commented on HBASE-4654: --- @J-D Can we fix only in trunk or 0.92? We can use ClusterId to judge whether is a same cluster. [replication] Add a check to make sure we don't replicate to ourselves -- Key: HBASE-4654 URL: https://issues.apache.org/jira/browse/HBASE-4654 Project: HBase Issue Type: Improvement Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Fix For: 0.90.5 It's currently possible to add a peer for replication and point it to the local cluster, which I believe could very well happen for those like us that use only one ZK ensemble per DC so that only the root znode changes when you want to set up replication intra-DC. I don't think comparing just the cluster ID would be enough because you would normally use a different one for another cluster and nothing will block you from pointing elsewhere. Comparing the ZK ensemble address doesn't work either when you have multiple DNS entries that point at the same place. I think this could be resolved by looking up the master address in the relevant znode as it should be exactly the same thing in the case where you have the same cluster. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147515#comment-13147515 ] gaojinchao commented on HBASE-4739: --- In 0.90 version, I think there is no this scenario, The closing zk node is only created by RS. look this code: public void process() { int expectedVersion = FAILED; if (this.zk) { expectedVersion = setClosingState(); if (expectedVersion == FAILED) return; } But in 0.92/trunk version, The problem looks like refactoring code, I think should be:1 1.master create a pending cose state flag 2.RS receives the close call and change zk node to cosing state. 3.When the master restarted , we should handle two state. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13147543#comment-13147543 ] gaojinchao commented on HBASE-4739: --- It seems HBASE-3789 refactored the code Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4511) There is data loss when master failovers
[ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144982#comment-13144982 ] gaojinchao commented on HBASE-4511: --- +1 There is data loss when master failovers Key: HBASE-4511 URL: https://issues.apache.org/jira/browse/HBASE-4511 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.92.0 Reporter: gaojinchao Assignee: stack Priority: Minor Fix For: 0.92.0 Attachments: 4511-v2.txt, 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt It goes like this: Master crashed , at the same time RS with meta is crashing, but RS doesn't eixt. Master startups again and finds all living RS. Master verifies the meta failed, because this RS is crashing. Master reassigns the meta, but it doesn't split the Hlog. So some meta data is loss. About the logs of a failover test case fail. //It said that we want to kill a RS 2011-09-28 19:54:45,694 INFO [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test 2011-09-28 19:54:45,694 INFO [Thread-988] master.TestMasterFailover(1007): RS 192.168.2.102,54385,1317264874629 killed //Rs didn't crash. 2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629 2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error) 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487... //Meta verification failed and ressigned the meta. So all the regions in the meta is loss. 2011-09-28 19:54:51,773 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487... 2011-09-28 19:54:52,277 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487... 2011-09-28 19:54:52,782 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received
[jira] [Commented] (HBASE-4511) There is data loss when master failovers
[ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144604#comment-13144604 ] gaojinchao commented on HBASE-4511: --- I am still a little doubt, If Meta RS is dying RS, How to pass this his name to metaLocation? look this logs, it said that this.metaLocation is null . 2011-09-28 19:54:51,773 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487... 2011-09-28 19:54:52,277 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487... 2011-09-28 19:54:52,782 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state There is data loss when master failovers Key: HBASE-4511 URL: https://issues.apache.org/jira/browse/HBASE-4511 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.92.0 Reporter: gaojinchao Priority: Minor Fix For: 0.92.0 Attachments: 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt It goes like this: Master crashed , at the same time RS with meta is crashing, but RS doesn't eixt. Master startups again and finds all living RS. Master verifies the meta failed, because this RS is crashing. Master reassigns the meta, but it doesn't split the Hlog. So some meta data is loss. About the logs of a failover test case fail. //It said that we want to kill a RS 2011-09-28 19:54:45,694 INFO [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test 2011-09-28 19:54:45,694 INFO [Thread-988] master.TestMasterFailover(1007): RS 192.168.2.102,54385,1317264874629 killed //Rs didn't crash. 2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629 2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error) 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487... //Meta verification failed and ressigned the meta. So all the regions in the meta is loss. 2011-09-28 19:54:51,773 INFO
[jira] [Commented] (HBASE-4749) TestMasterFailover case occasional fails
[ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143990#comment-13143990 ] gaojinchao commented on HBASE-4749: --- It seems a bug for TRUNK. In version 0.90, We kill a RS and at same time start a Master, Master don't add a dying RS to online set. But in version 0.92 We will add a dying RS to online set. This will produce a lot of unusual scenarios: 1. if the root/meta is in a dying RS, we may lose data because don't split Hlog. looks issue: https://issues.apache.org/jira/browse/HBASE-4511. 2.In testMasterFailoverWithMockedRITOnDeadRScase , mocking scenarios will be invalid. look this logs: //we kill this RS(1320357166142 ) 2011-11-03 21:52:56,007 INFO [Thread-986] master.TestMasterFailover(1011): Killing RS juno.apache.org,60001,1320357166142 //we pick up this RS(1320357166142) through zk node. 2011-11-03 21:52:57,356 INFO [Master:0;juno.apache.org,51313,1320357176029] master.HMaster(464): Registering server found up in zk: juno.apache.org,60001,1320357166142 2011-11-03 21:52:57,357 INFO [Master:0;juno.apache.org,51313,1320357176029] master.ServerManager(239): Registering server=juno.apache.org,60001,1320357166142 So I think we should wait until killing RS is shut down and start a new hmaster. TestMasterFailover case occasional fails Key: HBASE-4749 URL: https://issues.apache.org/jira/browse/HBASE-4749 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.92.0 Reporter: gaojinchao Priority: Minor Fix For: 0.92.0 look this logs: https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/ -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4511) There is data loss when master failovers
[ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144523#comment-13144523 ] gaojinchao commented on HBASE-4511: --- Meta table is different from Root table. When Meta RS is dying, we should not use getMetaLocation. I think we should get the sn from root table. There is data loss when master failovers Key: HBASE-4511 URL: https://issues.apache.org/jira/browse/HBASE-4511 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.92.0 Reporter: gaojinchao Priority: Minor Fix For: 0.92.0 Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt It goes like this: Master crashed , at the same time RS with meta is crashing, but RS doesn't eixt. Master startups again and finds all living RS. Master verifies the meta failed, because this RS is crashing. Master reassigns the meta, but it doesn't split the Hlog. So some meta data is loss. About the logs of a failover test case fail. //It said that we want to kill a RS 2011-09-28 19:54:45,694 INFO [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test 2011-09-28 19:54:45,694 INFO [Thread-988] master.TestMasterFailover(1007): RS 192.168.2.102,54385,1317264874629 killed //Rs didn't crash. 2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629 2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error) 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487... //Meta verification failed and ressigned the meta. So all the regions in the meta is loss. 2011-09-28 19:54:51,773 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487... 2011-09-28 19:54:52,277 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487... 2011-09-28 19:54:52,782 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state 2011-09-28 19:54:52,825 DEBUG
[jira] [Commented] (HBASE-4749) TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails
[ https://issues.apache.org/jira/browse/HBASE-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144526#comment-13144526 ] gaojinchao commented on HBASE-4749: --- There is this logs Caused by: java.io.IOException: Too many open files TestMasterFailover#testMasterFailoverWithMockedRITOnDeadRS occasionally fails - Key: HBASE-4749 URL: https://issues.apache.org/jira/browse/HBASE-4749 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.92.0 Reporter: gaojinchao Priority: Critical Fix For: 0.92.0 Attachments: 4749.txt look this logs: https://builds.apache.org/view/G-L/view/HBase/job/HBase-0.92/105/testReport/org.apache.hadoop.hbase.master/TestMasterFailover/testMasterFailoverWithMockedRITOnDeadRS/ -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4511) There is data loss when master failovers
[ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13144598#comment-13144598 ] gaojinchao commented on HBASE-4511: --- @stack Sorry , I am wrong. the patch makes sense. There is data loss when master failovers Key: HBASE-4511 URL: https://issues.apache.org/jira/browse/HBASE-4511 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.92.0 Reporter: gaojinchao Priority: Minor Fix For: 0.92.0 Attachments: 4511.txt, org.apache.hadoop.hbase.master.TestMasterFailover-output.rar, sketch.txt It goes like this: Master crashed , at the same time RS with meta is crashing, but RS doesn't eixt. Master startups again and finds all living RS. Master verifies the meta failed, because this RS is crashing. Master reassigns the meta, but it doesn't split the Hlog. So some meta data is loss. About the logs of a failover test case fail. //It said that we want to kill a RS 2011-09-28 19:54:45,694 INFO [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test 2011-09-28 19:54:45,694 INFO [Thread-988] master.TestMasterFailover(1007): RS 192.168.2.102,54385,1317264874629 killed //Rs didn't crash. 2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629 2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error) 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487... //Meta verification failed and ressigned the meta. So all the regions in the meta is loss. 2011-09-28 19:54:51,773 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487... 2011-09-28 19:54:52,277 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487... 2011-09-28 19:54:52,782 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received
[jira] [Commented] (HBASE-4577) Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB
[ https://issues.apache.org/jira/browse/HBASE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142866#comment-13142866 ] gaojinchao commented on HBASE-4577: --- Yes,I think so. But I am diging. Because I am not familiar with MR and this issue is not very important to release 0.92 version, Please give me some time. Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB - Key: HBASE-4577 URL: https://issues.apache.org/jira/browse/HBASE-4577 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0 Attachments: HBASE-4577_trial_Trunk.patch, HBASE-4577_trunk.patch Minor issue while looking at the RS metrics: bq. numberOfStorefiles=8, storefileUncompressedSizeMB=2418, storefileSizeMB=2420, compressionRatio=1.0008 I guess there's a truncation somewhere when it's adding the numbers up. FWIW there's no compression on that table. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4577) Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB
[ https://issues.apache.org/jira/browse/HBASE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143172#comment-13143172 ] gaojinchao commented on HBASE-4577: --- I didn't find any exception log. It seems the test case has a bug. Processing steps: 1. Creating table with 5 regions 2. Producing data base on 5 regions 3. Changing table to 15 regions 4. Loading data to new table Some cases, the data of 1 region may become the data of 11 regions. but the parameter of hbase.bulkload.retries.number is set to 10,we only try 10 time, some data can't load to the region. If I am wrong, Please correct me! Thanks I think we should change table to 14 rather than 15 regions in this case. Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB - Key: HBASE-4577 URL: https://issues.apache.org/jira/browse/HBASE-4577 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0 Attachments: HBASE-4577_trial_Trunk.patch, HBASE-4577_trunk.patch Minor issue while looking at the RS metrics: bq. numberOfStorefiles=8, storefileUncompressedSizeMB=2418, storefileSizeMB=2420, compressionRatio=1.0008 I guess there's a truncation somewhere when it's adding the numbers up. FWIW there's no compression on that table. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4577) Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB
[ https://issues.apache.org/jira/browse/HBASE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13143174#comment-13143174 ] gaojinchao commented on HBASE-4577: --- look this logs: 2011-11-02 04:23:42,761 ERROR [main] mapreduce.LoadIncrementalHFiles(214): Retry attempted 10 times without completing, bailing out 2011-11-02 04:23:42,761 ERROR [main] mapreduce.LoadIncrementalHFiles(240): - Bulk load aborted with some files not yet loaded: - hdfs://localhost:45622/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/target/test-data/4660a8cd-aaa5-466a-8ff1-5b824cd4553e/testLocalMRIncrementalLoad/info-A/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/TestTable,27.bottom hdfs://localhost:45622/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/target/test-data/4660a8cd-aaa5-466a-8ff1-5b824cd4553e/testLocalMRIncrementalLoad/info-A/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/TestTable,27.top hdfs://localhost:45622/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/target/test-data/4660a8cd-aaa5-466a-8ff1-5b824cd4553e/testLocalMRIncrementalLoad/info-B/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/TestTable,28.bottom hdfs://localhost:45622/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/target/test-data/4660a8cd-aaa5-466a-8ff1-5b824cd4553e/testLocalMRIncrementalLoad/info-B/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/_tmp/TestTable,28.top Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB - Key: HBASE-4577 URL: https://issues.apache.org/jira/browse/HBASE-4577 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0 Attachments: HBASE-4577_trial_Trunk.patch, HBASE-4577_trunk.patch Minor issue while looking at the RS metrics: bq. numberOfStorefiles=8, storefileUncompressedSizeMB=2418, storefileSizeMB=2420, compressionRatio=1.0008 I guess there's a truncation somewhere when it's adding the numbers up. FWIW there's no compression on that table. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4577) Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB
[ https://issues.apache.org/jira/browse/HBASE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142008#comment-13142008 ] gaojinchao commented on HBASE-4577: --- Test failed, it seems not a patch problem. Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB - Key: HBASE-4577 URL: https://issues.apache.org/jira/browse/HBASE-4577 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0 Attachments: HBASE-4577_trial_Trunk.patch, HBASE-4577_trunk.patch Minor issue while looking at the RS metrics: bq. numberOfStorefiles=8, storefileUncompressedSizeMB=2418, storefileSizeMB=2420, compressionRatio=1.0008 I guess there's a truncation somewhere when it's adding the numbers up. FWIW there's no compression on that table. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-4577) Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB
[ https://issues.apache.org/jira/browse/HBASE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13142012#comment-13142012 ] gaojinchao commented on HBASE-4577: --- My local test result: Running org.apache.hadoop.hbase.TestMultiVersions Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 39.045 sec Results : Failed tests: testHBaseFsck(org.apache.hadoop.hbase.util.TestHBaseFsck): expected:0 but was:1 Tests in error: testMasterFailoverWithMockedRITOnDeadRS(org.apache.hadoop.hbase.master.TestMasterFailover): test timed out after 18 milliseconds testEnableTableRoundRobinAssignment(org.apache.hadoop.hbase.client.TestAdmin): org.apache.hadoop.hbase.TableNotEnabledException: testEnableTableAssignment testBadOriginalRootLocation(org.apache.hadoop.hbase.catalog.TestCatalogTrackerOnCluster): unknown host: example.org Tests run: 1073, Failures: 1, Errors: 3, Skipped: 9 Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB - Key: HBASE-4577 URL: https://issues.apache.org/jira/browse/HBASE-4577 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0 Attachments: HBASE-4577_trial_Trunk.patch, HBASE-4577_trunk.patch Minor issue while looking at the RS metrics: bq. numberOfStorefiles=8, storefileUncompressedSizeMB=2418, storefileSizeMB=2420, compressionRatio=1.0008 I guess there's a truncation somewhere when it's adding the numbers up. FWIW there's no compression on that table. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira