[jira] [Updated] (HBASE-5545) Region can't be opened for a long time because creating the file failed.
[ https://issues.apache.org/jira/browse/HBASE-5545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5545: -- Description: Scenario: 1. A file is created. 2. While data is being written, all datanodes may crash, so the write fails. 3. Even though close is called in the finally block, close also fails and throws an exception because the write failed. 4. If the RS then tries to create the same file again, AlreadyBeingCreatedException is thrown. Suggestion to handle this scenario: --- 1. Check for the existence of the file; if it exists, delete it and create a new file. The delete call does not check whether the file is open or closed. Overwrite option: 1. The overwrite option applies only when overwriting a closed file. 2. If the file is not closed, the same AlreadyBeingCreatedException is thrown even with the overwrite option. This is the expected behaviour, to prevent multiple clients from writing to the same file. Region server logs: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/test1/12c01902324218d14b17a5880f24f64b/.tmp/.regioninfo for DFSClient_hb_rs_158-1-131-48,20020,1331107668635_1331107669061_-252463556_25 on client 158.1.132.19 because current leaseholder is trying to recreate file. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1570) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1440) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1382) at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:658) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:547) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1137) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1133) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1131) at org.apache.hadoop.ipc.Client.call(Client.java:961) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:245) at $Proxy6.create(Unknown Source) at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invokeMethod(RPCRetryAndSwitchInvoker.java:201) at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invokeMethod(RPCRetryAndSwitchInvoker.java:189) at com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:77) at $Proxy6.create(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.init(DFSClient.java:3643) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:778) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:364) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:630) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:611) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:518) at org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:424) at
org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:340) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2672) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2658) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:330) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:116) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:158) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) [2012-03-07 20:51:45,858] [WARN ] [RS_OPEN_REGION-158-1-131-48,20020,1331107668635-23] [com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker 131] Retrying the method call: public abstract void org.apache.hadoop.hdfs.protocol.ClientProtocol.create(java.lang.String,org.apache.hadoop.fs.permission.FsPermission,java.lang.String,boolean,boolean,short,long) throws java.io.IOException with arguments of length: 7. The exisiting ActiveServerConnection is: ActiveServerConnectionInfo: Metadata:158-1-131-48/158.1.132.19:9000 Version:145720623220907 [2012-03-07 20:51:45,872] [DEBUG]
[jira] [Updated] (HBASE-5545) Region can't be opened for a long time because creating the file failed.
[ https://issues.apache.org/jira/browse/HBASE-5545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5545: -- Description: Scenario: 1. A file is created. 2. While data is being written, all datanodes may crash, so the write fails. 3. Even though close is called in the finally block, close also fails and throws an exception because the write failed. 4. If the RS then tries to create the same file again, AlreadyBeingCreatedException is thrown. Suggestion to handle this scenario: --- 1. Check for the existence of the file; if it exists, delete it and create a new file. The delete call does not check whether the file is open or closed. Overwrite option: 1. The overwrite option applies only when overwriting a closed file. 2. If the file is not closed, the same AlreadyBeingCreatedException is thrown even with the overwrite option. This is the expected behaviour, to prevent multiple clients from writing to the same file. Region server logs: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/test1/12c01902324218d14b17a5880f24f64b/.tmp/.regioninfo for DFSClient_hb_rs_158-1-131-48,20020,1331107668635_1331107669061_-252463556_25 on client 158.1.132.19 because current leaseholder is trying to recreate file. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1570) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1440) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1382) at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:658) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:547) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1137) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1133) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1131) at org.apache.hadoop.ipc.Client.call(Client.java:961) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:245) at $Proxy6.create(Unknown Source) at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at $Proxy6.create(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.init(DFSClient.java:3643) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:778) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:364) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:630) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:611) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:518) at org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:424) at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:340) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2672) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2658) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:330) at
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:116) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:158) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) [2012-03-07 20:51:45,858] [WARN ] [RS_OPEN_REGION-158-1-131-48,20020,1331107668635-23] [com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker 131] Retrying the method call: public abstract void org.apache.hadoop.hdfs.protocol.ClientProtocol.create(java.lang.String,org.apache.hadoop.fs.permission.FsPermission,java.lang.String,boolean,boolean,short,long) throws java.io.IOException with arguments of length: 7. The exisiting ActiveServerConnection is: ActiveServerConnectionInfo: Metadata:158-1-131-48/158.1.132.19:9000 Version:145720623220907 [2012-03-07 20:51:45,872] [DEBUG] [RS_OPEN_REGION-158-1-131-48,20020,1331107668635-20] [org.apache.hadoop.hbase.zookeeper.ZKAssign 849] regionserver:20020-0x135ec32b39e0002-0x135ec32b39e0002 Successfully transitioned node 91bf3e6f8adb2e4b335f061036353126 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING [2012-03-07 20:51:45,873] [DEBUG]
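The delete-then-recreate suggestion above maps directly onto the plain Hadoop FileSystem API. A minimal sketch, assuming an illustrative path (this is not the actual HBASE-5545 patch):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecreateRegioninfoSketch {
    // Delete any stale file first so a leftover lease from a failed writer
    // cannot trigger AlreadyBeingCreatedException on the re-create.
    static FSDataOutputStream recreate(FileSystem fs, Path path) throws java.io.IOException {
        if (fs.exists(path)) {
            // delete() succeeds whether or not the file is still open for write
            fs.delete(path, false);
        }
        // Plain create; the overwrite flag alone would not help for a still-open file
        return fs.create(path);
    }

    public static void main(String[] args) throws java.io.IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path tmp = new Path("/hbase/table/region/.tmp/.regioninfo"); // illustrative path
        recreate(fs, tmp).close();
    }
}
{code}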
[jira] [Updated] (HBASE-5488) Fixed OfflineMetaRepair bug
[ https://issues.apache.org/jira/browse/HBASE-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5488: -- Fix Version/s: 0.92.1 Status: Patch Available (was: Open) Fixed OfflineMetaRepair bug Key: HBASE-5488 URL: https://issues.apache.org/jira/browse/HBASE-5488 Project: HBase Issue Type: Bug Affects Versions: 0.90.6 Reporter: gaojinchao Assignee: gaojinchao Priority: Minor Fix For: 0.90.7, 0.92.1 Attachments: HBASE-5488-trunk.patch, HBASE-5488_branch90.txt I wanted to use the OfflineMetaRepair tool and found that nobody had fixed this bug, so I will make a patch. 12/01/05 23:23:30 ERROR util.HBaseFsck: Bailed out due to: java.lang.IllegalArgumentException: Wrong FS: hdfs:// us01-ciqps1-name01.carrieriq.com:9000/hbase/M2M-INTEGRATION-MM_TION-13 25190318714/0003d2ede27668737e192d8430dbe5d0/.regioninfo, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:352) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:368) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:126) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398) at org.apache.hadoop.hbase.util.HBaseFsck.loadMetaEntry(HBaseFsck.java:256) at org.apache.hadoop.hbase.util.HBaseFsck.loadTableInfo(HBaseFsck.java:284) at org.apache.hadoop.hbase.util.HBaseFsck.rebuildMeta(HBaseFsck.java:402) at org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair.main(OfflineMetaRe
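The "Wrong FS ... expected: file:///" error above is the classic symptom of resolving an HDFS Path against the default (local) FileSystem. A hedged sketch of the usual fix pattern (the attached patches may differ in detail):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WrongFsSketch {
    public static void main(String[] args) throws java.io.IOException {
        Configuration conf = new Configuration();
        // Illustrative fully-qualified HDFS path, like the .regioninfo above
        Path regionInfo = new Path("hdfs://namenode:9000/hbase/table/region/.regioninfo");

        // Buggy pattern: FileSystem.get(conf) can return the local FS when the
        // default FS is unset, after which open(regionInfo) throws
        // IllegalArgumentException: Wrong FS ... expected: file:///

        // Fix pattern: derive the FileSystem from the Path itself
        FileSystem fs = regionInfo.getFileSystem(conf);
        FSDataInputStream in = fs.open(regionInfo);
        in.close();
    }
}
{code}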
[jira] [Updated] (HBASE-5488) Fixed OfflineMetaRepair bug
[ https://issues.apache.org/jira/browse/HBASE-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5488: -- Attachment: HBASE-5488-branch92.patch Fixed OfflineMetaRepair bug Key: HBASE-5488 URL: https://issues.apache.org/jira/browse/HBASE-5488 Project: HBase Issue Type: Bug Affects Versions: 0.90.6 Reporter: gaojinchao Assignee: gaojinchao Priority: Minor Fix For: 0.90.7, 0.92.1 Attachments: HBASE-5488-branch92.patch, HBASE-5488-trunk.patch, HBASE-5488_branch90.txt I wanted to use the OfflineMetaRepair tool and found that nobody had fixed this bug, so I will make a patch. 12/01/05 23:23:30 ERROR util.HBaseFsck: Bailed out due to: java.lang.IllegalArgumentException: Wrong FS: hdfs:// us01-ciqps1-name01.carrieriq.com:9000/hbase/M2M-INTEGRATION-MM_TION-13 25190318714/0003d2ede27668737e192d8430dbe5d0/.regioninfo, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:352) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:368) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:126) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398) at org.apache.hadoop.hbase.util.HBaseFsck.loadMetaEntry(HBaseFsck.java:256) at org.apache.hadoop.hbase.util.HBaseFsck.loadTableInfo(HBaseFsck.java:284) at org.apache.hadoop.hbase.util.HBaseFsck.rebuildMeta(HBaseFsck.java:402) at org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair.main(OfflineMetaRe
[jira] [Updated] (HBASE-5488) Fixed OfflineMetaRepair bug
[ https://issues.apache.org/jira/browse/HBASE-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5488: -- Attachment: HBASE-5488_branch90.txt Fixed OfflineMetaRepair bug Key: HBASE-5488 URL: https://issues.apache.org/jira/browse/HBASE-5488 Project: HBase Issue Type: Bug Affects Versions: 0.90.6 Reporter: gaojinchao Assignee: gaojinchao Priority: Minor Fix For: 0.90.7 Attachments: HBASE-5488_branch90.txt I wanted to use the OfflineMetaRepair tool and found that nobody had fixed this bug, so I will make a patch. 12/01/05 23:23:30 ERROR util.HBaseFsck: Bailed out due to: java.lang.IllegalArgumentException: Wrong FS: hdfs:// us01-ciqps1-name01.carrieriq.com:9000/hbase/M2M-INTEGRATION-MM_TION-13 25190318714/0003d2ede27668737e192d8430dbe5d0/.regioninfo, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:352) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:368) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:126) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398) at org.apache.hadoop.hbase.util.HBaseFsck.loadMetaEntry(HBaseFsck.java:256) at org.apache.hadoop.hbase.util.HBaseFsck.loadTableInfo(HBaseFsck.java:284) at org.apache.hadoop.hbase.util.HBaseFsck.rebuildMeta(HBaseFsck.java:402) at org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair.main(OfflineMetaRe
[jira] [Updated] (HBASE-5379) Backport HBASE-4287 to 0.90 - If region opening fails, try to transition region back to offline in ZK
[ https://issues.apache.org/jira/browse/HBASE-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5379: -- Attachment: HBASE-4287.text Backport HBASE-4287 to 0.90 - If region opening fails, try to transition region back to offline in ZK --- Key: HBASE-5379 URL: https://issues.apache.org/jira/browse/HBASE-5379 Project: HBase Issue Type: Bug Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.90.7 Attachments: HBASE-4287.text This issue is needed in 0.90 as well; otherwise, if region assignment fails, we have to wait 30 minutes.
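For context, the backported behaviour can be pictured with the sketch below. This is an assumption drawn from the issue title, not the attached HBASE-4287.text: on a failed open, the region server pushes the znode back to OFFLINE so the master can reassign immediately instead of waiting out the timeout monitor.
{code:java}
// Sketch only; ZKAssign.createOrForceNodeOffline is the 0.90-era API,
// and the server name string here is purely hypothetical.
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.zookeeper.ZKAssign;
import org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher;

class OpenRegionSketch {
    void openWithRollback(ZooKeeperWatcher zkw, HRegionInfo region) {
        try {
            // ... transition znode to OPENING and open the region ...
        } catch (Exception openFailed) {
            try {
                // On failure, force the znode back to OFFLINE so the master
                // retries assignment right away rather than after ~30 minutes.
                ZKAssign.createOrForceNodeOffline(zkw, region, "rs-server-name"); // hypothetical
            } catch (Exception zkFailed) {
                // If even this fails, the timeout monitor remains the backstop.
            }
        }
    }
}
{code}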
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5179: -- Attachment: Errorlog When the region server checked in, I killed the meta/root region. I found that splitHLog happens after meta is assigned, so I think there may be another problem; I will continue to dig tomorrow. Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 5179-90v12.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, hbase-5179v12.patch, hbase-5179v5.patch, hbase-5179v6.patch, hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch If the master's failover processing and the ServerShutdownHandler's processing happen concurrently, the following case may occur: 1. The master completes splitLogAfterStartup(). 2. RegionserverA restarts, and the ServerShutdownHandler starts processing it. 3. The master starts rebuildUserRegions, and RegionserverA is considered a dead server. 4. The master starts to assign RegionserverA's regions because step 3 marked it as dead. However, while step 4 (region assignment) is running, the ServerShutdownHandler may still be splitting the log; therefore, data loss may occur.
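Independent of the specific patch versions attached, the fix direction is an ordering guarantee: a dead server's regions must not be assigned until its log splitting finishes. A self-contained sketch of that guard (illustrative Java, not HBase code):
{code:java}
import java.util.concurrent.CountDownLatch;

// Minimal sketch of the ordering the fix enforces: region assignment
// must wait until WAL splitting for the dead server has completed.
public class SplitBeforeAssign {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch logSplitDone = new CountDownLatch(1);

        Thread shutdownHandler = new Thread(() -> {
            System.out.println("ServerShutdownHandler: splitting logs of dead RS...");
            // ... replay/split edits out of the dead server's logs ...
            logSplitDone.countDown();
        });

        Thread masterFailover = new Thread(() -> {
            try {
                logSplitDone.await(); // the guard: no assignment before split completes
                System.out.println("Master: assigning regions (logs already split)");
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        });

        shutdownHandler.start();
        masterFailover.start();
        shutdownHandler.join();
        masterFailover.join();
    }
}
{code}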
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5179: -- Attachment: 5179-90v4.patch Concurrent processing of processFailover and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Priority: Critical Fix For: 0.92.0, 0.94.0, 0.90.6 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch If the master's failover processing and the ServerShutdownHandler's processing happen concurrently, the following case may occur: 1. The master completes splitLogAfterStartup(). 2. RegionserverA restarts, and the ServerShutdownHandler starts processing it. 3. The master starts rebuildUserRegions, and RegionserverA is considered a dead server. 4. The master starts to assign RegionserverA's regions because step 3 marked it as dead. However, while step 4 (region assignment) is running, the ServerShutdownHandler may still be splitting the log; therefore, data loss may occur.
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5179: -- Attachment: 5179-90v3.patch Please review it first; I will verify it tomorrow. Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch If the master's failover processing and the ServerShutdownHandler's processing happen concurrently, the following case may occur: 1. The master completes splitLogAfterStartup(). 2. RegionserverA restarts, and the ServerShutdownHandler starts processing it. 3. The master starts rebuildUserRegions, and RegionserverA is considered a dead server. 4. The master starts to assign RegionserverA's regions because step 3 marked it as dead. However, while step 4 (region assignment) is running, the ServerShutdownHandler may still be splitting the log; therefore, data loss may occur.
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5179: -- Attachment: (was: 5179-90v3.patch) Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: 5179-90.txt, 5179-90v2.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch If the master's failover processing and the ServerShutdownHandler's processing happen concurrently, the following case may occur: 1. The master completes splitLogAfterStartup(). 2. RegionserverA restarts, and the ServerShutdownHandler starts processing it. 3. The master starts rebuildUserRegions, and RegionserverA is considered a dead server. 4. The master starts to assign RegionserverA's regions because step 3 marked it as dead. However, while step 4 (region assignment) is running, the ServerShutdownHandler may still be splitting the log; therefore, data loss may occur.
[jira] [Updated] (HBASE-5179) Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss
[ https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5179: -- Attachment: 5179-90v3.patch Concurrent processing of processFailover and ServerShutdownHandler may cause a region to be assigned before log splitting completes, which would cause data loss --- Key: HBASE-5179 URL: https://issues.apache.org/jira/browse/HBASE-5179 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.2 Reporter: chunhui shen Assignee: chunhui shen Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch If the master's failover processing and the ServerShutdownHandler's processing happen concurrently, the following case may occur: 1. The master completes splitLogAfterStartup(). 2. RegionserverA restarts, and the ServerShutdownHandler starts processing it. 3. The master starts rebuildUserRegions, and RegionserverA is considered a dead server. 4. The master starts to assign RegionserverA's regions because step 3 marked it as dead. However, while step 4 (region assignment) is running, the ServerShutdownHandler may still be splitting the log; therefore, data loss may occur.
[jira] [Updated] (HBASE-5178) Backport HBASE-4101 - Regionserver Deadlock
[ https://issues.apache.org/jira/browse/HBASE-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5178: -- Attachment: HBASE-4101_0.90_1.patch Backport HBASE-4101 - Regionserver Deadlock --- Key: HBASE-5178 URL: https://issues.apache.org/jira/browse/HBASE-5178 Project: HBase Issue Type: Bug Reporter: ramkrishna.s.vasudevan Attachments: HBASE-4101_0.90_1.patch Critical issue not merged to 0.90.
[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5120: -- Attachment: HBASE-5120.patch Patch is attached so that I can access it at home. It is not the final one and has not been fully tested in a cluster. Timeout monitor races with table disable handler Key: HBASE-5120 URL: https://issues.apache.org/jira/browse/HBASE-5120 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Zhihong Yu Priority: Blocker Attachments: HBASE-5120.patch Here is what J-D described here: https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176 I think I will retract from my statement that it used to be extremely racy and caused more troubles than it fixed, on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it. First the region gets closed {quote} 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. {quote} 2 minutes later it times out: {quote} 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining) {quote} 100ms later the master finally gets the event: {quote} 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED {quote} At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign?
{quote} 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791 {quote} Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on. The late ZK notification that the znode was deleted (but it got recreated after): {quote} 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted. {quote} Then it prints this, and much later tries to unassign it again: {quote} 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null ... 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null 2012-01-04
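One way to picture the fix this race calls for (a sketch under assumptions, not the attached HBASE-5120.patch): the timeout monitor should re-check the region's in-transition state under the same lock the close-event handler uses before forcing another unassign, so it cannot recreate the znode for a region that was already processed as closed.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Self-contained sketch (not HBase code) of the re-check-under-lock idea.
class TimeoutMonitorSketch {
    private final Map<String, String> regionsInTransition = new HashMap<>();

    void onRegionClosed(String region) {
        synchronized (regionsInTransition) {
            regionsInTransition.remove(region); // ClosedRegionHandler path
        }
    }

    void onTimeout(String region) {
        synchronized (regionsInTransition) {
            if (!regionsInTransition.containsKey(region)) {
                return; // already handled; do NOT recreate the znode
            }
            // ... force unassign while still holding the RIT lock ...
        }
    }
}
{code}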
[jira] [Updated] (HBASE-5060) HBase client is blocked forever
[ https://issues.apache.org/jira/browse/HBASE-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5060: -- Attachment: HBASE-5060_trunk.patch Patch for trunk HBase client is blocked forever --- Key: HBASE-5060 URL: https://issues.apache.org/jira/browse/HBASE-5060 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Critical Fix For: 0.90.6 Attachments: HBASE-5060_Branch90trial.patch, HBASE-5060_trunk.patch The client had a temporary network failure; after it recovered, I found my client thread was blocked. The stack and logs below show that we are using an invalid CatalogTracker in the tableExists function. Blocked stack: WriteHbaseThread33 prio=10 tid=0x7f76bc27a800 nid=0x2540 in Object.wait() [0x7f76af4f3000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331) - locked 0x7f7a67817c98 (a java.util.concurrent.atomic.AtomicBoolean) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366) at org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:427) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:164) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) - locked 0x7f7a4c5dc578 (a com.huawei.hdi.hbase.HbaseReOper) at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source) at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source) In ZooKeeperNodeTracker, we don't throw the KeeperException to the higher level, so at the CatalogTracker level we think the ZooKeeperNodeTracker started successfully and continue processing.
[WriteHbaseThread33]2011-12-16 17:07:33,153[WARN ] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Unable to get data of znode /hbase/root-region-server | org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:557) org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549) at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73) at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136) at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source) at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source) [WriteHbaseThread33]2011-12-16 17:07:33,361[ERROR] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Received unexpected KeeperException, re-throwing exception | org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.keeperException(ZooKeeperWatcher.java:385) org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549) at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73) at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136) at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) at
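A minimal sketch of the fix direction stated above (an assumption, not the attached patch): let the tracker's start() propagate KeeperException instead of swallowing it, so a caller such as CatalogTracker can fail fast rather than wait forever on meta.
{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Illustrative node tracker, not the real ZooKeeperNodeTracker class.
class NodeTracker {
    private final ZooKeeper zk;
    private final String node;
    private byte[] data;

    NodeTracker(ZooKeeper zk, String node) {
        this.zk = zk;
        this.node = node;
    }

    void start() throws KeeperException, InterruptedException {
        try {
            data = zk.getData(node, true, null); // also sets a watch
        } catch (KeeperException e) {
            // Before the fix this was only logged; rethrowing lets the caller
            // abort instead of blocking on a tracker that never started.
            throw e;
        }
    }
}
{code}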
[jira] [Updated] (HBASE-5060) HBase client is blocked forever
[ https://issues.apache.org/jira/browse/HBASE-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5060: -- Fix Version/s: 0.92.1 Status: Patch Available (was: Open) HBase client is blocked forever --- Key: HBASE-5060 URL: https://issues.apache.org/jira/browse/HBASE-5060 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Critical Fix For: 0.92.1, 0.90.6 Attachments: HBASE-5060_Branch90trial.patch, HBASE-5060_trunk.patch The client had a temporary network failure; after it recovered, I found my client thread was blocked. The stack and logs below show that we are using an invalid CatalogTracker in the tableExists function. Blocked stack: WriteHbaseThread33 prio=10 tid=0x7f76bc27a800 nid=0x2540 in Object.wait() [0x7f76af4f3000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331) - locked 0x7f7a67817c98 (a java.util.concurrent.atomic.AtomicBoolean) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366) at org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:427) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:164) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) - locked 0x7f7a4c5dc578 (a com.huawei.hdi.hbase.HbaseReOper) at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source) at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source) In ZooKeeperNodeTracker, we don't throw the KeeperException to the higher level, so at the CatalogTracker level we think the ZooKeeperNodeTracker started successfully and continue processing.
[WriteHbaseThread33]2011-12-16 17:07:33,153[WARN ] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Unable to get data of znode /hbase/root-region-server | org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:557) org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549) at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73) at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136) at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source) at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source) [WriteHbaseThread33]2011-12-16 17:07:33,361[ERROR] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Received unexpected KeeperException, re-throwing exception | org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.keeperException(ZooKeeperWatcher.java:385) org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549) at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73) at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136) at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) at
[jira] [Updated] (HBASE-5060) HBase client is blocked forever
[ https://issues.apache.org/jira/browse/HBASE-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5060: -- Attachment: HBASE-5060_Branch90trial.patch 1. This issue is difficult to fix, so I made a trial version. 2. I have checked all uses of ZooKeeperNodeTracker, and they seem fine. HBase client is blocked forever --- Key: HBASE-5060 URL: https://issues.apache.org/jira/browse/HBASE-5060 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Critical Fix For: 0.90.6 Attachments: HBASE-5060_Branch90trial.patch The client had a temporary network failure; after it recovered, I found my client thread was blocked. The stack and logs below show that we are using an invalid CatalogTracker in the tableExists function. Blocked stack: WriteHbaseThread33 prio=10 tid=0x7f76bc27a800 nid=0x2540 in Object.wait() [0x7f76af4f3000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331) - locked 0x7f7a67817c98 (a java.util.concurrent.atomic.AtomicBoolean) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366) at org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:427) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:164) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) - locked 0x7f7a4c5dc578 (a com.huawei.hdi.hbase.HbaseReOper) at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source) at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source) In ZooKeeperNodeTracker, we don't throw the KeeperException to the higher level, so at the CatalogTracker level we think the ZooKeeperNodeTracker started successfully and continue processing.
[WriteHbaseThread33]2011-12-16 17:07:33,153[WARN ] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Unable to get data of znode /hbase/root-region-server | org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:557) org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549) at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73) at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136) at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source) at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source) [WriteHbaseThread33]2011-12-16 17:07:33,361[ERROR] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Received unexpected KeeperException, re-throwing exception | org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.keeperException(ZooKeeperWatcher.java:385) org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549) at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73) at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136) at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at
[jira] [Updated] (HBASE-5060) HBase client is blocked forever
[ https://issues.apache.org/jira/browse/HBASE-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5060: -- Fix Version/s: (was: 0.90.5) HBase client is blocked forever --- Key: HBASE-5060 URL: https://issues.apache.org/jira/browse/HBASE-5060 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Priority: Critical Fix For: 0.90.6 The client had a temporary network failure; after it recovered, I found my client thread was blocked. The stack and logs below show that we are using an invalid CatalogTracker in the tableExists function. Blocked stack: WriteHbaseThread33 prio=10 tid=0x7f76bc27a800 nid=0x2540 in Object.wait() [0x7f76af4f3000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331) - locked 0x7f7a67817c98 (a java.util.concurrent.atomic.AtomicBoolean) at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366) at org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:427) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:164) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) - locked 0x7f7a4c5dc578 (a com.huawei.hdi.hbase.HbaseReOper) at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source) at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source) In ZooKeeperNodeTracker, we don't throw the KeeperException to the higher level, so at the CatalogTracker level we think the ZooKeeperNodeTracker started successfully and continue processing. [WriteHbaseThread33]2011-12-16 17:07:33,153[WARN ] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Unable to get data of znode /hbase/root-region-server | org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:557) org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549) at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73) at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136) at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source) at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source) [WriteHbaseThread33]2011-12-16 17:07:33,361[ERROR] | hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Received unexpected KeeperException, re-throwing exception | org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.keeperException(ZooKeeperWatcher.java:385) org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at
org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931) at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549) at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73) at org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136) at org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111) at org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162) at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown Source) at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source) at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source) at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source) [WriteHbaseThread33]2011-12-16
[jira] [Updated] (HBASE-4970) Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch)
[ https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4970: -- Attachment: HBASE-4970_Trunk_V2.patch Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch) --- Key: HBASE-4970 URL: https://issues.apache.org/jira/browse/HBASE-4970 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Trivial Fix For: 0.90.6 Attachments: HBASE-4970_Branch90.patch, HBASE-4970_Branch90_V1_trial.patch, HBASE-4970_Branch90_V2.patch, HBASE-4970_Branch92_V2.patch, HBASE-4970_Trunk_V2.patch In my cluster, I changed keepAliveTime from 60 s to 3600 s, and the growth of RES slowed down. Why does increasing the keepAliveTime of the HBase thread pool slow down our problem occurrence [the RES value increase]? Look at the source of sun.nio.ch.Util: every thread holds 3 soft references to direct buffers (mustangsrc) for reuse, and the code calls these 3 soft references the bufferCache. If all buffers are occupied, or none is of a suitable size, and a new request comes in, a new direct buffer is allocated. After the request is serviced, the bigger buffer replaces the smaller one in the bufferCache, and the replaced buffer is released. So I think we can add a parameter to change the keepAliveTime of the HTable thread pool.
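The proposal reduces to making the pool's keep-alive time configurable. A self-contained sketch (the property name below is hypothetical; the real configuration key is defined by HBASE-4805):
{code:java}
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class KeepAlivePoolSketch {
    public static void main(String[] args) {
        // 3600 s mirrors the experiment described above; 60 s is the old
        // hard-coded default. "htable.keepalive.seconds" is a made-up property.
        long keepAliveSeconds = Long.getLong("htable.keepalive.seconds", 3600);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 10, keepAliveSeconds, TimeUnit.SECONDS,
            new SynchronousQueue<Runnable>());
        pool.allowCoreThreadTimeOut(true); // even core threads expire after keep-alive
        pool.submit(() -> System.out.println("work"));
        pool.shutdown();
    }
}
{code}
The longer the keep-alive, the less often idle worker threads die and take their per-thread sun.nio.ch.Util buffer caches with them, which is why RES growth slowed in the experiment.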
[jira] [Updated] (HBASE-4970) Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch)
[ https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4970: -- Attachment: HBASE-4970_Branch92_V2.patch Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch) --- Key: HBASE-4970 URL: https://issues.apache.org/jira/browse/HBASE-4970 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Trivial Fix For: 0.90.6 Attachments: HBASE-4970_Branch90.patch, HBASE-4970_Branch90_V1_trial.patch, HBASE-4970_Branch90_V2.patch, HBASE-4970_Branch92_V2.patch, HBASE-4970_Trunk_V2.patch In my cluster, I changed keepAliveTime from 60 s to 3600 s, and the growth of RES slowed down. Why does increasing the keepAliveTime of the HBase thread pool slow down our problem occurrence [the RES value increase]? Look at the source of sun.nio.ch.Util: every thread holds 3 soft references to direct buffers (mustangsrc) for reuse, and the code calls these 3 soft references the bufferCache. If all buffers are occupied, or none is of a suitable size, and a new request comes in, a new direct buffer is allocated. After the request is serviced, the bigger buffer replaces the smaller one in the bufferCache, and the replaced buffer is released. So I think we can add a parameter to change the keepAliveTime of the HTable thread pool.
[jira] [Updated] (HBASE-5008) The cluster can't provide services because a region can't flush.
[ https://issues.apache.org/jira/browse/HBASE-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-5008: -- Attachment: HBASE-5008_Branch90.patch I made a patch; please review. The cluster can't provide services because a region can't flush. Key: HBASE-5008 URL: https://issues.apache.org/jira/browse/HBASE-5008 Project: HBase Issue Type: Bug Components: regionserver Reporter: gaojinchao Priority: Blocker Fix For: 0.90.6 Attachments: HBASE-5008_Branch90.patch HBase version 0.90.4 + patches. My analysis is as follows: // Started splitting region b24d8ccb852ff742f2a27d01b7f5853e and closed the region 2011-12-10 17:32:48,653 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e. 2011-12-10 17:32:49,759 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: disabling compactions flushes 2011-12-10 17:32:49,759 INFO org.apache.hadoop.hbase.regionserver.HRegion: Running close preflush of Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e. // Processed a flush request and skipped it, but flushRequested had been set to true 2011-12-10 17:33:06,963 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e., current region memstore size 12.6m 2011-12-10 17:33:17,277 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Skipping flush on Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e. because closing // Splitting region b24d8ccb852ff742f2a27d01b7f5853e failed and was rolled back; the flushRequested flag remained true, so all handlers became blocked 2011-12-10 17:34:01,293 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Cleaned up old failed split transaction detritus: hdfs://193.195.18.121:9000/hbase/Htable_UFDR_004/b24d8ccb852ff742f2a27d01b7f5853e/splits 2011-12-10 17:34:01,294 INFO org.apache.hadoop.hbase.regionserver.HRegion: Onlined Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.; next sequenceid=15494173 2011-12-10 17:34:01,295 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Successful rollback of failed split of Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e. 2011-12-10 17:43:10,147 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 19 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is >= than blocking 384.0m size // All handlers were blocked; the cluster could not provide services 2011-12-10 17:34:01,295 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Successful rollback of failed split of Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.
2011-12-10 17:43:10,147 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 19 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is >= than blocking 384.0m size 2011-12-10 17:43:10,192 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 34 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is >= than blocking 384.0m size 2011-12-10 17:43:10,193 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 51 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is >= than blocking 384.0m size 2011-12-10 17:43:10,196 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 85 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is >= than blocking 384.0m size 2011-12-10 17:43:10,199 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 88 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is >= than blocking 384.0m size 2011-12-10 17:43:10,202 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 44 on 20020' on region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: memstore size 384.0m is >= than blocking 384.0m size 2011-12-10 17:43:11,663 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 2 on 20020' on region
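The analysis above is a stuck-flag bug: a flush request skipped while the region is closing leaves the "flush requested" flag set, so after the split rolls back no flush is ever queued and writers block at the memstore limit. A self-contained sketch of the state machine and the reset the patch presumably adds (illustrative, not HBase code):
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

class RegionSketch {
    private final AtomicBoolean flushRequested = new AtomicBoolean(false);
    private volatile boolean closing = false;

    void requestFlush() {
        if (flushRequested.compareAndSet(false, true)) {
            if (closing) {
                // Flush skipped -- nothing will run, so the flag must not stay set.
                // Forgetting this reset is the failure mode seen in the logs above.
                flushRequested.set(false);
            } else {
                // ... enqueue a real flush; the flusher clears the flag when done ...
            }
        }
    }

    void rollbackSplit() {
        closing = false;
        flushRequested.set(false); // clear stale state when the region comes back online
    }
}
{code}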
[jira] [Updated] (HBASE-4970) Add a parameter to change keepAliveTime of Htable thread pool.
[ https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4970: -- Attachment: HBASE-4970_Branch90.patch Add a parameter to change keepAliveTime of Htable thread pool. --- Key: HBASE-4970 URL: https://issues.apache.org/jira/browse/HBASE-4970 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Trivial Fix For: 0.90.5 Attachments: HBASE-4970_Branch90.patch In my cluster, I changed keepAliveTime from 60 s to 3600 s, and the increase in RES slowed down. Why does increasing the keepAliveTime of the HBase thread pool slow down our problem occurrence [RES value increase]? You can go through the source of sun.nio.ch.Util: every thread holds 3 soft references to direct buffers (mustangsrc) for reuse. The code names these 3 soft references the bufferCache. If all the cached buffers are occupied, or none is suitable in size, and a new request comes, a new direct buffer is allocated. After the service, the bigger buffer replaces the smaller one in the bufferCache, and the replaced one is released. So I think we can add a parameter to change the keepAliveTime of the HTable thread pool. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
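A sketch of how such a parameter might be wired in. The configuration key name below is an assumption for illustration (not necessarily what the patch uses); only the 60-second default reflects the behaviour described above:

{code:java}
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

// Hypothetical: read keepAliveTime from configuration instead of hard-coding
// 60 s, so idle HTable pool threads (and the direct buffers their
// sun.nio.ch.Util buffer caches pin) are released on an operator-chosen schedule.
public final class HTablePoolFactory {
  public static ThreadPoolExecutor createPool(Configuration conf, int maxThreads) {
    long keepAliveSecs = conf.getLong("hbase.htable.threads.keepalivetime", 60); // key assumed
    ThreadPoolExecutor pool = new ThreadPoolExecutor(1, maxThreads,
        keepAliveSecs, TimeUnit.SECONDS, new SynchronousQueue<Runnable>());
    pool.allowCoreThreadTimeOut(true); // let even the core thread expire when idle
    return pool;
  }
}
{code}

With a large keepAliveTime, threads (and their cached direct buffers) are reused instead of being churned, which matches the slower RES growth observed.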
[jira] [Updated] (HBASE-4970) Add a parameter to change keepAliveTime of Htable thread pool.
[ https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4970: -- Attachment: HBASE-4970_Branch90_V1_trial.patch Add a parameter to change keepAliveTime of Htable thread pool. --- Key: HBASE-4970 URL: https://issues.apache.org/jira/browse/HBASE-4970 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Trivial Fix For: 0.90.5 Attachments: HBASE-4970_Branch90.patch, HBASE-4970_Branch90_V1_trial.patch In my cluster, I changed keepAliveTime from 60 s to 3600 s, and the increase in RES slowed down. Why does increasing the keepAliveTime of the HBase thread pool slow down our problem occurrence [RES value increase]? You can go through the source of sun.nio.ch.Util: every thread holds 3 soft references to direct buffers (mustangsrc) for reuse. The code names these 3 soft references the bufferCache. If all the cached buffers are occupied, or none is suitable in size, and a new request comes, a new direct buffer is allocated. After the service, the bigger buffer replaces the smaller one in the bufferCache, and the replaced one is released. So I think we can add a parameter to change the keepAliveTime of the HTable thread pool. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4970) Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch)
[ https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4970: -- Summary: Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch) (was: Add a parameter to change keepAliveTime of Htable thread pool.) Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch) --- Key: HBASE-4970 URL: https://issues.apache.org/jira/browse/HBASE-4970 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Trivial Fix For: 0.90.5 Attachments: HBASE-4970_Branch90.patch, HBASE-4970_Branch90_V1_trial.patch In my cluster, I changed keepAliveTime from 60 s to 3600 s, and the increase in RES slowed down. Why does increasing the keepAliveTime of the HBase thread pool slow down our problem occurrence [RES value increase]? You can go through the source of sun.nio.ch.Util: every thread holds 3 soft references to direct buffers (mustangsrc) for reuse. The code names these 3 soft references the bufferCache. If all the cached buffers are occupied, or none is suitable in size, and a new request comes, a new direct buffer is allocated. After the service, the bigger buffer replaces the smaller one in the bufferCache, and the replaced one is released. So I think we can add a parameter to change the keepAliveTime of the HTable thread pool. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4970) Add a parameter to change keepAliveTime of Htable thread pool.
[ https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4970: -- Affects Version/s: 0.90.4 Fix Version/s: 0.90.5 Add a parameter to change keepAliveTime of Htable thread pool. --- Key: HBASE-4970 URL: https://issues.apache.org/jira/browse/HBASE-4970 Project: HBase Issue Type: Improvement Components: client Affects Versions: 0.90.4 Reporter: gaojinchao Assignee: gaojinchao Priority: Trivial Fix For: 0.90.5 In my cluster, I changed keepAliveTime from 60 s to 3600 s, and the increase in RES slowed down. Why does increasing the keepAliveTime of the HBase thread pool slow down our problem occurrence [RES value increase]? You can go through the source of sun.nio.ch.Util: every thread holds 3 soft references to direct buffers (mustangsrc) for reuse. The code names these 3 soft references the bufferCache. If all the cached buffers are occupied, or none is suitable in size, and a new request comes, a new direct buffer is allocated. After the service, the bigger buffer replaces the smaller one in the bufferCache, and the replaced one is released. So I think we can add a parameter to change the keepAliveTime of the HTable thread pool. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4633) Potential memory leak in client RPC timeout mechanism
[ https://issues.apache.org/jira/browse/HBASE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4633: -- Attachment: HBaseclientstack.png I attached the stack picture. Potential memory leak in client RPC timeout mechanism - Key: HBASE-4633 URL: https://issues.apache.org/jira/browse/HBASE-4633 Project: HBase Issue Type: Bug Components: client Affects Versions: 0.90.3 Environment: HBase version: 0.90.3 + Patches, Hadoop version: CDH3u0 Reporter: Shrijeet Paliwal Attachments: HBaseclientstack.png Relevant Jiras: https://issues.apache.org/jira/browse/HBASE-2937, https://issues.apache.org/jira/browse/HBASE-4003 We have been using the 'hbase.client.operation.timeout' knob introduced in 2937 for quite some time now. It helps us enforce SLAs. We have two HBase clusters and two HBase client clusters. One of them is much busier than the other. We have seen a deterministic behavior of clients running in the busy cluster: their (clients') memory footprint increases consistently after they have been up for roughly 24 hours. This memory footprint almost doubles from its usual value (usual case == RPC timeout disabled). After much investigation nothing concrete came out, and we had to put in a hack which keeps heap size under control even when RPC timeout is enabled. Also note, the same behavior is not observed in the 'not so busy' cluster. The patch is here: https://gist.github.com/1288023 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
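The report does not pin down the leak, but the general hazard with a client-side timeout is pending-call bookkeeping that is only cleaned up when a response arrives: if the caller gives up on a timed-out call while its entry stays in the connection's call table, the buffers it references cannot be collected. A hedged illustration of the purge pattern that avoids this (all names are invented for the sketch; this is not the actual HBaseClient code or the linked gist):

{code:java}
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: evict calls that have outlived the RPC timeout so the
// map does not pin request/response buffers after the caller has given up.
class PendingCalls {
  static final class Call {
    final long startMillis = System.currentTimeMillis();
    volatile boolean done;
  }

  private final Map<Integer, Call> calls = new ConcurrentHashMap<Integer, Call>();

  void purgeTimedOut(long rpcTimeoutMillis) {
    long now = System.currentTimeMillis();
    for (Iterator<Map.Entry<Integer, Call>> it = calls.entrySet().iterator(); it.hasNext();) {
      Map.Entry<Integer, Call> e = it.next();
      if (!e.getValue().done && now - e.getValue().startMillis > rpcTimeoutMillis) {
        it.remove(); // the caller already saw a timeout; drop our reference too
      }
    }
  }
}
{code}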
[jira] [Updated] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails
[ https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4868: -- Attachment: HBASE-4868_trial.patch testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails - Key: HBASE-4868 URL: https://issues.apache.org/jira/browse/HBASE-4868 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.92.0 Reporter: gaojinchao Priority: Minor Fix For: 0.94.0 Attachments: HBASE-4868_trial.patch See: https://builds.apache.org/job/HBase-TRUNK-security/7/testReport/org.apache.hadoop.hbase.util.hbck/TestOfflineMetaRebuildBase/testMetaRebuild/ Please review and see whether the approach makes sense. If it does, I will check the other cases. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails
[ https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4868: -- Attachment: HBASE-4868_trunkv2.patch Addressed the comments. testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails - Key: HBASE-4868 URL: https://issues.apache.org/jira/browse/HBASE-4868 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.92.0 Reporter: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0 Attachments: HBASE-4868_trial.patch, HBASE-4868_trunkv2.patch See: https://builds.apache.org/job/HBase-TRUNK-security/7/testReport/org.apache.hadoop.hbase.util.hbck/TestOfflineMetaRebuildBase/testMetaRebuild/ Please review and see whether the approach makes sense. If it does, I will check the other cases. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails
[ https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4868: -- Status: Open (was: Patch Available) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails - Key: HBASE-4868 URL: https://issues.apache.org/jira/browse/HBASE-4868 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.92.0 Reporter: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0 Attachments: HBASE-4868_trial.patch, HBASE-4868_trunkv2.patch See: https://builds.apache.org/job/HBase-TRUNK-security/7/testReport/org.apache.hadoop.hbase.util.hbck/TestOfflineMetaRebuildBase/testMetaRebuild/ Please review and see whether the approach makes sense. If it does, I will check the other cases. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails
[ https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4868: -- Assignee: gaojinchao Status: Patch Available (was: Open) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails - Key: HBASE-4868 URL: https://issues.apache.org/jira/browse/HBASE-4868 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.92.0 Reporter: gaojinchao Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0 Attachments: HBASE-4868_trial.patch, HBASE-4868_trunkv2.patch See: https://builds.apache.org/job/HBase-TRUNK-security/7/testReport/org.apache.hadoop.hbase.util.hbck/TestOfflineMetaRebuildBase/testMetaRebuild/ Please review and see whether the approach makes sense. If it does, I will check the other cases. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4864) testRegionTransitionOperations occasional failures
[ https://issues.apache.org/jira/browse/HBASE-4864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4864: -- Attachment: HBASE-4864_Branch92.patch testRegionTransitionOperations occasional failures -- Key: HBASE-4864 URL: https://issues.apache.org/jira/browse/HBASE-4864 Project: HBase Issue Type: Bug Components: test Reporter: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0 Attachments: HBASE-4864_Branch92.patch See these logs: https://builds.apache.org/job/HBase-TRUNK-security/ws/trunk/target/surefire-reports/ It seems that we should wait until the region is added to the online region set. I made a patch; please review. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
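A sketch of the kind of wait the test needs; the helper below is an assumption for illustration, not the attached patch: poll a condition until it holds (or a deadline passes) instead of asserting immediately after the region transition.

{code:java}
import java.util.concurrent.Callable;

// Hypothetical test helper: poll a condition, e.g. "region is in the online
// set", until it holds or the timeout passes.
final class WaitUtil {
  static boolean waitFor(Callable<Boolean> condition, long timeoutMillis)
      throws Exception {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (System.currentTimeMillis() < deadline) {
      if (Boolean.TRUE.equals(condition.call())) {
        return true;
      }
      Thread.sleep(100);
    }
    return false;
  }
}
{code}

In the test this would wrap a check along the lines of "the region server's online-region lookup returns the region" (exact accessor name assumed) before the assertions run.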
[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4739: -- Attachment: HBASE-4739_Branch092.patch Patch is in branch 0.92 Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4739_trial2.patch, 4739_trialV3.patch, HBASE-4739_Branch092.patch, HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_V7.patch, HBASE-4739_trail5.patch, HBASE-4739_trial.patch, HBASE-4739_trial6.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4739: -- Attachment: HBASE-4739_trial6.patch Thanks for your review. Fixed all comments Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4739_trial2.patch, 4739_trialV3.patch, HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trail5.patch, HBASE-4739_trial.patch, HBASE-4739_trial6.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4739: -- Attachment: HBASE-4739_V7.patch Fixed Ted's comment. @Ted: do we need to make a patch for 0.90.5? :) Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4739_trial2.patch, 4739_trialV3.patch, HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_V7.patch, HBASE-4739_trail5.patch, HBASE-4739_trial.patch, HBASE-4739_trial6.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4739: -- Attachment: 4739_trialV3.patch trialV3 fixes Ram's comment. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4739_trial2.patch, 4739_trialV3.patch, HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trial.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4739: -- Attachment: HBASE-4739_trial.patch The trial version is untested and needs improvement. Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trial.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4739: -- Attachment: 4739_trial2.patch Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: 4739_trial2.patch, HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trial.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4790) TestDistributedLogSplitting occasional failure
[ https://issues.apache.org/jira/browse/HBASE-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4790: -- Attachment: HBASE-4790_Trunk.patch TestDistributedLogSplitting occasional failure -- Key: HBASE-4790 URL: https://issues.apache.org/jira/browse/HBASE-4790 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.94.0 Reporter: gaojinchao Priority: Minor Fix For: 0.94.0 Attachments: HBASE-4790_Trunk.patch See this link: https://builds.apache.org/job/PreCommit-HBASE-Build/253//testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testRecoveredEdits/ // It says that the number of regions is 0. 2011-11-15 03:53:11,215 INFO [Thread-2335] master.TestDistributedLogSplitting(211): #regions = 0 2011-11-15 03:53:11,215 DEBUG [RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer] wal.HLog$LogSyncer(1192): RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer interrupted while waiting for sync requests 2011-11-15 03:53:11,215 INFO [RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer] wal.HLog$LogSyncer(1194): RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer exiting 2011-11-15 03:53:11,215 DEBUG [Thread-2335] wal.HLog(967): closing hlog writer in hdfs://localhost:46229/user/jenkins/.logs/asf001.sp2.ygridcore.net,36721,1321329179789 2011-11-15 03:53:11,637 DEBUG [Thread-2335] master.SplitLogManager(233): Scheduling batch of logs to split -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4790) TestDistributedLogSplitting occasional failure
[ https://issues.apache.org/jira/browse/HBASE-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4790: -- Affects Version/s: (was: 0.94.0) 0.92.0 Fix Version/s: 0.92.0 Assignee: gaojinchao TestDistributedLogSplitting occasional failure -- Key: HBASE-4790 URL: https://issues.apache.org/jira/browse/HBASE-4790 Project: HBase Issue Type: Bug Components: test Affects Versions: 0.92.0 Reporter: gaojinchao Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0 Attachments: HBASE-4790_Trunk.patch See this link: https://builds.apache.org/job/PreCommit-HBASE-Build/253//testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testRecoveredEdits/ // It says that the number of regions is 0. 2011-11-15 03:53:11,215 INFO [Thread-2335] master.TestDistributedLogSplitting(211): #regions = 0 2011-11-15 03:53:11,215 DEBUG [RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer] wal.HLog$LogSyncer(1192): RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer interrupted while waiting for sync requests 2011-11-15 03:53:11,215 INFO [RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer] wal.HLog$LogSyncer(1194): RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer exiting 2011-11-15 03:53:11,215 DEBUG [Thread-2335] wal.HLog(967): closing hlog writer in hdfs://localhost:46229/user/jenkins/.logs/asf001.sp2.ygridcore.net,36721,1321329179789 2011-11-15 03:53:11,637 DEBUG [Thread-2335] master.SplitLogManager(233): Scheduling batch of logs to split -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4739: -- Attachment: HBASE-4739_Trunk.patch Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: HBASE-4739_Trunk.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever
[ https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4739: -- Attachment: HBASE-4739_Trunk_V2.patch Master dying while going to close a region can leave it in transition forever - Key: HBASE-4739 URL: https://issues.apache.org/jira/browse/HBASE-4739 Project: HBase Issue Type: Bug Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0, 0.94.0, 0.90.5 Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when the master died it had just created the RIT znode for a region but didn't tell the RS to close it yet. When the master restarted it saw the znode and started printing this: {quote} 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948 2011-11-03 00:02:49,130 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for too long, this should eventually complete or the server will expire, doing nothing {quote} It's never going to happen, and it's blocking balancing. I'm marking this as minor since I believe this situation is pretty rare unless you hit other bugs while trying out stuff to root bugs out. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4577) Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB
[ https://issues.apache.org/jira/browse/HBASE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4577: -- Attachment: HBASE-4577_trunk.patch Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB - Key: HBASE-4577 URL: https://issues.apache.org/jira/browse/HBASE-4577 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Minor Fix For: 0.92.0 Attachments: HBASE-4577_trial_Trunk.patch, HBASE-4577_trunk.patch Minor issue while looking at the RS metrics: bq. numberOfStorefiles=8, storefileUncompressedSizeMB=2418, storefileSizeMB=2420, compressionRatio=1.0008 I guess there's a truncation somewhere when it's adding the numbers up. FWIW there's no compression on that table. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
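The truncation theory is easy to demonstrate: if each store file's size is floored to whole megabytes before summing, the compressed and uncompressed totals lose different fractional amounts, so the "compressed" sum can come out larger even with no compression. A small illustration of the bug and the obvious fix (sum exact bytes, convert once); this is a sketch of the arithmetic, not the patch itself:

{code:java}
// Illustration only: why per-file MB flooring misorders the two metrics.
final class StorefileSizeMath {
  // Lossy: each file is floored to whole MB, losing up to ~1 MB per file;
  // two such sums over slightly different byte counts can order arbitrarily.
  static int sumSizesLossyMB(long[] fileBytes) {
    int mb = 0;
    for (long b : fileBytes) {
      mb += (int) (b / 1024 / 1024);
    }
    return mb;
  }

  // Exact: accumulate bytes, convert to MB once at the end.
  static int sumSizesExactMB(long[] fileBytes) {
    long total = 0;
    for (long b : fileBytes) {
      total += b;
    }
    return (int) (total / (1024 * 1024));
  }
}
{code}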
[jira] [Updated] (HBASE-4695) WAL logs get deleted before region server can fully flush
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4695: -- Attachment: HBASE-4695_Trunk_V2.patch WAL logs get deleted before region server can fully flush - Key: HBASE-4695 URL: https://issues.apache.org/jira/browse/HBASE-4695 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.90.4 Reporter: jack levin Assignee: gaojinchao Priority: Blocker Fix For: 0.90.5 Attachments: HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt To replicate the problem, do the following: 1. Check the /hbase/.logs/ directory to see if you have WAL logs for the region server you are shutting down. 2. Execute kill pid (where pid is a regionserver pid). 3. Watch the regionserver log start flushing; you will see how many regions are left to flush: 09:36:54,665 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 489 regions to close 09:56:35,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 116 regions to close 4. Check /hbase/.logs/ -- you will notice that it has disappeared. 5. Check the namenode logs: 09:26:41,607 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root ip=/10.101.1.5 cmd=delete src=/hbase/.logs/rdaa5.prod.imageshack.com,60020,1319749 Note that if you kill -9 the RS now and it crashes on flush, you won't have any WAL logs to replay. We need to make sure that logs are deleted or moved out only when the RS has fully flushed. Otherwise it's possible to lose data. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4695) WAL logs get deleted before region server can fully flush
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4695: -- Attachment: HBASE-4695_Branch90_V2.patch WAL logs get deleted before region server can fully flush - Key: HBASE-4695 URL: https://issues.apache.org/jira/browse/HBASE-4695 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.90.4 Reporter: jack levin Assignee: gaojinchao Priority: Blocker Fix For: 0.92.0, 0.90.5 Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt To replicate the problem, do the following: 1. Check the /hbase/.logs/ directory to see if you have WAL logs for the region server you are shutting down. 2. Execute kill pid (where pid is a regionserver pid). 3. Watch the regionserver log start flushing; you will see how many regions are left to flush: 09:36:54,665 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 489 regions to close 09:56:35,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 116 regions to close 4. Check /hbase/.logs/ -- you will notice that it has disappeared. 5. Check the namenode logs: 09:26:41,607 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root ip=/10.101.1.5 cmd=delete src=/hbase/.logs/rdaa5.prod.imageshack.com,60020,1319749 Note that if you kill -9 the RS now and it crashes on flush, you won't have any WAL logs to replay. We need to make sure that logs are deleted or moved out only when the RS has fully flushed. Otherwise it's possible to lose data. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4695) WAL logs get deleted before region server can fully flush
[ https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4695: -- Attachment: HBASE-4695_branch90_trial.patch I will verify this patch when I am back at the office. If you are free, please review it first; the patch seems simple. WAL logs get deleted before region server can fully flush - Key: HBASE-4695 URL: https://issues.apache.org/jira/browse/HBASE-4695 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.90.4 Reporter: jack levin Assignee: gaojinchao Priority: Blocker Fix For: 0.90.5 Attachments: HBASE-4695_branch90_trial.patch To replicate the problem, do the following: 1. Check the /hbase/.logs/ directory to see if you have WAL logs for the region server you are shutting down. 2. Execute kill pid (where pid is a regionserver pid). 3. Watch the regionserver log start flushing; you will see how many regions are left to flush: 09:36:54,665 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 489 regions to close 09:56:35,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 116 regions to close 4. Check /hbase/.logs/ -- you will notice that it has disappeared. 5. Check the namenode logs: 09:26:41,607 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root ip=/10.101.1.5 cmd=delete src=/hbase/.logs/rdaa5.prod.imageshack.com,60020,1319749 Note that if you kill -9 the RS now and it crashes on flush, you won't have any WAL logs to replay. We need to make sure that logs are deleted or moved out only when the RS has fully flushed. Otherwise it's possible to lose data. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
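The invariant the fix must enforce: a WAL file may be deleted or archived only once every region with edits in it has flushed past those edits. A hedged sketch of that guard, with names and structure assumed for illustration (it conservatively compares against the log's overall highest sequence id, not the committed patch's exact bookkeeping):

{code:java}
import java.util.Map;

// Illustrative only -- names are assumed, not the committed patch.
// lastFlushedSeqId maps an encoded region name to the highest sequence id
// known to be durably flushed into store files for that region.
final class WalCleanupGuard {
  static boolean safeToDeleteLog(long highestSeqIdInLog,
      Iterable<String> regionsWithEditsInLog,
      Map<String, Long> lastFlushedSeqId) {
    for (String region : regionsWithEditsInLog) {
      Long flushed = lastFlushedSeqId.get(region);
      // Conservative: only remove a log once every region has flushed past
      // the log's highest sequence id, so no unflushed edit can be dropped.
      if (flushed == null || flushed.longValue() < highestSeqIdInLog) {
        return false; // some edits in this log may not be flushed yet
      }
    }
    return true;
  }
}
{code}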
[jira] [Updated] (HBASE-4578) NPE when altering a table that has moving regions
[ https://issues.apache.org/jira/browse/HBASE-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4578: -- Attachment: HBASE-4578_Trunk_V1.patch NPE when altering a table that has moving regions - Key: HBASE-4578 URL: https://issues.apache.org/jira/browse/HBASE-4578 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: gaojinchao Priority: Blocker Fix For: 0.92.0 Attachments: HBASE-4578_Trunk_V1.patch, HBASE-4578_trial_Trunk.patch I'm still not 100% sure of the source of this error, but here's what I was able to get twice while altering a table that was doing a bunch of splits: {quote} 2011-10-11 23:48:59,344 INFO org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handled SPLIT report); parent=TestTable,0002608338,1318376880454.a75d6815fdfc513fb1c8aabe086c6763. daughter a=TestTable,0002608338,1318376938764.ef170ff6cd8695dc8aec92e542dc9ac1.daughter b=TestTable,0003301408,1318376938764.36eb2530341bd46888ede312c5559b5d. 2011-10-11 23:49:09,579 DEBUG org.apache.hadoop.hbase.master.handler.TableEventHandler: Ignoring table not disabled exception for supporting online schema changes. 2011-10-11 23:49:09,580 INFO org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table operation C_M_MODIFY_TABLE on table TestTable 2011-10-11 23:49:09,612 INFO org.apache.hadoop.hbase.util.FSUtils: TableInfoPath = hdfs://sv4r11s38:9100/hbase/TestTable/.tableinfo tmpPath = hdfs://sv4r11s38:9100/hbase/TestTable/.tmp/.tableinfo.1318376949612 2011-10-11 23:49:09,692 INFO org.apache.hadoop.hbase.util.FSUtils: TableDescriptor stored. TableInfoPath = hdfs://sv4r11s38:9100/hbase/TestTable/.tableinfo 2011-10-11 23:49:09,693 INFO org.apache.hadoop.hbase.util.FSUtils: Updated tableinfo=hdfs://sv4r11s38:9100/hbase/TestTable/.tableinfo to blah 2011-10-11 23:49:09,695 INFO org.apache.hadoop.hbase.master.handler.TableEventHandler: Bucketing regions by region server... 2011-10-11 23:49:09,695 DEBUG org.apache.hadoop.hbase.client.MetaScanner: Scanning .META. starting at row=TestTable,,00 for max=2147483647 rows 2011-10-11 23:49:09,709 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: The connection to hconnection-0x132f043bbde02e9 has been closed. 2011-10-11 23:49:09,709 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event C_M_MODIFY_TABLE java.lang.NullPointerException at java.util.TreeMap.getEntry(TreeMap.java:324) at java.util.TreeMap.containsKey(TreeMap.java:209) at org.apache.hadoop.hbase.master.handler.TableEventHandler.reOpenAllRegions(TableEventHandler.java:114) at org.apache.hadoop.hbase.master.handler.TableEventHandler.process(TableEventHandler.java:90) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:168) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {quote} The first time the shell reported that all the regions were updated correctly; the second time it got stuck for a while: {quote} 6/14 regions updated. 0/14 regions updated. ... 0/14 regions updated. 2/16 regions updated. ... 2/16 regions updated. 8/9 regions updated. ... 8/9 regions updated. {quote} After which I killed it, redid the alter and it worked. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4578) NPE when altering a table that has moving regions
[ https://issues.apache.org/jira/browse/HBASE-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4578: -- Attachment: HBASE-4578_trial_Trunk.patch NPE when altering a table that has moving regions - Key: HBASE-4578 URL: https://issues.apache.org/jira/browse/HBASE-4578 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Priority: Blocker Fix For: 0.92.0 Attachments: HBASE-4578_trial_Trunk.patch I'm still not 100% sure of the source of this error, but here's what I was able to get twice while altering a table that was doing a bunch of splits: {quote} 2011-10-11 23:48:59,344 INFO org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handled SPLIT report); parent=TestTable,0002608338,1318376880454.a75d6815fdfc513fb1c8aabe086c6763. daughter a=TestTable,0002608338,1318376938764.ef170ff6cd8695dc8aec92e542dc9ac1.daughter b=TestTable,0003301408,1318376938764.36eb2530341bd46888ede312c5559b5d. 2011-10-11 23:49:09,579 DEBUG org.apache.hadoop.hbase.master.handler.TableEventHandler: Ignoring table not disabled exception for supporting online schema changes. 2011-10-11 23:49:09,580 INFO org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table operation C_M_MODIFY_TABLE on table TestTable 2011-10-11 23:49:09,612 INFO org.apache.hadoop.hbase.util.FSUtils: TableInfoPath = hdfs://sv4r11s38:9100/hbase/TestTable/.tableinfo tmpPath = hdfs://sv4r11s38:9100/hbase/TestTable/.tmp/.tableinfo.1318376949612 2011-10-11 23:49:09,692 INFO org.apache.hadoop.hbase.util.FSUtils: TableDescriptor stored. TableInfoPath = hdfs://sv4r11s38:9100/hbase/TestTable/.tableinfo 2011-10-11 23:49:09,693 INFO org.apache.hadoop.hbase.util.FSUtils: Updated tableinfo=hdfs://sv4r11s38:9100/hbase/TestTable/.tableinfo to blah 2011-10-11 23:49:09,695 INFO org.apache.hadoop.hbase.master.handler.TableEventHandler: Bucketing regions by region server... 2011-10-11 23:49:09,695 DEBUG org.apache.hadoop.hbase.client.MetaScanner: Scanning .META. starting at row=TestTable,,00 for max=2147483647 rows 2011-10-11 23:49:09,709 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: The connection to hconnection-0x132f043bbde02e9 has been closed. 2011-10-11 23:49:09,709 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event C_M_MODIFY_TABLE java.lang.NullPointerException at java.util.TreeMap.getEntry(TreeMap.java:324) at java.util.TreeMap.containsKey(TreeMap.java:209) at org.apache.hadoop.hbase.master.handler.TableEventHandler.reOpenAllRegions(TableEventHandler.java:114) at org.apache.hadoop.hbase.master.handler.TableEventHandler.process(TableEventHandler.java:90) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:168) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {quote} The first time the shell reported that all the regions were updated correctly; the second time it got stuck for a while: {quote} 6/14 regions updated. 0/14 regions updated. ... 0/14 regions updated. 2/16 regions updated. ... 2/16 regions updated. 8/9 regions updated. ... 8/9 regions updated. {quote} After which I killed it, redid the alter and it worked. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
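For context on the stack trace quoted above: java.util.TreeMap rejects null keys, so TreeMap.containsKey throwing NullPointerException inside reOpenAllRegions means the lookup key was null; while regions are splitting or moving, a scan of .META. can yield entries with no server location. A hedged sketch of the defensive bucketing this suggests (all names are illustrative, not the committed patch):

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: bucket regions by server for reopening, skipping entries
// whose location is unknown (e.g. mid-split), since a map keyed by server
// name must never be probed with a null key.
final class RegionBuckets {
  static Map<String, List<String>> bucketByServer(Map<String, String> regionToServer) {
    Map<String, List<String>> buckets = new HashMap<String, List<String>>();
    for (Map.Entry<String, String> e : regionToServer.entrySet()) {
      String server = e.getValue();
      if (server == null) {
        continue; // region is in transition; reopen it on a later pass
      }
      List<String> regions = buckets.get(server);
      if (regions == null) {
        regions = new ArrayList<String>();
        buckets.put(server, regions);
      }
      regions.add(e.getKey());
    }
    return buckets;
  }
}
{code}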
[jira] [Updated] (HBASE-4219) Add Per-Column Family Metrics
[ https://issues.apache.org/jira/browse/HBASE-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4219: -- Attachment: HBASE-4219_fixtestcasefail.patch HBASE-4219_fixtestcasefail.patch fixes the TestHFileBlock failure: [INFO] Surefire report directory: /opt/g56562/trunk/target/surefire-reports --- T E S T S --- Running org.apache.hadoop.hbase.io.hfile.TestHFileBlock Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 27.732 sec Results : Tests run: 7, Failures: 0, Errors: 0, Skipped: 0 Add Per-Column Family Metrics - Key: HBASE-4219 URL: https://issues.apache.org/jira/browse/HBASE-4219 Project: HBase Issue Type: New Feature Affects Versions: 0.92.0 Reporter: Nicolas Spiegelberg Assignee: David Goode Fix For: 0.92.0 Attachments: 4219-v2.txt, 4219-v3.txt, HBASE-4219-v4.patch, HBASE-4219-v5.patch, HBASE-4219_fixtestcasefail.patch, HBASE-4219_percfmetrics_1.patch Right now, we have region server level statistics. However, the read/write flow varies a lot based on the column family involved. We should add dynamic, per column family metrics to JMX so we can track each column family individually. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4219) Add Per-Column Family Metrics
[ https://issues.apache.org/jira/browse/HBASE-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4219: -- Attachment: (was: HBASE-4219_fixtestcasefail.patch) Add Per-Column Family Metrics - Key: HBASE-4219 URL: https://issues.apache.org/jira/browse/HBASE-4219 Project: HBase Issue Type: New Feature Affects Versions: 0.92.0 Reporter: Nicolas Spiegelberg Assignee: David Goode Fix For: 0.92.0 Attachments: 4219-v2.txt, 4219-v3.txt, HBASE-4219-v4.patch, HBASE-4219-v5.patch, HBASE-4219_fixtestcasefail.patch, HBASE-4219_percfmetrics_1.patch Right now, we have region server level statistics. However, the read/write flow varies a lot based on the column family involved. We should add dynamic, per column family metrics to JMX so we can track each column family individually. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
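For the feature itself, the core mechanism is a dynamically keyed metric registry: counters are created on first touch for whatever column families actually see traffic, rather than being declared up front. A minimal sketch under assumed names (the "cf.<family>.<metric>" key scheme is an illustration, not the patch's actual naming):

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: per-column-family counters created lazily.
// A JMX exporter could then publish every entry in the map.
final class PerCFMetrics {
  private final ConcurrentMap<String, AtomicLong> counters =
      new ConcurrentHashMap<String, AtomicLong>();

  void increment(String family, String metric, long delta) {
    String key = "cf." + family + "." + metric; // key scheme assumed
    AtomicLong c = counters.get(key);
    if (c == null) {
      AtomicLong fresh = new AtomicLong();
      AtomicLong prev = counters.putIfAbsent(key, fresh);
      c = (prev != null) ? prev : fresh;
    }
    c.addAndGet(delta);
  }
}
{code}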
[jira] [Updated] (HBASE-4580) Some invalid zk nodes were created when a clean cluster restarts
[ https://issues.apache.org/jira/browse/HBASE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4580: -- Attachment: HBASE-4580_TrunkV3.patch Some invalid zk nodes were created when a clean cluster restarts Key: HBASE-4580 URL: https://issues.apache.org/jira/browse/HBASE-4580 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.92.0 Reporter: gaojinchao Assignee: gaojinchao Fix For: 0.92.0 Attachments: HBASE-4580_TrunkV1.patch, HBASE-4580_TrunkV2.patch, HBASE-4580_TrunkV3.patch The logs below show that we created an invalid zk node when a clean cluster restarted; the master mistakenly believed that the regions belonged to a dead server. 2011-10-11 05:05:29,127 INFO org.apache.hadoop.hbase.master.HMaster: Meta updated status = true 2011-10-11 05:05:29,127 INFO org.apache.hadoop.hbase.master.HMaster: ROOT/Meta already up-to date with new HRI. 2011-10-11 05:05:29,151 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 771d63e9327383159553619a4f2dc74f with OFFLINE state 2011-10-11 05:05:29,161 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 3cf860dd323fe6360f571aeafc129f95 with OFFLINE state 2011-10-11 05:05:29,170 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 4065350214452a9d5c55243c734bef08 with OFFLINE state 2011-10-11 05:05:29,178 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 4e81613f82a39fc6e5e89f96e7b3ccc4 with OFFLINE state 2011-10-11 05:05:29,187 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for e21b9e1545a28953aba0098fda5c9cd9 with OFFLINE state 2011-10-11 05:05:29,195 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 5cd9f55eecd43d088bbd505f6795131f with OFFLINE state 2011-10-11 05:05:29,229 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for db5f641452a70b09b85a92970e4198c7 with OFFLINE state 2011-10-11 05:05:29,237 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for a7b20a653919e7f41bfb2ed349af7d21 with OFFLINE state 2011-10-11 05:05:29,253 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for c9385619425f737eab1a6624d2e097a8 with OFFLINE state // we cleaned all zk nodes. 2011-10-11 05:05:29,262 INFO org.apache.hadoop.hbase.master.AssignmentManager: Clean cluster startup. 
Assigning userregions 2011-10-11 05:05:29,262 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Deleting any existing unassigned nodes 2011-10-11 05:05:29,367 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 9 region(s) across 1 server(s), retainAssignment=true 2011-10-11 05:05:29,369 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Timeout-on-RIT=9000 2011-10-11 05:05:29,369 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 9 region(s) to C3S3,54366,1318323920153 2011-10-11 05:05:29,369 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for 771d63e9327383159553619a4f2dc74f with OFFLINE state 2011-10-11 05:05:29,371 INFO org.apache.hadoop.hbase.master.HMaster: Master has completed initialization 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for 3cf860dd323fe6360f571aeafc129f95 with OFFLINE state 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for 4065350214452a9d5c55243c734bef08 with OFFLINE state 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for 4e81613f82a39fc6e5e89f96e7b3ccc4 with OFFLINE state 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for e21b9e1545a28953aba0098fda5c9cd9 with OFFLINE state 2011-10-11 05:05:29,372 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
[jira] [Updated] (HBASE-4580) Some invalid zk nodes were created when a clean cluster restarts
[ https://issues.apache.org/jira/browse/HBASE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] gaojinchao updated HBASE-4580: -- Attachment: HBASE-4580_TrunkV2.patch V2 has been reviewed. Some invalid zk nodes were created when a clean cluster restarts Key: HBASE-4580 URL: https://issues.apache.org/jira/browse/HBASE-4580 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.92.0 Reporter: gaojinchao Assignee: gaojinchao Fix For: 0.92.0 Attachments: HBASE-4580_TrunkV1.patch, HBASE-4580_TrunkV2.patch The logs below show that we created an invalid zk node when a clean cluster restarted; the master mistakenly believed that the regions belonged to a dead server. 2011-10-11 05:05:29,127 INFO org.apache.hadoop.hbase.master.HMaster: Meta updated status = true 2011-10-11 05:05:29,127 INFO org.apache.hadoop.hbase.master.HMaster: ROOT/Meta already up-to date with new HRI. 2011-10-11 05:05:29,151 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 771d63e9327383159553619a4f2dc74f with OFFLINE state 2011-10-11 05:05:29,161 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 3cf860dd323fe6360f571aeafc129f95 with OFFLINE state 2011-10-11 05:05:29,170 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 4065350214452a9d5c55243c734bef08 with OFFLINE state 2011-10-11 05:05:29,178 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 4e81613f82a39fc6e5e89f96e7b3ccc4 with OFFLINE state 2011-10-11 05:05:29,187 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for e21b9e1545a28953aba0098fda5c9cd9 with OFFLINE state 2011-10-11 05:05:29,195 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 5cd9f55eecd43d088bbd505f6795131f with OFFLINE state 2011-10-11 05:05:29,229 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for db5f641452a70b09b85a92970e4198c7 with OFFLINE state 2011-10-11 05:05:29,237 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for a7b20a653919e7f41bfb2ed349af7d21 with OFFLINE state 2011-10-11 05:05:29,253 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for c9385619425f737eab1a6624d2e097a8 with OFFLINE state // we cleaned all zk nodes. 2011-10-11 05:05:29,262 INFO org.apache.hadoop.hbase.master.AssignmentManager: Clean cluster startup. 
Assigning userregions 2011-10-11 05:05:29,262 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Deleting any existing unassigned nodes 2011-10-11 05:05:29,367 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 9 region(s) across 1 server(s), retainAssignment=true 2011-10-11 05:05:29,369 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Timeout-on-RIT=9000 2011-10-11 05:05:29,369 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 9 region(s) to C3S3,54366,1318323920153 2011-10-11 05:05:29,369 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for 771d63e9327383159553619a4f2dc74f with OFFLINE state 2011-10-11 05:05:29,371 INFO org.apache.hadoop.hbase.master.HMaster: Master has completed initialization 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for 3cf860dd323fe6360f571aeafc129f95 with OFFLINE state 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for 4065350214452a9d5c55243c734bef08 with OFFLINE state 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for 4e81613f82a39fc6e5e89f96e7b3ccc4 with OFFLINE state 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for e21b9e1545a28953aba0098fda5c9cd9 with OFFLINE state 2011-10-11 05:05:29,372 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
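The log ordering above points at the shape of the fix: on a clean startup, leftover unassigned znodes should be swept away before any new OFFLINE nodes are created, so nodes created earlier for regions mistakenly attributed to a dead server cannot survive into bulk assignment. Below is a minimal sketch of that sweep against the raw ZooKeeper client; the CleanStartupSweeper class and the choice of "/hbase/unassigned" as the parent path are illustrative stand-ins, not the attached patch.

    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    public class CleanStartupSweeper {
        private final ZooKeeper zk;
        private final String unassignedParent; // e.g. "/hbase/unassigned"

        public CleanStartupSweeper(ZooKeeper zk, String unassignedParent) {
            this.zk = zk;
            this.unassignedParent = unassignedParent;
        }

        // On a clean cluster startup, delete every leftover unassigned znode
        // BEFORE bulk assignment creates fresh OFFLINE nodes, so earlier
        // stale creations cannot race with the new assignment round.
        public void sweep() throws KeeperException, InterruptedException {
            List<String> children = zk.getChildren(unassignedParent, false);
            for (String child : children) {
                try {
                    zk.delete(unassignedParent + "/" + child, -1); // -1 matches any version
                } catch (KeeperException.NoNodeException e) {
                    // Already gone; harmless during a clean startup.
                }
            }
        }
    }

Run before the "Creating (or updating) unassigned node ... with OFFLINE state" phase, this makes the later "Deleting any existing unassigned nodes" step a no-op instead of a cleanup racing against fresh creations.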
[jira] [Updated] (HBASE-4511) There is data loss when the master fails over
[ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao updated HBASE-4511:
--
Priority: Minor (was: Critical)

There is data loss when the master fails over

Key: HBASE-4511
URL: https://issues.apache.org/jira/browse/HBASE-4511
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Minor
Fix For: 0.92.0
Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar

It goes like this:
The master crashed; at the same time the RS hosting meta was crashing, but the RS process had not yet exited.
The master started up again and found all the "living" RSs.
The master's verification of meta failed, because that RS was in the middle of crashing.
The master then reassigned meta without splitting the HLog, so some meta data was lost.

Below are the logs from the failing failover test case.

// It says that we want to kill an RS
2011-09-28 19:54:45,694 INFO [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
2011-09-28 19:54:45,694 INFO [Thread-988] master.TestMasterFailover(1007): RS 192.168.2.102,54385,1317264874629 killed
// The RS had not actually exited.
2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
// Meta verification failed and meta was reassigned, so all the regions in meta were lost.
2011-09-28 19:54:51,773 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
2011-09-28 19:54:52,277 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
2011-09-28 19:54:52,782 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event, type=NodeCreated, state=SyncConnected, path=/hbase/unassigned/1028785192
// It said
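The failure sequence in the description comes down to an ordering bug: when verification of .META. fails because its region server is dead, that server's HLogs must be split (so WAL-only edits get replayed) before meta is reassigned. A hedged sketch of that ordering follows; the MetaVerifier, LogSplitter, and Assigner interfaces are stand-ins for illustration, not the real CatalogTracker and log-splitting APIs.

    import java.io.IOException;

    // Sketch of the recovery ordering the report argues for: split the dead
    // server's logs BEFORE reassigning meta, so edits that existed only in
    // the WAL are replayed rather than silently dropped.
    public class MetaRecoverySketch {
        interface MetaVerifier { boolean verifyMetaRegionLocation(String serverName) throws IOException; }
        interface LogSplitter { void splitLogsFor(String serverName) throws IOException; }
        interface Assigner { void assignMeta() throws IOException; }

        private final MetaVerifier verifier;
        private final LogSplitter splitter;
        private final Assigner assigner;

        MetaRecoverySketch(MetaVerifier v, LogSplitter s, Assigner a) {
            this.verifier = v;
            this.splitter = s;
            this.assigner = a;
        }

        void recoverMeta(String lastKnownMetaServer) throws IOException {
            if (verifier.verifyMetaRegionLocation(lastKnownMetaServer)) {
                return; // meta is healthy; nothing to recover
            }
            // The reported bug: reassigning here without splitting loses WAL-only edits.
            splitter.splitLogsFor(lastKnownMetaServer); // replay pending meta edits first
            assigner.assignMeta();                      // only then bring meta back online
        }
    }

The one-line difference, calling the splitter before the assigner, is what keeps edits that lived only in the dead server's WAL from vanishing when meta comes back online.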
[jira] [Updated] (HBASE-4580) Create some invalid zk nodes when a clean cluster starts.
[ https://issues.apache.org/jira/browse/HBASE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao updated HBASE-4580:
--
Attachment: HBASE-4580_TrunkV1.patch

Create some invalid zk nodes when a clean cluster starts.

Key: HBASE-4580
URL: https://issues.apache.org/jira/browse/HBASE-4580
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Assignee: gaojinchao
Fix For: 0.92.0
Attachments: HBASE-4580_TrunkV1.patch

The logs below show that invalid zk nodes were created when a clean cluster was restarted: the master mistakenly believed that the regions belonged to a dead server.

2011-10-11 05:05:29,127 INFO org.apache.hadoop.hbase.master.HMaster: Meta updated status = true
2011-10-11 05:05:29,127 INFO org.apache.hadoop.hbase.master.HMaster: ROOT/Meta already up-to date with new HRI.
2011-10-11 05:05:29,151 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 771d63e9327383159553619a4f2dc74f with OFFLINE state
2011-10-11 05:05:29,161 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 3cf860dd323fe6360f571aeafc129f95 with OFFLINE state
2011-10-11 05:05:29,170 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 4065350214452a9d5c55243c734bef08 with OFFLINE state
2011-10-11 05:05:29,178 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 4e81613f82a39fc6e5e89f96e7b3ccc4 with OFFLINE state
2011-10-11 05:05:29,187 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for e21b9e1545a28953aba0098fda5c9cd9 with OFFLINE state
2011-10-11 05:05:29,195 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 5cd9f55eecd43d088bbd505f6795131f with OFFLINE state
2011-10-11 05:05:29,229 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for db5f641452a70b09b85a92970e4198c7 with OFFLINE state
2011-10-11 05:05:29,237 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for a7b20a653919e7f41bfb2ed349af7d21 with OFFLINE state
2011-10-11 05:05:29,253 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Creating (or updating) unassigned node for c9385619425f737eab1a6624d2e097a8 with OFFLINE state
// we cleaned all zk nodes.
2011-10-11 05:05:29,262 INFO org.apache.hadoop.hbase.master.AssignmentManager: Clean cluster startup. Assigning userregions
2011-10-11 05:05:29,262 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Deleting any existing unassigned nodes
2011-10-11 05:05:29,367 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 9 region(s) across 1 server(s), retainAssignment=true
2011-10-11 05:05:29,369 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Timeout-on-RIT=9000
2011-10-11 05:05:29,369 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 9 region(s) to C3S3,54366,1318323920153
2011-10-11 05:05:29,369 INFO org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done
2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for 771d63e9327383159553619a4f2dc74f with OFFLINE state
2011-10-11 05:05:29,371 INFO org.apache.hadoop.hbase.master.HMaster: Master has completed initialization
2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for 3cf860dd323fe6360f571aeafc129f95 with OFFLINE state
2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for 4065350214452a9d5c55243c734bef08 with OFFLINE state
2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for 4e81613f82a39fc6e5e89f96e7b3ccc4 with OFFLINE state
2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for e21b9e1545a28953aba0098fda5c9cd9 with OFFLINE state
2011-10-11 05:05:29,372 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:58198-0x132f23a9a38 Async create of unassigned node for
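A second detail visible in the same log: "Bulk assigning done" and "Master has completed initialization" are emitted before the async OFFLINE-node creations have completed. If completion is only meaningful once the znodes actually exist, the async creates need to be awaited. Here is a sketch using ZooKeeper's asynchronous create API with a latch; the class name and the empty payload are illustrative assumptions, not HBase's actual node data.

    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.AsyncCallback.StringCallback;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    // Sketch: don't declare bulk assignment "done" until every async
    // OFFLINE-node create has been acknowledged by the ZK server.
    public class AsyncOfflineCreator {
        public static void createAllAndWait(ZooKeeper zk, String parent, List<String> regions)
                throws InterruptedException {
            final CountDownLatch pending = new CountDownLatch(regions.size());
            // The callback fires once per create, success or failure alike.
            StringCallback cb = (rc, path, ctx, name) -> pending.countDown();
            for (String region : regions) {
                zk.create(parent + "/" + region,
                        new byte[0] /* OFFLINE payload elided in this sketch */,
                        Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT, cb, null);
            }
            pending.await(); // only now is it safe to log "Bulk assigning done"
        }
    }

Awaiting the latch keeps the "done" and "completed initialization" log lines from racing ahead of the znode creations they describe.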
[jira] [Updated] (HBASE-4511) There is data loss when the master fails over
[ https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaojinchao updated HBASE-4511:
--
Attachment: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar

Some logs were provided by Ted.

There is data loss when the master fails over

Key: HBASE-4511
URL: https://issues.apache.org/jira/browse/HBASE-4511
Project: HBase
Issue Type: Bug
Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Critical
Fix For: 0.92.0
Attachments: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar

It goes like this:
The master crashed; at the same time the RS hosting meta was crashing, but the RS process had not yet exited.
The master started up again and found all the "living" RSs.
The master's verification of meta failed, because that RS was in the middle of crashing.
The master then reassigned meta without splitting the HLog, so some meta data was lost.

Below are the logs from the failing failover test case.

// It says that we want to kill an RS
2011-09-28 19:54:45,694 INFO [Thread-988] regionserver.HRegionServer(1443): STOPPED: Killing for unit test
2011-09-28 19:54:45,694 INFO [Thread-988] master.TestMasterFailover(1007): RS 192.168.2.102,54385,1317264874629 killed
// The RS had not actually exited.
2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.HMaster(458): Registering server found up in zk: 192.168.2.102,54385,1317264874629
2011-09-28 19:54:51,763 INFO [Master:0;192.168.2.102,54557,1317264885720] master.ServerManager(232): Registering server=192.168.2.102,54385,1317264874629
2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of znode /hbase/unassigned/1028785192 because node does not exist (not an error)
2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
// Meta verification failed and meta was reassigned, so all the regions in meta were lost.
2011-09-28 19:54:51,773 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
2011-09-28 19:54:52,277 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) of data from znode /hbase/root-region-server and set watcher; 192.168.2.102,54383,131726487...
2011-09-28 19:54:52,782 INFO [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(476): Failed verification of .META.,,1 at address=192.168.2.102,54385,1317264874629; org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 192.168.2.102,54385,1317264874629 not running, aborting
2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] catalog.CatalogTracker(316): new .META. server: 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or updating) unassigned node for 1028785192 with OFFLINE state
2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received ZooKeeper Event,