[jira] [Updated] (HBASE-5545) region can't be opened for a long time. Because the creating File failed.

2012-03-08 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5545:
--

Description: 
Scenario:

1. The file is created.
2. While writing data, all the datanodes might crash, so writing data 
will fail.
3. Even if close is called in the finally block, the close will also fail and 
throw an exception because writing the data failed.
4. If the RS then tries to create the same file again, 
AlreadyBeingCreatedException will be thrown.

Suggestion to handle this scenario:
---
1. Check for the existence of the file; if it exists, delete the file and create 
a new file (see the sketch below). 

The delete call on the file does not check whether the file is open or closed.

Overwrite Option:

1. The overwrite option is applicable only when you are overwriting a closed 
file.
2. If the file is not closed, the same AlreadyBeingCreatedException will be 
thrown even with the overwrite option.
This is the expected behaviour, to avoid multiple clients writing to the same 
file.
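
A minimal sketch of the suggested handling, assuming it would live near 
HRegion.checkRegioninfoOnFilesystem (this only illustrates the proposal above, 
not the committed HBASE-5545 change; the class name is made up):

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class RegioninfoRewriteSketch {
  /**
   * Recreate the .tmp/.regioninfo file even if a previous attempt died while the
   * file was still open on the NameNode.
   */
  public static FSDataOutputStream recreate(FileSystem fs, Path tmpRegioninfo)
      throws IOException {
    if (fs.exists(tmpRegioninfo)) {
      // Per the suggestion above: delete() does not care whether the stale file
      // was ever closed, so it clears the half-written leftover.
      fs.delete(tmpRegioninfo, true);
    }
    // A plain create(); the overwrite flag alone would not help for an unclosed file.
    return fs.create(tmpRegioninfo);
  }
}
{code}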


Region server logs:

org.apache.hadoop.ipc.RemoteException: 
org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create 
file /hbase/test1/12c01902324218d14b17a5880f24f64b/.tmp/.regioninfo for 
DFSClient_hb_rs_158-1-131-48,20020,1331107668635_1331107669061_-252463556_25 on 
client 158.1.132.19 because current leaseholder is trying to recreate file.
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1570)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1440)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1382)
at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:658)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:547)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1137)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1133)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1131)

at org.apache.hadoop.ipc.Client.call(Client.java:961)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:245)
at $Proxy6.create(Unknown Source)
at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invokeMethod(RPCRetryAndSwitchInvoker.java:201)
at 
com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invokeMethod(RPCRetryAndSwitchInvoker.java:189)
at 
com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker.invoke(RPCRetryAndSwitchInvoker.java:77)
at $Proxy6.create(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.init(DFSClient.java:3643)
at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:778)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:364)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:630)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:611)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:518)
at 
org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:424)
at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:340)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2672)
at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2658)
at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:330)
at 
org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:116)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:158)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2012-03-07 20:51:45,858] [WARN ] 
[RS_OPEN_REGION-158-1-131-48,20020,1331107668635-23] 
[com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker 131] Retrying the 
method call: public abstract void 
org.apache.hadoop.hdfs.protocol.ClientProtocol.create(java.lang.String,org.apache.hadoop.fs.permission.FsPermission,java.lang.String,boolean,boolean,short,long)
 throws java.io.IOException with arguments of length: 7. The exisiting 
ActiveServerConnection is:
ActiveServerConnectionInfo:
Metadata:158-1-131-48/158.1.132.19:9000
Version:145720623220907

[2012-03-07 20:51:45,872] [DEBUG] 
[RS_OPEN_REGION-158-1-131-48,20020,1331107668635-20] 
[org.apache.hadoop.hbase.zookeeper.ZKAssign 849] 
regionserver:20020-0x135ec32b39e0002-0x135ec32b39e0002 Successfully 
transitioned node 91bf3e6f8adb2e4b335f061036353126 from M_ZK_REGION_OFFLINE to 
RS_ZK_REGION_OPENING
[2012-03-07 20:51:45,873] [DEBUG] 

[jira] [Updated] (HBASE-5488) Fixed OfflineMetaRepair bug

2012-02-29 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5488:
--

Fix Version/s: 0.92.1
   Status: Patch Available  (was: Open)

 Fixed OfflineMetaRepair bug 
 

 Key: HBASE-5488
 URL: https://issues.apache.org/jira/browse/HBASE-5488
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.90.7, 0.92.1

 Attachments: HBASE-5488-trunk.patch, HBASE-5488_branch90.txt


 I wanted to use the OfflineMetaRepair tool and found that nobody had fixed this 
 bug. I will make a patch.
  12/01/05 23:23:30 ERROR util.HBaseFsck: Bailed out due to:
  java.lang.IllegalArgumentException: Wrong FS: hdfs:// 
  us01-ciqps1-name01.carrieriq.com:9000/hbase/M2M-INTEGRATION-MM_TION-13
  25190318714/0003d2ede27668737e192d8430dbe5d0/.regioninfo,
  expected: file:///
 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:352)
 at
  org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
 at
  org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:368)
 at
  org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
 at
  org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:126)
 at
  org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398)
 at
  org.apache.hadoop.hbase.util.HBaseFsck.loadMetaEntry(HBaseFsck.java:256)
 at
  org.apache.hadoop.hbase.util.HBaseFsck.loadTableInfo(HBaseFsck.java:284)
 at
  org.apache.hadoop.hbase.util.HBaseFsck.rebuildMeta(HBaseFsck.java:402)
 at
  org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair.main(OfflineMetaRe
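
The "Wrong FS ... expected: file:///" failure above is the usual symptom of resolving an hdfs:// path against the default (local) FileSystem. A hedged sketch of that general fix pattern follows; it illustrates the nature of the bug and is not the content of the attached patches:

{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class RegioninfoReaderSketch {
  public static FSDataInputStream openRegioninfo(Path regioninfoPath, Configuration conf)
      throws IOException {
    // FileSystem.get(conf) returns whatever fs.default.name points at (possibly
    // file:///); asking the Path for its own FileSystem keeps hdfs:// paths on
    // HDFS and avoids the IllegalArgumentException from checkPath().
    FileSystem fs = regioninfoPath.getFileSystem(conf);
    return fs.open(regioninfoPath);
  }
}
{code}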





[jira] [Updated] (HBASE-5488) Fixed OfflineMetaRepair bug

2012-02-29 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5488:
--

Attachment: HBASE-5488-branch92.patch

 Fixed OfflineMetaRepair bug 
 

 Key: HBASE-5488
 URL: https://issues.apache.org/jira/browse/HBASE-5488
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.90.7, 0.92.1

 Attachments: HBASE-5488-branch92.patch, HBASE-5488-trunk.patch, 
 HBASE-5488_branch90.txt


 I wanted to use the OfflineMetaRepair tool and found that nobody had fixed this 
 bug. I will make a patch.
  12/01/05 23:23:30 ERROR util.HBaseFsck: Bailed out due to:
  java.lang.IllegalArgumentException: Wrong FS: hdfs:// 
  us01-ciqps1-name01.carrieriq.com:9000/hbase/M2M-INTEGRATION-MM_TION-13
  25190318714/0003d2ede27668737e192d8430dbe5d0/.regioninfo,
  expected: file:///
 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:352)
 at
  org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
 at
  org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:368)
 at
  org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
 at
  org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:126)
 at
  org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398)
 at
  org.apache.hadoop.hbase.util.HBaseFsck.loadMetaEntry(HBaseFsck.java:256)
 at
  org.apache.hadoop.hbase.util.HBaseFsck.loadTableInfo(HBaseFsck.java:284)
 at
  org.apache.hadoop.hbase.util.HBaseFsck.rebuildMeta(HBaseFsck.java:402)
 at
  org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair.main(OfflineMetaRe





[jira] [Updated] (HBASE-5488) Fixed OfflineMetaRepair bug

2012-02-28 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5488:
--

Attachment: HBASE-5488_branch90.txt

 Fixed OfflineMetaRepair bug 
 

 Key: HBASE-5488
 URL: https://issues.apache.org/jira/browse/HBASE-5488
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.6
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.90.7

 Attachments: HBASE-5488_branch90.txt


 I wanted to use the OfflineMetaRepair tool and found that nobody had fixed this 
 bug. I will make a patch.
  12/01/05 23:23:30 ERROR util.HBaseFsck: Bailed out due to:
  java.lang.IllegalArgumentException: Wrong FS: hdfs:// 
  us01-ciqps1-name01.carrieriq.com:9000/hbase/M2M-INTEGRATION-MM_TION-13
  25190318714/0003d2ede27668737e192d8430dbe5d0/.regioninfo,
  expected: file:///
 at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:352)
 at
  org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
 at
  org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:368)
 at
  org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
 at
  org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.init(ChecksumFileSystem.java:126)
 at
  org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:284)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398)
 at
  org.apache.hadoop.hbase.util.HBaseFsck.loadMetaEntry(HBaseFsck.java:256)
 at
  org.apache.hadoop.hbase.util.HBaseFsck.loadTableInfo(HBaseFsck.java:284)
 at
  org.apache.hadoop.hbase.util.HBaseFsck.rebuildMeta(HBaseFsck.java:402)
 at
  org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair.main(OfflineMetaRe





[jira] [Updated] (HBASE-5379) Backport HBASE-4287 to 0.90 - If region opening fails, try to transition region back to offline in ZK

2012-02-11 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5379:
--

Attachment: HBASE-4287.text

 Backport HBASE-4287 to 0.90 - If region opening fails, try to transition 
 region back to offline in ZK
 ---

 Key: HBASE-5379
 URL: https://issues.apache.org/jira/browse/HBASE-5379
 Project: HBase
  Issue Type: Bug
Reporter: ramkrishna.s.vasudevan
Assignee: ramkrishna.s.vasudevan
 Fix For: 0.90.7

 Attachments: HBASE-4287.text


 This fix is needed in 0.90 as well; otherwise, if region assignment fails, we 
 have to wait 30 minutes.





[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-19 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5179:
--

Attachment: Errorlog

When the region server checked in, I killed the meta/root region. I found that 
splitHlog ran after meta was assigned, so I think there may be another problem. 
I will continue to dig tomorrow.

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v10.patch, 5179-90v11.patch, 
 5179-90v12.patch, 5179-90v2.patch, 5179-90v3.patch, 5179-90v4.patch, 
 5179-90v5.patch, 5179-90v6.patch, 5179-90v7.patch, 5179-90v8.patch, 
 5179-90v9.patch, 5179-v11-92.txt, 5179-v11.txt, 5179-v2.txt, 5179-v3.txt, 
 5179-v4.txt, Errorlog, hbase-5179.patch, hbase-5179v10.patch, 
 hbase-5179v12.patch, hbase-5179v5.patch, hbase-5179v6.patch, 
 hbase-5179v7.patch, hbase-5179v8.patch, hbase-5179v9.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts assigning the regions of RegionserverA because step 3 
 marked it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may still 
 be splitting the log, which can cause data loss.
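
A simplified, hypothetical sketch of the ordering constraint described above (class and method names are invented for illustration and do not reflect the attached patches): the master should not assign regions of a dead server while that server's logs may still be under split.

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class DeadServerLogGuard {
  // Dead servers whose WALs have not been fully split yet (hypothetical bookkeeping).
  private final Set<String> logsBeingSplit = ConcurrentHashMap.newKeySet();

  void logSplitStarted(String serverName)  { logsBeingSplit.add(serverName); }
  void logSplitFinished(String serverName) { logsBeingSplit.remove(serverName); }

  /** Failover code would call this before assigning a region last hosted on serverName. */
  boolean safeToAssign(String serverName) {
    // Assigning while ServerShutdownHandler is still splitting this server's log
    // is exactly the data-loss window of steps 1-4 above.
    return !logsBeingSplit.contains(serverName);
  }
}
{code}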





[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region to be assigned before log splitting is completed, causing data loss

2012-01-14 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5179:
--

Attachment: 5179-90v4.patch

 Concurrent processing of processFaileOver and ServerShutdownHandler may cause 
 region to be assigned before log splitting is completed, causing data loss
 

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
Priority: Critical
 Fix For: 0.92.0, 0.94.0, 0.90.6

 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-90v4.patch, 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, 
 hbase-5179v5.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts assigning the regions of RegionserverA because step 3 
 marked it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may still 
 be splitting the log, which can cause data loss.





[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

2012-01-12 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5179:
--

Attachment: 5179-90v3.patch

Please review it first. I will verify it tomorrow.

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts assigning the regions of RegionserverA because step 3 
 marked it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may still 
 be splitting the log, which can cause data loss.





[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

2012-01-12 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5179:
--

Attachment: (was: 5179-90v3.patch)

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-v2.txt, 5179-v3.txt, 
 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts assigning the regions of RegionserverA because step 3 
 marked it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may still 
 be splitting the log, which can cause data loss.





[jira] [Updated] (HBASE-5179) Concurrent processing of processFaileOver and ServerShutdownHandler may cause region is assigned before completing split log, it would cause data loss

2012-01-12 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5179:
--

Attachment: 5179-90v3.patch

 Concurrent processing of processFaileOver and ServerShutdownHandler  may 
 cause region is assigned before completing split log, it would cause data loss
 ---

 Key: HBASE-5179
 URL: https://issues.apache.org/jira/browse/HBASE-5179
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.90.2
Reporter: chunhui shen
Assignee: chunhui shen
 Attachments: 5179-90.txt, 5179-90v2.patch, 5179-90v3.patch, 
 5179-v2.txt, 5179-v3.txt, 5179-v4.txt, hbase-5179.patch, hbase-5179v5.patch


 If the master's failover processing and ServerShutdownHandler's processing 
 happen concurrently, the following case may appear:
 1. The master completes splitLogAfterStartup().
 2. RegionserverA restarts, and ServerShutdownHandler starts processing it.
 3. The master starts rebuildUserRegions, and RegionserverA is considered a 
 dead server.
 4. The master starts assigning the regions of RegionserverA because step 3 
 marked it as a dead server.
 However, while step 4 (assigning regions) runs, ServerShutdownHandler may still 
 be splitting the log, which can cause data loss.





[jira] [Updated] (HBASE-5178) Backport HBASE-4101 - Regionserver Deadlock

2012-01-10 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5178:
--

Attachment: HBASE-4101_0.90_1.patch

 Backport HBASE-4101 - Regionserver Deadlock
 ---

 Key: HBASE-5178
 URL: https://issues.apache.org/jira/browse/HBASE-5178
 Project: HBase
  Issue Type: Bug
Reporter: ramkrishna.s.vasudevan
 Attachments: HBASE-4101_0.90_1.patch


 Critical issue not merged to 0.90.  





[jira] [Updated] (HBASE-5120) Timeout monitor races with table disable handler

2012-01-04 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5120:
--

Attachment: HBASE-5120.patch

Patch is attached so that I can access it at home. It is not the final one and 
has not been fully tested in a cluster.

 Timeout monitor races with table disable handler
 

 Key: HBASE-5120
 URL: https://issues.apache.org/jira/browse/HBASE-5120
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Zhihong Yu
Priority: Blocker
 Attachments: HBASE-5120.patch


 Here is what J-D described here:
 https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176
 I think I will retract my statement that it used to be extremely racy 
 and caused more trouble than it fixed; on my first test I got a stuck 
 region in transition instead of being able to recover. The timeout was set to 
 2 minutes to be sure I hit it.
 First the region gets closed
 {quote}
 2012-01-04 00:16:25,811 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to 
 sv4r5s38,62023,1325635980913 for region 
 test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
 {quote}
 2 minutes later it times out:
 {quote}
 2012-01-04 00:18:30,026 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
 state=PENDING_CLOSE, ts=1325636185810, server=null
 2012-01-04 00:18:30,026 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been 
 PENDING_CLOSE for too long, running forced unassign again on 
 region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
 2012-01-04 00:18:30,027 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of 
 region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
 (offlining)
 {quote}
 100ms later the master finally gets the event:
 {quote}
 2012-01-04 00:18:30,129 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Handling 
 transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, 
 region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late
 2012-01-04 00:18:30,129 DEBUG 
 org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED 
 event for 1a4b111bcc228043e89f59c4c3f6a791
 2012-01-04 00:18:30,129 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so 
 deleting ZK node and removing from regions in transition, skipping assignment 
 of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791.
 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:62003-0x134589d3db03587 Deleting existing unassigned node for 
 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED
 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:62003-0x134589d3db03587 Successfully deleted unassigned node for 
 region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED
 {quote}
 At this point everything is fine, the region was processed as closed. But 
 wait, remember that line where it said it was going to force an unassign?
 {quote}
 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:62003-0x134589d3db03587 Creating unassigned node for 
 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state
 2012-01-04 00:18:30,328 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Server null returned 
 java.lang.NullPointerException: Passed server is null for 
 1a4b111bcc228043e89f59c4c3f6a791
 {quote}
 Now the master is confused, it recreated the RIT znode but the region doesn't 
 even exist anymore. It even tries to shut it down but is blocked by NPEs. Now 
 this is what's going on.
 The late ZK notification that the znode was deleted (but it got recreated 
 after):
 {quote}
 2012-01-04 00:19:33,285 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: The znode of region 
 test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been 
 deleted.
 {quote}
 Then it prints this, and much later tries to unassign it again:
 {quote}
 2012-01-04 00:19:46,607 DEBUG 
 org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region 
 to clear regions in transition; 
 test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
 state=PENDING_CLOSE, ts=1325636310328, server=null
 ...
 2012-01-04 00:20:39,623 DEBUG 
 org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on  region 
 to clear regions in transition; 
 test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 
 state=PENDING_CLOSE, ts=1325636310328, server=null
 2012-01-04 
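
The race J-D walks through above comes down to the timeout monitor acting on stale state: by the time it forces another unassign, the CLOSED event has already been handled and the table is being disabled. A small, purely hypothetical sketch of the kind of re-check that would close that window (names are invented; this is not the attached HBASE-5120.patch):

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class TimeoutMonitorSketch {
  enum RegionState { PENDING_CLOSE, PENDING_OPEN, CLOSED, OFFLINE }

  private final Map<String, RegionState> regionsInTransition = new ConcurrentHashMap<>();
  private final Set<String> disablingTables = ConcurrentHashMap.newKeySet();

  /** Fired when a region has sat in PENDING_CLOSE longer than the RIT timeout. */
  void onPendingCloseTimeout(String regionName, String tableName) {
    // Re-check right before acting: the CLOSED event may have been processed (and
    // the unassigned znode deleted) between the timeout firing and this call.
    RegionState state = regionsInTransition.get(regionName);
    if (state != RegionState.PENDING_CLOSE || disablingTables.contains(tableName)) {
      return; // already handled, or owned by the disable handler; do not recreate the znode
    }
    forceUnassign(regionName);
  }

  private void forceUnassign(String regionName) {
    // would recreate the CLOSING znode and resend CLOSE to the hosting server
  }
}
{code}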

[jira] [Updated] (HBASE-5060) HBase client is blocked forever

2011-12-18 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5060:
--

Attachment: HBASE-5060_trunk.patch

Patch for trunk


 HBase client is blocked forever
 ---

 Key: HBASE-5060
 URL: https://issues.apache.org/jira/browse/HBASE-5060
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Critical
 Fix For: 0.90.6

 Attachments: HBASE-5060_Branch90trial.patch, HBASE-5060_trunk.patch


 The client had a temporary network failure. After it recovered,
 I found that my client thread was blocked. 
 Looking at the stack and logs below, we are using an invalid CatalogTracker in 
 the tableExists function.
 Blocked stack:
 WriteHbaseThread33 prio=10 tid=0x7f76bc27a800 nid=0x2540 in 
 Object.wait() [0x7f76af4f3000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331)
  - locked 0x7f7a67817c98 (a 
 java.util.concurrent.atomic.AtomicBoolean)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
  at 
 org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:427)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:164)
  at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown 
 Source)
  at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
  - locked 0x7f7a4c5dc578 (a com.huawei.hdi.hbase.HbaseReOper)
  at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
  at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)
 In ZooKeeperNodeTracker, we don't throw the KeeperException up to the higher level,
 so at the CatalogTracker level we think the ZooKeeperNodeTracker started successfully and
 continue processing.
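
A minimal sketch of the behaviour this report argues for, assuming a tracker-style start() that currently swallows the KeeperException (illustrative only; not the attached HBASE-5060 patch): surface the failure to the caller so tableExists() fails fast instead of waiting on meta forever.

{code:java}
import java.io.IOException;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

final class NodeTrackerSketch {
  private final ZooKeeper zk;
  private final String node;
  private volatile byte[] data;

  NodeTrackerSketch(ZooKeeper zk, String node) {
    this.zk = zk;
    this.node = node;
  }

  /** Read the node once and set a watch; rethrow failures instead of only logging them. */
  void start() throws IOException {
    try {
      data = zk.getData(node, true, null);
    } catch (KeeperException ke) {
      // Do not pretend the tracker started: otherwise the caller blocks waiting
      // for data that was never read (e.g. after a ConnectionLoss).
      throw new IOException("Failed to read " + node, ke);
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while reading " + node, ie);
    }
  }
}
{code}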
 [WriteHbaseThread33]2011-12-16 17:07:33,153[WARN ]  | 
 hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Unable to 
 get data of znode /hbase/root-region-server | 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:557)
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss for /hbase/root-region-server
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
  at 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
  at 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162)
  at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown 
 Source)
  at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
  at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
  at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)
 [WriteHbaseThread33]2011-12-16 17:07:33,361[ERROR]  | 
 hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Received 
 unexpected KeeperException, re-throwing exception | 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.keeperException(ZooKeeperWatcher.java:385)
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss for /hbase/root-region-server
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
  at 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
  at 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162)
  at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown 
 Source)
  at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
  at 

[jira] [Updated] (HBASE-5060) HBase client is blocked forever

2011-12-18 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5060:
--

Fix Version/s: 0.92.1
   Status: Patch Available  (was: Open)

 HBase client is blocked forever
 ---

 Key: HBASE-5060
 URL: https://issues.apache.org/jira/browse/HBASE-5060
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Critical
 Fix For: 0.92.1, 0.90.6

 Attachments: HBASE-5060_Branch90trial.patch, HBASE-5060_trunk.patch


 The client had a temporary network failure. After it recovered,
 I found that my client thread was blocked. 
 Looking at the stack and logs below, we are using an invalid CatalogTracker in 
 the tableExists function.
 Blocked stack:
 WriteHbaseThread33 prio=10 tid=0x7f76bc27a800 nid=0x2540 in 
 Object.wait() [0x7f76af4f3000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331)
  - locked 0x7f7a67817c98 (a 
 java.util.concurrent.atomic.AtomicBoolean)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
  at 
 org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:427)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:164)
  at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown 
 Source)
  at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
  - locked 0x7f7a4c5dc578 (a com.huawei.hdi.hbase.HbaseReOper)
  at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
  at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)
 In ZooKeeperNodeTracker, we don't throw the KeeperException up to the higher level,
 so at the CatalogTracker level we think the ZooKeeperNodeTracker started successfully and
 continue processing.
 [WriteHbaseThread33]2011-12-16 17:07:33,153[WARN ]  | 
 hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Unable to 
 get data of znode /hbase/root-region-server | 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:557)
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss for /hbase/root-region-server
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
  at 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
  at 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162)
  at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown 
 Source)
  at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
  at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
  at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)
 [WriteHbaseThread33]2011-12-16 17:07:33,361[ERROR]  | 
 hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Received 
 unexpected KeeperException, re-throwing exception | 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.keeperException(ZooKeeperWatcher.java:385)
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss for /hbase/root-region-server
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
  at 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
  at 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162)
  at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown 
 Source)
  at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
  at 

[jira] [Updated] (HBASE-5060) HBase client is blocked forever

2011-12-17 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5060:
--

Attachment: HBASE-5060_Branch90trial.patch

1. This issue is difficult to fix. I made a trial version.
2. I have checked all users of ZooKeeperNodeTracker; they seem fine.

 HBase client is blocked forever
 ---

 Key: HBASE-5060
 URL: https://issues.apache.org/jira/browse/HBASE-5060
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Critical
 Fix For: 0.90.6

 Attachments: HBASE-5060_Branch90trial.patch


 The client had a temporary network failure. After it recovered,
 I found that my client thread was blocked. 
 Looking at the stack and logs below, we are using an invalid CatalogTracker in 
 the tableExists function.
 Blocked stack:
 WriteHbaseThread33 prio=10 tid=0x7f76bc27a800 nid=0x2540 in 
 Object.wait() [0x7f76af4f3000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331)
  - locked 0x7f7a67817c98 (a 
 java.util.concurrent.atomic.AtomicBoolean)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
  at 
 org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:427)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:164)
  at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown 
 Source)
  at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
  - locked 0x7f7a4c5dc578 (a com.huawei.hdi.hbase.HbaseReOper)
  at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
  at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)
 In ZooKeeperNodeTracker, we don't throw the KeeperException up to the higher level,
 so at the CatalogTracker level we think the ZooKeeperNodeTracker started successfully and
 continue processing.
 [WriteHbaseThread33]2011-12-16 17:07:33,153[WARN ]  | 
 hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Unable to 
 get data of znode /hbase/root-region-server | 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:557)
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss for /hbase/root-region-server
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
  at 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
  at 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162)
  at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown 
 Source)
  at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
  at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
  at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)
 [WriteHbaseThread33]2011-12-16 17:07:33,361[ERROR]  | 
 hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Received 
 unexpected KeeperException, re-throwing exception | 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.keeperException(ZooKeeperWatcher.java:385)
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss for /hbase/root-region-server
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
  at 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
  at 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162)
  at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown 
 Source)
  at 

[jira] [Updated] (HBASE-5060) HBase client is blocked forever

2011-12-16 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5060:
--

Fix Version/s: (was: 0.90.5)

 HBase client is blocked forever
 ---

 Key: HBASE-5060
 URL: https://issues.apache.org/jira/browse/HBASE-5060
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Priority: Critical
 Fix For: 0.90.6


 The client had a temporary network failure. After it recovered,
 I found that my client thread was blocked. 
 Looking at the stack and logs below, we are using an invalid CatalogTracker in 
 the tableExists function.
 Blocked stack:
 WriteHbaseThread33 prio=10 tid=0x7f76bc27a800 nid=0x2540 in 
 Object.wait() [0x7f76af4f3000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
  at java.lang.Object.wait(Native Method)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331)
  - locked 0x7f7a67817c98 (a 
 java.util.concurrent.atomic.AtomicBoolean)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:366)
  at 
 org.apache.hadoop.hbase.catalog.MetaReader.tableExists(MetaReader.java:427)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:164)
  at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown 
 Source)
  at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
  - locked 0x7f7a4c5dc578 (a com.huawei.hdi.hbase.HbaseReOper)
  at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
  at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)
 In ZooKeeperNodeTracker, we don't throw the KeeperException up to the higher level,
 so at the CatalogTracker level we think the ZooKeeperNodeTracker started successfully and
 continue processing.
 [WriteHbaseThread33]2011-12-16 17:07:33,153[WARN ]  | 
 hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Unable to 
 get data of znode /hbase/root-region-server | 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:557)
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss for /hbase/root-region-server
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
  at 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
  at 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162)
  at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown 
 Source)
  at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
  at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
  at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)
 [WriteHbaseThread33]2011-12-16 17:07:33,361[ERROR]  | 
 hconnection-0x334129cf6890051-0x334129cf6890051-0x334129cf6890051 Received 
 unexpected KeeperException, re-throwing exception | 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.keeperException(ZooKeeperWatcher.java:385)
 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
 = ConnectionLoss for /hbase/root-region-server
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
  at 
 org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
  at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:931)
  at 
 org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:549)
  at 
 org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:73)
  at 
 org.apache.hadoop.hbase.catalog.CatalogTracker.start(CatalogTracker.java:136)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.getCatalogTracker(HBaseAdmin.java:111)
  at 
 org.apache.hadoop.hbase.client.HBaseAdmin.tableExists(HBaseAdmin.java:162)
  at com.huawei.hdi.hbase.HbaseFileOperate.checkHtableState(Unknown 
 Source)
  at com.huawei.hdi.hbase.HbaseReOper.reCreateHtable(Unknown Source)
  at com.huawei.hdi.hbase.HbaseFileOperate.writeToHbase(Unknown Source)
  at com.huawei.hdi.hbase.WriteHbaseThread.run(Unknown Source)
 [WriteHbaseThread33]2011-12-16 

[jira] [Updated] (HBASE-4970) Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch)

2011-12-13 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4970:
--

Attachment: HBASE-4970_Trunk_V2.patch

 Allow better control of resource consumption in HTable (backport HBASE-4805 
 to 0.90 branch)
 ---

 Key: HBASE-4970
 URL: https://issues.apache.org/jira/browse/HBASE-4970
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Trivial
 Fix For: 0.90.6

 Attachments: HBASE-4970_Branch90.patch, 
 HBASE-4970_Branch90_V1_trial.patch, HBASE-4970_Branch90_V2.patch, 
 HBASE-4970_Branch92_V2.patch, HBASE-4970_Trunk_V2.patch


 In my cluster, I changed keepAliveTime from 60 s to 3600 s, and the growth of 
 RES slowed down.
 Why does increasing the keepAliveTime of the HBase thread pool slow down our 
 problem (the RES value increase)?
 You can go through the source of sun.nio.ch.Util: every thread holds 3 
 soft references to direct buffers (mustangsrc) for reuse. The code names these 3 
 soft references the buffer cache. If all buffers are occupied, or none is 
 suitable in size, and a new request comes in, a new direct buffer is allocated. 
 After the request is served, the bigger buffer replaces the smaller one in the 
 buffer cache, and the replaced buffer is released.
 So I think we can add a parameter to change the keepAliveTime of the HTable 
 thread pool.
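
A hedged sketch of the proposal (the configuration key below is hypothetical and may not match the attached patches): build the HTable worker pool with a keep-alive time read from the Configuration instead of a hard-coded value, so idle threads, and the direct buffers sun.nio.ch.Util caches per thread, are released on a tunable schedule.

{code:java}
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;

public final class HTablePoolFactory {
  public static ThreadPoolExecutor createPool(Configuration conf, int maxThreads) {
    // Hypothetical key; today the keep-alive is effectively a hard-coded 60 seconds.
    long keepAliveSeconds = conf.getLong("hbase.htable.threads.keepalivetime", 60);
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        1, maxThreads, keepAliveSeconds, TimeUnit.SECONDS,
        new SynchronousQueue<Runnable>(),
        new ThreadPoolExecutor.CallerRunsPolicy());
    pool.allowCoreThreadTimeOut(true); // let even core threads (and their cached buffers) go when idle
    return pool;
  }
}
{code}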





[jira] [Updated] (HBASE-4970) Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch)

2011-12-13 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4970:
--

Attachment: HBASE-4970_Branch92_V2.patch

 Allow better control of resource consumption in HTable (backport HBASE-4805 
 to 0.90 branch)
 ---

 Key: HBASE-4970
 URL: https://issues.apache.org/jira/browse/HBASE-4970
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Trivial
 Fix For: 0.90.6

 Attachments: HBASE-4970_Branch90.patch, 
 HBASE-4970_Branch90_V1_trial.patch, HBASE-4970_Branch90_V2.patch, 
 HBASE-4970_Branch92_V2.patch, HBASE-4970_Trunk_V2.patch


 In my cluster, I changed keepAliveTime from 60 s to 3600 s, and the growth of 
 RES slowed down.
 Why does increasing the keepAliveTime of the HBase thread pool slow down our 
 problem (the RES value increase)?
 You can go through the source of sun.nio.ch.Util: every thread holds 3 
 soft references to direct buffers (mustangsrc) for reuse. The code names these 3 
 soft references the buffer cache. If all buffers are occupied, or none is 
 suitable in size, and a new request comes in, a new direct buffer is allocated. 
 After the request is served, the bigger buffer replaces the smaller one in the 
 buffer cache, and the replaced buffer is released.
 So I think we can add a parameter to change the keepAliveTime of the HTable 
 thread pool.





[jira] [Updated] (HBASE-5008) The clusters can't provide services because Region can't flush.

2011-12-11 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-5008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-5008:
--

Attachment: HBASE-5008_Branch90.patch

I made a patch. Please review.

 The clusters can't  provide services because Region can't flush.
 

 Key: HBASE-5008
 URL: https://issues.apache.org/jira/browse/HBASE-5008
 Project: HBase
  Issue Type: Bug
  Components: regionserver
Reporter: gaojinchao
Priority: Blocker
 Fix For: 0.90.6

 Attachments: HBASE-5008_Branch90.patch


 HBase version 0.90.4 + patches.
 My analysis is as follows:
 // Started splitting region b24d8ccb852ff742f2a27d01b7f5853e and closed the region.
 2011-12-10 17:32:48,653 INFO 
 org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of 
 region Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.
 2011-12-10 17:32:49,759 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
 Closing 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 disabling compactions & flushes
 2011-12-10 17:32:49,759 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Running close preflush of 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.
 // Processed a flush request but skipped it; however, flushRequested had been set to true
 2011-12-10 17:33:06,963 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
 Started memstore flush for 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e., 
 current region memstore size 12.6m
 2011-12-10 17:33:17,277 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
 Skipping flush on 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e. because 
 closing
 // The split of region b24d8ccb852ff742f2a27d01b7f5853 failed and was rolled back; 
 // the flushRequested flag remained true, so all handlers were blocked 
 2011-12-10 17:34:01,293 INFO 
 org.apache.hadoop.hbase.regionserver.SplitTransaction: Cleaned up old failed 
 split transaction detritus: 
 hdfs://193.195.18.121:9000/hbase/Htable_UFDR_004/b24d8ccb852ff742f2a27d01b7f5853e/splits
 2011-12-10 17:34:01,294 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Onlined 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.; next 
 sequenceid=15494173
 2011-12-10 17:34:01,295 INFO 
 org.apache.hadoop.hbase.regionserver.CompactSplitThread: Successful rollback 
 of failed split of 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.
 2011-12-10 17:43:10,147 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 19 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 // All handlers had been blocked. The cluster could not provide services.
 2011-12-10 17:34:01,295 INFO 
 org.apache.hadoop.hbase.regionserver.CompactSplitThread: Successful rollback 
 of failed split of 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.
 2011-12-10 17:43:10,147 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 19 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 2011-12-10 17:43:10,192 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 34 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 2011-12-10 17:43:10,193 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 51 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 2011-12-10 17:43:10,196 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 85 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 2011-12-10 17:43:10,199 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 88 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 2011-12-10 17:43:10,202 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 44 on 20020' on region 
 Htable_UFDR_004,09781,1323508582833.b24d8ccb852ff742f2a27d01b7f5853e.: 
 memstore size 384.0m is >= than blocking 384.0m size
 2011-12-10 17:43:11,663 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
 Blocking updates for 'IPC Server handler 2 on 20020' on region 
 
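For readers unfamiliar with the blocking path, here is a self-contained sketch of the mechanism the logs point at; the class, fields, and the 384 MB threshold are illustrative, not the real HRegion code. Once the memstore crosses the blocking size, every update handler waits for a flush to free space, and a flushRequested flag left true by the skipped flush means no new flush is ever queued, so the handlers never wake up:

{code}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only -- not the actual HRegion implementation.
class FlushBlockingSketch {
  private final AtomicLong memstoreSize = new AtomicLong();
  private final long blockingMemStoreSize = 384L * 1024 * 1024; // 384 MB, as in the logs
  private volatile boolean flushRequested;                      // left true by the skipped flush
  private final Object flushLock = new Object();

  /** Called by every IPC handler before it applies an update. */
  void checkResources() throws InterruptedException {
    while (memstoreSize.get() >= blockingMemStoreSize) {
      if (!flushRequested) {
        flushRequested = true;
        // ...a flush request would be enqueued here; a stale "true" skips this forever...
      }
      synchronized (flushLock) {
        flushLock.wait(1000); // "Blocking updates for 'IPC Server handler N'..."
      }
    }
  }
}
{code}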

[jira] [Updated] (HBASE-4970) Add a parameter to change keepAliveTime of Htable thread pool.

2011-12-07 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4970:
--

Attachment: HBASE-4970_Branch90.patch

 Add a parameter  to change keepAliveTime of Htable thread pool.
 ---

 Key: HBASE-4970
 URL: https://issues.apache.org/jira/browse/HBASE-4970
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Trivial
 Fix For: 0.90.5

 Attachments: HBASE-4970_Branch90.patch


 In my cluster, I changed keepAliveTime from 60 s to 3600 s, and the RES 
 increase slowed down.
 Why does increasing the keepAliveTime of the HBase thread pool slow down our 
 problem occurrence [the RES value increase]?
 You can go through the source of sun.nio.ch.Util: every thread holds 3 
 soft references to direct buffers (mustangsrc) for reuse. The code names these 
 3 soft references the buffer cache. If all buffers are occupied, or none is 
 suitable in size, and a new request comes in, a new direct buffer is allocated. 
 After the request is served, the bigger buffer replaces the smaller one in the 
 buffer cache, and the replaced buffer is released.
 So I think we can add a parameter to change the keepAliveTime of the HTable 
 thread pool.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4970) Add a parameter to change keepAliveTime of Htable thread pool.

2011-12-07 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4970:
--

Attachment: HBASE-4970_Branch90_V1_trial.patch

 Add a parameter  to change keepAliveTime of Htable thread pool.
 ---

 Key: HBASE-4970
 URL: https://issues.apache.org/jira/browse/HBASE-4970
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Trivial
 Fix For: 0.90.5

 Attachments: HBASE-4970_Branch90.patch, 
 HBASE-4970_Branch90_V1_trial.patch


 In my cluster, I changed keepAliveTime from 60 s to 3600 s, and the RES 
 increase slowed down.
 Why does increasing the keepAliveTime of the HBase thread pool slow down our 
 problem occurrence [the RES value increase]?
 You can go through the source of sun.nio.ch.Util: every thread holds 3 
 soft references to direct buffers (mustangsrc) for reuse. The code names these 
 3 soft references the buffer cache. If all buffers are occupied, or none is 
 suitable in size, and a new request comes in, a new direct buffer is allocated. 
 After the request is served, the bigger buffer replaces the smaller one in the 
 buffer cache, and the replaced buffer is released.
 So I think we can add a parameter to change the keepAliveTime of the HTable 
 thread pool.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4970) Allow better control of resource consumption in HTable (backport HBASE-4805 to 0.90 branch)

2011-12-07 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4970:
--

Summary: Allow better control of resource consumption in HTable (backport 
HBASE-4805 to 0.90 branch)  (was: Add a parameter  to change keepAliveTime of 
Htable thread pool.)

 Allow better control of resource consumption in HTable (backport HBASE-4805 
 to 0.90 branch)
 ---

 Key: HBASE-4970
 URL: https://issues.apache.org/jira/browse/HBASE-4970
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Trivial
 Fix For: 0.90.5

 Attachments: HBASE-4970_Branch90.patch, 
 HBASE-4970_Branch90_V1_trial.patch


 In my cluster, I changed keepAliveTime from 60 s to 3600 s, and the RES 
 increase slowed down.
 Why does increasing the keepAliveTime of the HBase thread pool slow down our 
 problem occurrence [the RES value increase]?
 You can go through the source of sun.nio.ch.Util: every thread holds 3 
 soft references to direct buffers (mustangsrc) for reuse. The code names these 
 3 soft references the buffer cache. If all buffers are occupied, or none is 
 suitable in size, and a new request comes in, a new direct buffer is allocated. 
 After the request is served, the bigger buffer replaces the smaller one in the 
 buffer cache, and the replaced buffer is released.
 So I think we can add a parameter to change the keepAliveTime of the HTable 
 thread pool.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4970) Add a parameter to change keepAliveTime of Htable thread pool.

2011-12-06 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4970:
--

Affects Version/s: 0.90.4
Fix Version/s: 0.90.5

 Add a parameter  to change keepAliveTime of Htable thread pool.
 ---

 Key: HBASE-4970
 URL: https://issues.apache.org/jira/browse/HBASE-4970
 Project: HBase
  Issue Type: Improvement
  Components: client
Affects Versions: 0.90.4
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Trivial
 Fix For: 0.90.5


 In my cluster, I changed keepAliveTime from 60 s to 3600 s, and the RES 
 increase slowed down.
 Why does increasing the keepAliveTime of the HBase thread pool slow down our 
 problem occurrence [the RES value increase]?
 You can go through the source of sun.nio.ch.Util: every thread holds 3 
 soft references to direct buffers (mustangsrc) for reuse. The code names these 
 3 soft references the buffer cache. If all buffers are occupied, or none is 
 suitable in size, and a new request comes in, a new direct buffer is allocated. 
 After the request is served, the bigger buffer replaces the smaller one in the 
 buffer cache, and the replaced buffer is released.
 So I think we can add a parameter to change the keepAliveTime of the HTable 
 thread pool.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4633) Potential memory leak in client RPC timeout mechanism

2011-12-04 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4633:
--

Attachment: HBaseclientstack.png

I attached the stack picture.

 Potential memory leak in client RPC timeout mechanism
 -

 Key: HBASE-4633
 URL: https://issues.apache.org/jira/browse/HBASE-4633
 Project: HBase
  Issue Type: Bug
  Components: client
Affects Versions: 0.90.3
 Environment: HBase version: 0.90.3 + Patches , Hadoop version: CDH3u0
Reporter: Shrijeet Paliwal
 Attachments: HBaseclientstack.png


 Relevant Jiras: https://issues.apache.org/jira/browse/HBASE-2937,
 https://issues.apache.org/jira/browse/HBASE-4003
 We have been using the 'hbase.client.operation.timeout' knob
 introduced in 2937 for quite some time now. It helps us enforce SLAs.
 We have two HBase clusters and two HBase client clusters. One of them
 is much busier than the other.
 We have seen a deterministic behavior of clients running in the busy
 cluster: their memory footprint increases consistently
 after they have been up for roughly 24 hours.
 This memory footprint almost doubles from its usual value (the usual case
 being RPC timeout disabled). After much investigation nothing concrete
 came out, and we had to put in a hack
 which keeps the heap size in check even when RPC timeout is enabled. Also
 note, the same behavior is not observed in the 'not so busy'
 cluster.
 The patch is here: https://gist.github.com/1288023
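
For context, the knob referred to above is a client-side configuration; a minimal usage sketch (the class name and the 30 s value are arbitrary examples, not taken from the report):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OperationTimeoutExample {
  public static void main(String[] args) {
    // Enable the client operation timeout discussed above (value in milliseconds).
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.client.operation.timeout", 30000); // arbitrary 30 s example
    // ...create HTable instances from this conf so their RPCs respect the bound...
  }
}
{code}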

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails

2011-11-25 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4868:
--

Attachment: HBASE-4868_trial.patch

 testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails
 -

 Key: HBASE-4868
 URL: https://issues.apache.org/jira/browse/HBASE-4868
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Minor
 Fix For: 0.94.0

 Attachments: HBASE-4868_trial.patch


 See: 
 https://builds.apache.org/job/HBase-TRUNK-security/7/testReport/org.apache.hadoop.hbase.util.hbck/TestOfflineMetaRebuildBase/testMetaRebuild/
 Please review and see whether the approach makes sense. 
 If it does, I will check the other cases.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails

2011-11-25 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4868:
--

Attachment: HBASE-4868_trunkv2.patch

Addressed the review comments.


 testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails
 -

 Key: HBASE-4868
 URL: https://issues.apache.org/jira/browse/HBASE-4868
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0

 Attachments: HBASE-4868_trial.patch, HBASE-4868_trunkv2.patch


 See: 
 https://builds.apache.org/job/HBase-TRUNK-security/7/testReport/org.apache.hadoop.hbase.util.hbck/TestOfflineMetaRebuildBase/testMetaRebuild/
 Please review and see whether the approach makes sense. 
 If it does, I will check the other cases.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails

2011-11-25 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4868:
--

Status: Open  (was: Patch Available)

 testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails
 -

 Key: HBASE-4868
 URL: https://issues.apache.org/jira/browse/HBASE-4868
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0

 Attachments: HBASE-4868_trial.patch, HBASE-4868_trunkv2.patch


 See: 
 https://builds.apache.org/job/HBase-TRUNK-security/7/testReport/org.apache.hadoop.hbase.util.hbck/TestOfflineMetaRebuildBase/testMetaRebuild/
 Please review and see whether the approach makes sense. 
 If it does, I will check the other cases.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4868) testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails

2011-11-25 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4868:
--

Assignee: gaojinchao
  Status: Patch Available  (was: Open)

 testMetaRebuild#TestOfflineMetaRebuildBase occasionally fails
 -

 Key: HBASE-4868
 URL: https://issues.apache.org/jira/browse/HBASE-4868
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.92.0
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0

 Attachments: HBASE-4868_trial.patch, HBASE-4868_trunkv2.patch


 See: 
 https://builds.apache.org/job/HBase-TRUNK-security/7/testReport/org.apache.hadoop.hbase.util.hbck/TestOfflineMetaRebuildBase/testMetaRebuild/
 Please review and see whether the approach makes sense. 
 If it does, I will check the other cases.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4864) testRegionTransitionOperations occasional failures

2011-11-24 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4864:
--

Attachment: HBASE-4864_Branch92.patch

 testRegionTransitionOperations occasional failures
 --

 Key: HBASE-4864
 URL: https://issues.apache.org/jira/browse/HBASE-4864
 Project: HBase
  Issue Type: Bug
  Components: test
Reporter: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0

 Attachments: HBASE-4864_Branch92.patch


 See these logs:
 https://builds.apache.org/job/HBase-TRUNK-security/ws/trunk/target/surefire-reports/
 It seems we should wait until the region is added to the online region set.
 I made a patch. Please review.
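
A sketch of the kind of wait the description suggests; the class and the isRegionOnline helper are hypothetical stand-ins for whatever check against the region server's online-region set the actual patch performs:

{code}
// Hypothetical test helper -- isRegionOnline() stands in for querying the
// region server's online-region set; this is not the attached patch.
class WaitForRegionSketch {
  static void waitUntilRegionOnline(String regionName, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!isRegionOnline(regionName)) {
      if (System.currentTimeMillis() > deadline) {
        throw new AssertionError("Region never came online: " + regionName);
      }
      Thread.sleep(100); // poll instead of asserting right after the assign call returns
    }
  }

  static boolean isRegionOnline(String regionName) {
    return false; // placeholder: ask the region server for its online regions here
  }
}
{code}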

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-23 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4739:
--

Attachment: HBASE-4739_Branch092.patch

The patch is for branch 0.92.

 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4739_trial2.patch, 4739_trialV3.patch, 
 HBASE-4739_Branch092.patch, HBASE-4739_Trunk.patch, 
 HBASE-4739_Trunk_V2.patch, HBASE-4739_V7.patch, HBASE-4739_trail5.patch, 
 HBASE-4739_trial.patch, HBASE-4739_trial6.patch


 I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when 
 the master died it had just created the RIT znode for a region but didn't 
 tell the RS to close it yet.
 When the master restarted it saw the znode and started printing this:
 {quote}
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. 
 state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for 
 too long, this should eventually complete or the server will expire, doing 
 nothing
 {quote}
 It's never going to happen, and it's blocking balancing.
 I'm marking this as minor since I believe this situation is pretty rare 
 unless you hit other bugs while trying out stuff to root bugs out.
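
One plausible direction for a fix, sketched with hypothetical names (this is not necessarily what the attached patches do): when the timeout monitor finds a region stuck in CLOSING, re-issue the close instead of doing nothing, so a close request lost when the master died is eventually retried.

{code}
import java.util.concurrent.TimeUnit;

// Illustrative only; types and method names are hypothetical, not the real AssignmentManager API.
class ClosingTimeoutSketch {
  interface CloseSender { void sendCloseRegion(String regionName); }

  private final CloseSender sender;
  private final long timeoutMs = TimeUnit.MINUTES.toMillis(10); // arbitrary threshold

  ClosingTimeoutSketch(CloseSender sender) { this.sender = sender; }

  /** Called periodically for each region in transition. */
  void checkClosing(String regionName, long closingSinceMs, long nowMs) {
    if (nowMs - closingSinceMs > timeoutMs) {
      // Rather than logging "doing nothing", retry the close so balancing is
      // not blocked forever by a request the dead master never sent.
      sender.sendCloseRegion(regionName);
    }
  }
}
{code}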

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-22 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4739:
--

Attachment: HBASE-4739_trial6.patch

Thanks for your review. Addressed all comments.


 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4739_trial2.patch, 4739_trialV3.patch, 
 HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trail5.patch, 
 HBASE-4739_trial.patch, HBASE-4739_trial6.patch


 I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when 
 the master died it had just created the RIT znode for a region but didn't 
 tell the RS to close it yet.
 When the master restarted it saw the znode and started printing this:
 {quote}
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. 
 state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for 
 too long, this should eventually complete or the server will expire, doing 
 nothing
 {quote}
 It's never going to happen, and it's blocking balancing.
 I'm marking this as minor since I believe this situation is pretty rare 
 unless you hit other bugs while trying out stuff to root bugs out.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-22 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4739:
--

Attachment: HBASE-4739_V7.patch

Addressed Ted's comment.
@Ted
Do we need to make a patch for 0.90.5? :)

 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4739_trial2.patch, 4739_trialV3.patch, 
 HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_V7.patch, 
 HBASE-4739_trail5.patch, HBASE-4739_trial.patch, HBASE-4739_trial6.patch


 I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when 
 the master died it had just created the RIT znode for a region but didn't 
 tell the RS to close it yet.
 When the master restarted it saw the znode and started printing this:
 {quote}
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. 
 state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for 
 too long, this should eventually complete or the server will expire, doing 
 nothing
 {quote}
 It's never going to happen, and it's blocking balancing.
 I'm marking this as minor since I believe this situation is pretty rare 
 unless you hit other bugs while trying out stuff to root bugs out.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-17 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4739:
--

Attachment: 4739_trialV3.patch

Trial V3 addresses Ram's comment. 

 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4739_trial2.patch, 4739_trialV3.patch, 
 HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, HBASE-4739_trial.patch


 I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when 
 the master died it had just created the RIT znode for a region but didn't 
 tell the RS to close it yet.
 When the master restarted it saw the znode and started printing this:
 {quote}
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. 
 state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for 
 too long, this should eventually complete or the server will expire, doing 
 nothing
 {quote}
 It's never going to happen, and it's blocking balancing.
 I'm marking this as minor since I believe this situation is pretty rare 
 unless you hit other bugs while trying out stuff to root bugs out.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-16 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4739:
--

Attachment: HBASE-4739_trial.patch

The trial version has not been tested yet and needs improvement.

 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch, 
 HBASE-4739_trial.patch


 I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when 
 the master died it had just created the RIT znode for a region but didn't 
 tell the RS to close it yet.
 When the master restarted it saw the znode and started printing this:
 {quote}
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. 
 state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for 
 too long, this should eventually complete or the server will expire, doing 
 nothing
 {quote}
 It's never going to happen, and it's blocking balancing.
 I'm marking this as minor since I believe this situation is pretty rare 
 unless you hit other bugs while trying out stuff to root bugs out.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-16 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4739:
--

Attachment: 4739_trial2.patch

 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: 4739_trial2.patch, HBASE-4739_Trunk.patch, 
 HBASE-4739_Trunk_V2.patch, HBASE-4739_trial.patch


 I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when 
 the master died it had just created the RIT znode for a region but didn't 
 tell the RS to close it yet.
 When the master restarted it saw the znode and started printing this:
 {quote}
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. 
 state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for 
 too long, this should eventually complete or the server will expire, doing 
 nothing
 {quote}
 It's never going to happen, and it's blocking balancing.
 I'm marking this as minor since I believe this situation is pretty rare 
 unless you hit other bugs while trying out stuff to root bugs out.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4790) TestDistributedLogSplitting occasional failure

2011-11-15 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4790:
--

Attachment: HBASE-4790_Trunk.patch

 TestDistributedLogSplitting occasional failure
 --

 Key: HBASE-4790
 URL: https://issues.apache.org/jira/browse/HBASE-4790
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.94.0
Reporter: gaojinchao
Priority: Minor
 Fix For: 0.94.0

 Attachments: HBASE-4790_Trunk.patch


 See this link:
 https://builds.apache.org/job/PreCommit-HBASE-Build/253//testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testRecoveredEdits/
 // It says that the number of regions is 0.
 2011-11-15 03:53:11,215 INFO  [Thread-2335] 
 master.TestDistributedLogSplitting(211): #regions = 0
 2011-11-15 03:53:11,215 DEBUG 
 [RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer] 
 wal.HLog$LogSyncer(1192): 
 RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer 
 interrupted while waiting for sync requests
 2011-11-15 03:53:11,215 INFO  
 [RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer] 
 wal.HLog$LogSyncer(1194): 
 RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer exiting
 2011-11-15 03:53:11,215 DEBUG [Thread-2335] wal.HLog(967): closing hlog 
 writer in 
 hdfs://localhost:46229/user/jenkins/.logs/asf001.sp2.ygridcore.net,36721,1321329179789
 2011-11-15 03:53:11,637 DEBUG [Thread-2335] master.SplitLogManager(233): 
 Scheduling batch of logs to split

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4790) TestDistributedLogSplitting occasional failure

2011-11-15 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4790:
--

Affects Version/s: (was: 0.94.0)
   0.92.0
Fix Version/s: 0.92.0
 Assignee: gaojinchao

 TestDistributedLogSplitting occasional failure
 --

 Key: HBASE-4790
 URL: https://issues.apache.org/jira/browse/HBASE-4790
 Project: HBase
  Issue Type: Bug
  Components: test
Affects Versions: 0.92.0
Reporter: gaojinchao
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0

 Attachments: HBASE-4790_Trunk.patch


 See this link:
 https://builds.apache.org/job/PreCommit-HBASE-Build/253//testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testRecoveredEdits/
 // It says that the number of regions is 0.
 2011-11-15 03:53:11,215 INFO  [Thread-2335] 
 master.TestDistributedLogSplitting(211): #regions = 0
 2011-11-15 03:53:11,215 DEBUG 
 [RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer] 
 wal.HLog$LogSyncer(1192): 
 RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer 
 interrupted while waiting for sync requests
 2011-11-15 03:53:11,215 INFO  
 [RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer] 
 wal.HLog$LogSyncer(1194): 
 RegionServer:0;asf001.sp2.ygridcore.net,36721,1321329179789.logSyncer exiting
 2011-11-15 03:53:11,215 DEBUG [Thread-2335] wal.HLog(967): closing hlog 
 writer in 
 hdfs://localhost:46229/user/jenkins/.logs/asf001.sp2.ygridcore.net,36721,1321329179789
 2011-11-15 03:53:11,637 DEBUG [Thread-2335] master.SplitLogManager(233): 
 Scheduling batch of logs to split

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-14 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4739:
--

Attachment: HBASE-4739_Trunk.patch

 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: HBASE-4739_Trunk.patch


 I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when 
 the master died it had just created the RIT znode for a region but didn't 
 tell the RS to close it yet.
 When the master restarted it saw the znode and started printing this:
 {quote}
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. 
 state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for 
 too long, this should eventually complete or the server will expire, doing 
 nothing
 {quote}
 It's never going to happen, and it's blocking balancing.
 I'm marking this as minor since I believe this situation is pretty rare 
 unless you hit other bugs while trying out stuff to root bugs out.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4739) Master dying while going to close a region can leave it in transition forever

2011-11-14 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4739:
--

Attachment: HBASE-4739_Trunk_V2.patch

 Master dying while going to close a region can leave it in transition forever
 -

 Key: HBASE-4739
 URL: https://issues.apache.org/jira/browse/HBASE-4739
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.90.4
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0, 0.94.0, 0.90.5

 Attachments: HBASE-4739_Trunk.patch, HBASE-4739_Trunk_V2.patch


 I saw this in the aftermath of HBASE-4729 on a 0.92 refreshed yesterday, when 
 the master died it had just created the RIT znode for a region but didn't 
 tell the RS to close it yet.
 When the master restarted it saw the znode and started printing this:
 {quote}
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed 
 out:  TestTable,0007560564,1320253568406.f76899564cabe7e9857c3aeb526ec9dc. 
 state=CLOSING, ts=1320253605285, server=sv4r11s38,62003,1320195046948
 2011-11-03 00:02:49,130 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Region has been CLOSING for 
 too long, this should eventually complete or the server will expire, doing 
 nothing
 {quote}
 It's never going to happen, and it's blocking balancing.
 I'm marking this as minor since I believe this situation is pretty rare 
 unless you hit other bugs while trying out stuff to root bugs out.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4577) Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB

2011-11-01 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4577:
--

Attachment: HBASE-4577_trunk.patch

 Region server reports storefileSizeMB bigger than storefileUncompressedSizeMB
 -

 Key: HBASE-4577
 URL: https://issues.apache.org/jira/browse/HBASE-4577
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Minor
 Fix For: 0.92.0

 Attachments: HBASE-4577_trial_Trunk.patch, HBASE-4577_trunk.patch


 Minor issue while looking at the RS metrics:
 bq. numberOfStorefiles=8, storefileUncompressedSizeMB=2418, 
 storefileSizeMB=2420, compressionRatio=1.0008
 I guess there's a truncation somewhere when it's adding the numbers up.
 FWIW there's no compression on that table.
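
A small sketch of the kind of truncation that could produce this (illustrative only, not the actual region server metrics code): if each store file is converted to whole megabytes before summing, up to a megabyte per file is dropped, so two totals computed over the same files in different ways can disagree by a few MB. Summing bytes first and converting once avoids the artefact.

{code}
// Illustrative only -- not the real region server metrics code.
class StoreFileSizeSketch {
  /** Lossy: truncates each file to whole MB before adding. */
  static long sumTruncatingPerFile(long[] fileSizesBytes) {
    long totalMb = 0;
    for (long bytes : fileSizesBytes) {
      totalMb += bytes / (1024 * 1024); // drops up to ~1 MB per file
    }
    return totalMb;
  }

  /** Better: add bytes first, convert to MB once at the end. */
  static long sumThenConvert(long[] fileSizesBytes) {
    long totalBytes = 0;
    for (long bytes : fileSizesBytes) {
      totalBytes += bytes;
    }
    return totalBytes / (1024 * 1024);
  }
}
{code}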

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4695) WAL logs get deleted before region server can fully flush

2011-10-31 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4695:
--

Attachment: HBASE-4695_Trunk_V2.patch

 WAL logs get deleted before region server can fully flush
 -

 Key: HBASE-4695
 URL: https://issues.apache.org/jira/browse/HBASE-4695
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.90.4
Reporter: jack levin
Assignee: gaojinchao
Priority: Blocker
 Fix For: 0.90.5

 Attachments: HBASE-4695_Trunk_V2.patch, 
 HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt


 To replicate the problem do the following:
 1. Check the /hbase/.logs/ directory to see if you have WAL logs for the 
 region server you are shutting down.
 2. Execute kill pid (where pid is a regionserver pid).
 3. Watch the regionserver log for flushing to start; you will see how many 
 regions are left to flush:
 09:36:54,665 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting 
 on 489 regions to close
 09:56:35,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting 
 on 116 regions to close
 4. Check /hbase/.logs/ -- you will notice that it has disappeared.
 5. Check namenode logs:
 09:26:41,607 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: 
 ugi=root ip=/10.101.1.5 cmd=delete 
 src=/hbase/.logs/rdaa5.prod.imageshack.com,60020,1319749
 Note that if you kill -9 the RS now and it crashes on flush, you won't have 
 any WAL logs to replay. We need to make sure that logs are deleted or moved 
 out only when the RS has fully flushed. Otherwise it's possible to lose data.
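
A sketch of the ordering constraint the last paragraph asks for, with hypothetical method names (not the real HRegionServer shutdown code): the WAL should only be closed and its files archived or deleted after every region has finished flushing and closing.

{code}
import java.io.IOException;

// Illustrative only; method names are hypothetical stand-ins for the RS shutdown path.
class ShutdownOrderSketch {
  void shutdown() throws IOException, InterruptedException {
    stopAcceptingWrites();
    waitOnAllRegionsToClose(); // every region flushes its memstore and closes first
    closeAndArchiveWAL();      // only now is it safe to remove the server's /hbase/.logs entry
  }

  void stopAcceptingWrites() {}
  void waitOnAllRegionsToClose() throws InterruptedException {}
  void closeAndArchiveWAL() throws IOException {}
}
{code}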

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4695) WAL logs get deleted before region server can fully flush

2011-10-31 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4695:
--

Attachment: HBASE-4695_Branch90_V2.patch

 WAL logs get deleted before region server can fully flush
 -

 Key: HBASE-4695
 URL: https://issues.apache.org/jira/browse/HBASE-4695
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.90.4
Reporter: jack levin
Assignee: gaojinchao
Priority: Blocker
 Fix For: 0.92.0, 0.90.5

 Attachments: HBASE-4695_Branch90_V2.patch, HBASE-4695_Trunk_V2.patch, 
 HBASE-4695_branch90_trial.patch, hbase-4695-0.92.txt


 To replicate the problem do the following:
 1. Check the /hbase/.logs/ directory to see if you have WAL logs for the 
 region server you are shutting down.
 2. Execute kill pid (where pid is a regionserver pid).
 3. Watch the regionserver log for flushing to start; you will see how many 
 regions are left to flush:
 09:36:54,665 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting 
 on 489 regions to close
 09:56:35,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting 
 on 116 regions to close
 4. Check /hbase/.logs/ -- you will notice that it has disappeared.
 5. Check namenode logs:
 09:26:41,607 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: 
 ugi=root ip=/10.101.1.5 cmd=delete 
 src=/hbase/.logs/rdaa5.prod.imageshack.com,60020,1319749
 Note that if you kill -9 the RS now and it crashes on flush, you won't have 
 any WAL logs to replay. We need to make sure that logs are deleted or moved 
 out only when the RS has fully flushed. Otherwise it's possible to lose data.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4695) WAL logs get deleted before region server can fully flush

2011-10-29 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4695:
--

Attachment: HBASE-4695_branch90_trial.patch

I will verify this patch when I am back at the company.
If you are free, please review it first.

The patch seems simple.


 WAL logs get deleted before region server can fully flush
 -

 Key: HBASE-4695
 URL: https://issues.apache.org/jira/browse/HBASE-4695
 Project: HBase
  Issue Type: Bug
  Components: wal
Affects Versions: 0.90.4
Reporter: jack levin
Assignee: gaojinchao
Priority: Blocker
 Fix For: 0.90.5

 Attachments: HBASE-4695_branch90_trial.patch


 To replicate the problem do the following:
 1. Check the /hbase/.logs/ directory to see if you have WAL logs for the 
 region server you are shutting down.
 2. Execute kill pid (where pid is a regionserver pid).
 3. Watch the regionserver log for flushing to start; you will see how many 
 regions are left to flush:
 09:36:54,665 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting 
 on 489 regions to close
 09:56:35,779 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting 
 on 116 regions to close
 4. Check /hbase/.logs/ -- you will notice that it has disappeared.
 5. Check namenode logs:
 09:26:41,607 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: 
 ugi=root ip=/10.101.1.5 cmd=delete 
 src=/hbase/.logs/rdaa5.prod.imageshack.com,60020,1319749
 Note that if you kill -9 the RS now and it crashes on flush, you won't have 
 any WAL logs to replay. We need to make sure that logs are deleted or moved 
 out only when the RS has fully flushed. Otherwise it's possible to lose data.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4578) NPE when altering a table that has moving regions

2011-10-24 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4578:
--

Attachment: HBASE-4578_Trunk_V1.patch

 NPE when altering a table that has moving regions
 -

 Key: HBASE-4578
 URL: https://issues.apache.org/jira/browse/HBASE-4578
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jean-Daniel Cryans
Assignee: gaojinchao
Priority: Blocker
 Fix For: 0.92.0

 Attachments: HBASE-4578_Trunk_V1.patch, HBASE-4578_trial_Trunk.patch


 I'm still not 100% sure of the source of this error, but here's what I was 
 able to get twice while altering a table that was doing a bunch of splits:
 {quote}
 2011-10-11 23:48:59,344 INFO 
 org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handled SPLIT 
 report); 
 parent=TestTable,0002608338,1318376880454.a75d6815fdfc513fb1c8aabe086c6763. 
 daughter 
 a=TestTable,0002608338,1318376938764.ef170ff6cd8695dc8aec92e542dc9ac1.daughter
  b=TestTable,0003301408,1318376938764.36eb2530341bd46888ede312c5559b5d.
 2011-10-11 23:49:09,579 DEBUG 
 org.apache.hadoop.hbase.master.handler.TableEventHandler: Ignoring table not 
 disabled exception for supporting online schema changes.
 2011-10-11 23:49:09,580 INFO 
 org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table 
 operation C_M_MODIFY_TABLE on table TestTable
 2011-10-11 23:49:09,612 INFO org.apache.hadoop.hbase.util.FSUtils: 
 TableInfoPath = hdfs://sv4r11s38:9100/hbase/TestTable/.tableinfo tmpPath = 
 hdfs://sv4r11s38:9100/hbase/TestTable/.tmp/.tableinfo.1318376949612
 2011-10-11 23:49:09,692 INFO org.apache.hadoop.hbase.util.FSUtils: 
 TableDescriptor stored. TableInfoPath = 
 hdfs://sv4r11s38:9100/hbase/TestTable/.tableinfo
 2011-10-11 23:49:09,693 INFO org.apache.hadoop.hbase.util.FSUtils: Updated 
 tableinfo=hdfs://sv4r11s38:9100/hbase/TestTable/.tableinfo to blah
 2011-10-11 23:49:09,695 INFO 
 org.apache.hadoop.hbase.master.handler.TableEventHandler: Bucketing regions 
 by region server...
 2011-10-11 23:49:09,695 DEBUG org.apache.hadoop.hbase.client.MetaScanner: 
 Scanning .META. starting at row=TestTable,,00 for max=2147483647 
 rows
 2011-10-11 23:49:09,709 DEBUG 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
 The connection to hconnection-0x132f043bbde02e9 has been closed.
 2011-10-11 23:49:09,709 ERROR org.apache.hadoop.hbase.executor.EventHandler: 
 Caught throwable while processing event C_M_MODIFY_TABLE
 java.lang.NullPointerException
   at java.util.TreeMap.getEntry(TreeMap.java:324)
   at java.util.TreeMap.containsKey(TreeMap.java:209)
   at 
 org.apache.hadoop.hbase.master.handler.TableEventHandler.reOpenAllRegions(TableEventHandler.java:114)
   at 
 org.apache.hadoop.hbase.master.handler.TableEventHandler.process(TableEventHandler.java:90)
   at 
 org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:168)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {quote}
 The first time the shell reported that all the regions were updated 
 correctly, the second time it got stuck for a while:
 {quote}
 6/14 regions updated.
 0/14 regions updated.
 ...
 0/14 regions updated.
 2/16 regions updated.
 ...
 2/16 regions updated.
 8/9 regions updated.
 ...
 8/9 regions updated.
 {quote}
 After which I killed it, redid the alter and it worked.
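
Given the NPE comes from using a region's server location as a TreeMap key while the region is moving, a guard of roughly this shape is one way to picture the problem (purely illustrative; plain String types stand in for the real region and server classes, and this is not claimed to be the attached patch):

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: String stands in for the real region/server types.
class BucketRegionsSketch {
  static Map<String, List<String>> bucketByServer(Map<String, String> regionToServer) {
    Map<String, List<String>> serverToRegions = new TreeMap<String, List<String>>();
    for (Map.Entry<String, String> e : regionToServer.entrySet()) {
      String server = e.getValue();
      if (server == null) {
        // Region is in transition (e.g. mid-split/move); skip it rather than
        // handing a null key to the TreeMap, which throws the NPE seen above.
        continue;
      }
      List<String> regions = serverToRegions.get(server);
      if (regions == null) {
        serverToRegions.put(server, regions = new ArrayList<String>());
      }
      regions.add(e.getKey());
    }
    return serverToRegions;
  }
}
{code}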

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4578) NPE when altering a table that has moving regions

2011-10-21 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4578:
--

Attachment: HBASE-4578_trial_Trunk.patch

 NPE when altering a table that has moving regions
 -

 Key: HBASE-4578
 URL: https://issues.apache.org/jira/browse/HBASE-4578
 Project: HBase
  Issue Type: Bug
Affects Versions: 0.92.0
Reporter: Jean-Daniel Cryans
Priority: Blocker
 Fix For: 0.92.0

 Attachments: HBASE-4578_trial_Trunk.patch


 I'm still not 100% sure of the source of this error, but here's what I was 
 able to get twice while altering a table that was doing a bunch of splits:
 {quote}
 2011-10-11 23:48:59,344 INFO 
 org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handled SPLIT 
 report); 
 parent=TestTable,0002608338,1318376880454.a75d6815fdfc513fb1c8aabe086c6763. 
 daughter 
 a=TestTable,0002608338,1318376938764.ef170ff6cd8695dc8aec92e542dc9ac1.daughter
  b=TestTable,0003301408,1318376938764.36eb2530341bd46888ede312c5559b5d.
 2011-10-11 23:49:09,579 DEBUG 
 org.apache.hadoop.hbase.master.handler.TableEventHandler: Ignoring table not 
 disabled exception for supporting online schema changes.
 2011-10-11 23:49:09,580 INFO 
 org.apache.hadoop.hbase.master.handler.TableEventHandler: Handling table 
 operation C_M_MODIFY_TABLE on table TestTable
 2011-10-11 23:49:09,612 INFO org.apache.hadoop.hbase.util.FSUtils: 
 TableInfoPath = hdfs://sv4r11s38:9100/hbase/TestTable/.tableinfo tmpPath = 
 hdfs://sv4r11s38:9100/hbase/TestTable/.tmp/.tableinfo.1318376949612
 2011-10-11 23:49:09,692 INFO org.apache.hadoop.hbase.util.FSUtils: 
 TableDescriptor stored. TableInfoPath = 
 hdfs://sv4r11s38:9100/hbase/TestTable/.tableinfo
 2011-10-11 23:49:09,693 INFO org.apache.hadoop.hbase.util.FSUtils: Updated 
 tableinfo=hdfs://sv4r11s38:9100/hbase/TestTable/.tableinfo to blah
 2011-10-11 23:49:09,695 INFO 
 org.apache.hadoop.hbase.master.handler.TableEventHandler: Bucketing regions 
 by region server...
 2011-10-11 23:49:09,695 DEBUG org.apache.hadoop.hbase.client.MetaScanner: 
 Scanning .META. starting at row=TestTable,,00 for max=2147483647 
 rows
 2011-10-11 23:49:09,709 DEBUG 
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: 
 The connection to hconnection-0x132f043bbde02e9 has been closed.
 2011-10-11 23:49:09,709 ERROR org.apache.hadoop.hbase.executor.EventHandler: 
 Caught throwable while processing event C_M_MODIFY_TABLE
 java.lang.NullPointerException
   at java.util.TreeMap.getEntry(TreeMap.java:324)
   at java.util.TreeMap.containsKey(TreeMap.java:209)
   at 
 org.apache.hadoop.hbase.master.handler.TableEventHandler.reOpenAllRegions(TableEventHandler.java:114)
   at 
 org.apache.hadoop.hbase.master.handler.TableEventHandler.process(TableEventHandler.java:90)
   at 
 org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:168)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {quote}
 The first time the shell reported that all the regions were updated 
 correctly, the second time it got stuck for a while:
 {quote}
 6/14 regions updated.
 0/14 regions updated.
 ...
 0/14 regions updated.
 2/16 regions updated.
 ...
 2/16 regions updated.
 8/9 regions updated.
 ...
 8/9 regions updated.
 {quote}
 After which I killed it, redid the alter and it worked.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4219) Add Per-Column Family Metrics

2011-10-20 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4219:
--

Attachment: HBASE-4219_fixtestcasefail.patch
HBASE-4219_fixtestcasefail.patch

This fixes the TestHFileBlock failure.

[INFO] Surefire report directory: /opt/g56562/trunk/target/surefire-reports

---
 T E S T S
---

---
 T E S T S
---
Running org.apache.hadoop.hbase.io.hfile.TestHFileBlock
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 27.732 sec

Results :

Tests run: 7, Failures: 0, Errors: 0, Skipped: 0


 Add Per-Column Family Metrics
 -

 Key: HBASE-4219
 URL: https://issues.apache.org/jira/browse/HBASE-4219
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.92.0
Reporter: Nicolas Spiegelberg
Assignee: David Goode
 Fix For: 0.92.0

 Attachments: 4219-v2.txt, 4219-v3.txt, HBASE-4219-v4.patch, 
 HBASE-4219-v5.patch, HBASE-4219_fixtestcasefail.patch, 
 HBASE-4219_percfmetrics_1.patch


 Right now, we have region server level statistics.  However, the read/write 
 flow varies a lot based on the column family involved.  We should add 
 dynamic, per column family metrics to JMX so we can track each column family 
 individually.
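
As a rough illustration of "dynamic, per column family metrics" (the key format and registry below are assumptions, not the project's actual metrics classes), the idea is to fold the column family name into the metric key so each family gets its own counter:

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only -- a stand-in for whatever dynamic metrics registry the patch uses.
class PerColumnFamilyMetricsSketch {
  private final Map<String, AtomicLong> counters = new ConcurrentHashMap<String, AtomicLong>();

  /** Creates the counter on first use, so new column families appear dynamically. */
  void incr(String key) {
    AtomicLong c = counters.get(key);
    if (c == null) {
      AtomicLong created = new AtomicLong();
      AtomicLong prev = counters.putIfAbsent(key, created);
      c = (prev != null) ? prev : created;
    }
    c.incrementAndGet();
  }

  void onRead(String columnFamily) {
    incr("cf." + columnFamily + ".readRequestCount"); // key format is an assumption
  }
}
{code}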

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4219) Add Per-Column Family Metrics

2011-10-20 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4219:
--

Attachment: (was: HBASE-4219_fixtestcasefail.patch)

 Add Per-Column Family Metrics
 -

 Key: HBASE-4219
 URL: https://issues.apache.org/jira/browse/HBASE-4219
 Project: HBase
  Issue Type: New Feature
Affects Versions: 0.92.0
Reporter: Nicolas Spiegelberg
Assignee: David Goode
 Fix For: 0.92.0

 Attachments: 4219-v2.txt, 4219-v3.txt, HBASE-4219-v4.patch, 
 HBASE-4219-v5.patch, HBASE-4219_fixtestcasefail.patch, 
 HBASE-4219_percfmetrics_1.patch


 Right now, we have region server level statistics.  However, the read/write 
 flow varies a lot based on the column family involved.  We should add 
 dynamic, per column family metrics to JMX so we can track each column family 
 individually.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (HBASE-4580) Some invalid zk nodes were created when a clean cluster restarts

2011-10-19 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4580:
--

Attachment: HBASE-4580_TrunkV3.patch

 Some invalid zk nodes were created when a clean cluster restarts
 

 Key: HBASE-4580
 URL: https://issues.apache.org/jira/browse/HBASE-4580
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.92.0

 Attachments: HBASE-4580_TrunkV1.patch, HBASE-4580_TrunkV2.patch, 
 HBASE-4580_TrunkV3.patch


 The logs below show that an invalid zk node was created when a clean cluster 
 was restarted: the master mistakenly believed that the regions belonged to a 
 dead server. (A minimal sketch of the missing check follows the log excerpt.)
 2011-10-11 05:05:29,127 INFO org.apache.hadoop.hbase.master.HMaster: Meta 
 updated status = true
 2011-10-11 05:05:29,127 INFO org.apache.hadoop.hbase.master.HMaster: 
 ROOT/Meta already up-to date with new HRI.
 2011-10-11 05:05:29,151 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 771d63e9327383159553619a4f2dc74f with OFFLINE state
 2011-10-11 05:05:29,161 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 3cf860dd323fe6360f571aeafc129f95 with OFFLINE state
 2011-10-11 05:05:29,170 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 4065350214452a9d5c55243c734bef08 with OFFLINE state
 2011-10-11 05:05:29,178 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 4e81613f82a39fc6e5e89f96e7b3ccc4 with OFFLINE state
 2011-10-11 05:05:29,187 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 e21b9e1545a28953aba0098fda5c9cd9 with OFFLINE state
 2011-10-11 05:05:29,195 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 5cd9f55eecd43d088bbd505f6795131f with OFFLINE state
 2011-10-11 05:05:29,229 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 db5f641452a70b09b85a92970e4198c7 with OFFLINE state
 2011-10-11 05:05:29,237 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 a7b20a653919e7f41bfb2ed349af7d21 with OFFLINE state
 2011-10-11 05:05:29,253 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 c9385619425f737eab1a6624d2e097a8 with OFFLINE state
 // we cleaned all zk nodes.
 2011-10-11 05:05:29,262 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Clean cluster startup. 
 Assigning userregions
 2011-10-11 05:05:29,262 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Deleting any existing unassigned nodes
 2011-10-11 05:05:29,367 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 9 region(s) 
 across 1 server(s), retainAssignment=true
 2011-10-11 05:05:29,369 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Timeout-on-RIT=9000
 2011-10-11 05:05:29,369 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 9 region(s) 
 to C3S3,54366,1318323920153
 2011-10-11 05:05:29,369 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 771d63e9327383159553619a4f2dc74f with OFFLINE state
 2011-10-11 05:05:29,371 INFO org.apache.hadoop.hbase.master.HMaster: Master 
 has completed initialization
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 3cf860dd323fe6360f571aeafc129f95 with OFFLINE state
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 4065350214452a9d5c55243c734bef08 with OFFLINE state
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 4e81613f82a39fc6e5e89f96e7b3ccc4 with OFFLINE state
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 e21b9e1545a28953aba0098fda5c9cd9 with OFFLINE state
 2011-10-11 05:05:29,372 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
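
 Purely to make the missing check concrete (the real fix is in the attached
 HBASE-4580 patches), here is a minimal sketch, assuming the master knows
 whether this is a clean startup and whether a region's previous server is
 actually dead; the class and method names are hypothetical, not the
 AssignmentManager API.

 /** Hypothetical guard suggested by the logs above: on a clean cluster
  *  startup, no OFFLINE unassigned znodes should be created for regions whose
  *  previous server simply has not re-registered yet. */
 public class CleanStartupGuard {

   /** Returns true only when an OFFLINE unassigned node is really needed. */
   static boolean shouldCreateOfflineNode(boolean cleanClusterStartup,
                                          boolean serverKnownDead) {
     if (cleanClusterStartup) {
       // Clean startup assigns every region from scratch, so pre-creating
       // "dead server" OFFLINE nodes only leaves stale znodes behind.
       return false;
     }
     return serverKnownDead;
   }

   public static void main(String[] args) {
     System.out.println(shouldCreateOfflineNode(true, true));   // false: clean startup
     System.out.println(shouldCreateOfflineNode(false, true));  // true: genuine failover
   }
 }

 On a genuine failover the guard still allows the OFFLINE node; on a clean
 startup it avoids the stale znodes shown in the log above.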
 

[jira] [Updated] (HBASE-4580) Some invalid zk nodes were created when a clean cluster restarts

2011-10-18 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4580:
--

Attachment: HBASE-4580_TrunkV2.patch

V2 has been reviewed.

 Some invalid zk nodes were created when a clean cluster restarts
 

 Key: HBASE-4580
 URL: https://issues.apache.org/jira/browse/HBASE-4580
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.92.0

 Attachments: HBASE-4580_TrunkV1.patch, HBASE-4580_TrunkV2.patch


 The logs below show that an invalid zk node was created when a clean cluster 
 was restarted: the master mistakenly believed that the regions belonged to a 
 dead server.
 2011-10-11 05:05:29,127 INFO org.apache.hadoop.hbase.master.HMaster: Meta 
 updated status = true
 2011-10-11 05:05:29,127 INFO org.apache.hadoop.hbase.master.HMaster: 
 ROOT/Meta already up-to date with new HRI.
 2011-10-11 05:05:29,151 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 771d63e9327383159553619a4f2dc74f with OFFLINE state
 2011-10-11 05:05:29,161 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 3cf860dd323fe6360f571aeafc129f95 with OFFLINE state
 2011-10-11 05:05:29,170 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 4065350214452a9d5c55243c734bef08 with OFFLINE state
 2011-10-11 05:05:29,178 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 4e81613f82a39fc6e5e89f96e7b3ccc4 with OFFLINE state
 2011-10-11 05:05:29,187 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 e21b9e1545a28953aba0098fda5c9cd9 with OFFLINE state
 2011-10-11 05:05:29,195 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 5cd9f55eecd43d088bbd505f6795131f with OFFLINE state
 2011-10-11 05:05:29,229 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 db5f641452a70b09b85a92970e4198c7 with OFFLINE state
 2011-10-11 05:05:29,237 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 a7b20a653919e7f41bfb2ed349af7d21 with OFFLINE state
 2011-10-11 05:05:29,253 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 c9385619425f737eab1a6624d2e097a8 with OFFLINE state
 // we cleaned all zk nodes.
 2011-10-11 05:05:29,262 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Clean cluster startup. 
 Assigning userregions
 2011-10-11 05:05:29,262 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Deleting any existing unassigned nodes
 2011-10-11 05:05:29,367 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 9 region(s) 
 across 1 server(s), retainAssignment=true
 2011-10-11 05:05:29,369 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Timeout-on-RIT=9000
 2011-10-11 05:05:29,369 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 9 region(s) 
 to C3S3,54366,1318323920153
 2011-10-11 05:05:29,369 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 771d63e9327383159553619a4f2dc74f with OFFLINE state
 2011-10-11 05:05:29,371 INFO org.apache.hadoop.hbase.master.HMaster: Master 
 has completed initialization
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 3cf860dd323fe6360f571aeafc129f95 with OFFLINE state
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 4065350214452a9d5c55243c734bef08 with OFFLINE state
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 4e81613f82a39fc6e5e89f96e7b3ccc4 with OFFLINE state
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 e21b9e1545a28953aba0098fda5c9cd9 with OFFLINE state
 2011-10-11 05:05:29,372 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

2011-10-16 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4511:
--

Priority: Minor  (was: Critical)

 There is data loss when master failovers
 

 Key: HBASE-4511
 URL: https://issues.apache.org/jira/browse/HBASE-4511
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Minor
 Fix For: 0.92.0

 Attachments: 
 org.apache.hadoop.hbase.master.TestMasterFailover-output.rar


 It goes like this:
 The master crashed; at the same time, the RS hosting .META. was crashing but 
 had not yet exited.
 The master started up again and found all the RSs still registered as live.
 The master's verification of .META. failed because that RS was crashing.
 The master reassigned .META. without splitting the old server's HLog.
 So some .META. data was lost.
 The logs below are from the failing failover test case.
 (A minimal sketch of the intended recovery ordering follows the log excerpt.)
 //Here the test kills an RS.
 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): 
 STOPPED: Killing for unit test
 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
 RS 192.168.2.102,54385,1317264874629 killed 
 //The RS did not actually exit.
 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 master.HMaster(458): Registering server found up in zk: 
 192.168.2.102,54385,1317264874629
 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 master.ServerManager(232): Registering 
 server=192.168.2.102,54385,1317264874629
 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of 
 znode /hbase/unassigned/1028785192 because node does not exist (not an error)
 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) 
 of data from znode /hbase/root-region-server and set watcher; 
 192.168.2.102,54383,131726487...
 //.META. verification failed and .META. was reassigned, so the regions 
 recorded in .META. were lost.
 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
 address=192.168.2.102,54385,1317264874629; 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
 192.168.2.102,54385,1317264874629 not running, aborting
 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(316): new .META. server: 
 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) 
 of data from znode /hbase/root-region-server and set watcher; 
 192.168.2.102,54383,131726487...
 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
 address=192.168.2.102,54385,1317264874629; 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
 192.168.2.102,54385,1317264874629 not running, aborting
 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(316): new .META. server: 
 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) 
 of data from znode /hbase/root-region-server and set watcher; 
 192.168.2.102,54383,131726487...
 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
 address=192.168.2.102,54385,1317264874629; 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
 192.168.2.102,54385,1317264874629 not running, aborting
 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(316): new .META. server: 
 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or 
 updating) unassigned node for 1028785192 with OFFLINE state
 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] 
 zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received 
 ZooKeeper Event, type=NodeCreated, state=SyncConnected, 
 path=/hbase/unassigned/1028785192
 //It said 
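
 Purely to make the reported ordering concrete, here is a minimal sketch under
 the assumption of a tiny recovery interface; the interface and method names
 are hypothetical, not the actual HMaster/MasterFileSystem API.

 /** Hypothetical sketch: when .META. verification fails, split the old
  *  server's HLog before reassigning .META. so that edits still sitting in
  *  the log are replayed rather than lost. */
 interface MetaRecovery {
   boolean verifyMetaRegionLocation(String serverName);
   void splitLogsOf(String serverName);
   void assignMetaRegion();
 }

 class MetaFailoverHandler {
   private final MetaRecovery recovery;

   MetaFailoverHandler(MetaRecovery recovery) {
     this.recovery = recovery;
   }

   void recoverMeta(String oldMetaServer) {
     if (!recovery.verifyMetaRegionLocation(oldMetaServer)) {
       // The RS that hosted .META. is unusable: split (replay) its HLog
       // first, otherwise the edits described in this report are dropped.
       recovery.splitLogsOf(oldMetaServer);
       recovery.assignMetaRegion();
     }
   }
 }

 The point is only the ordering: recover the unreachable server's HLog before
 .META. is reassigned, so edits that existed only in the log are not lost.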

[jira] [Updated] (HBASE-4580) Create some invalid zk nodes when a clean cluster start.

2011-10-11 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4580:
--

Attachment: HBASE-4580_TrunkV1.patch

 Create some invalid zk nodes when a clean cluster start.
 

 Key: HBASE-4580
 URL: https://issues.apache.org/jira/browse/HBASE-4580
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Assignee: gaojinchao
 Fix For: 0.92.0

 Attachments: HBASE-4580_TrunkV1.patch


 The logs below show that an invalid zk node was created when a clean cluster 
 was restarted: the master mistakenly believed that the regions belonged to a 
 dead server.
 2011-10-11 05:05:29,127 INFO org.apache.hadoop.hbase.master.HMaster: Meta 
 updated status = true
 2011-10-11 05:05:29,127 INFO org.apache.hadoop.hbase.master.HMaster: 
 ROOT/Meta already up-to date with new HRI.
 2011-10-11 05:05:29,151 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 771d63e9327383159553619a4f2dc74f with OFFLINE state
 2011-10-11 05:05:29,161 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 3cf860dd323fe6360f571aeafc129f95 with OFFLINE state
 2011-10-11 05:05:29,170 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 4065350214452a9d5c55243c734bef08 with OFFLINE state
 2011-10-11 05:05:29,178 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 4e81613f82a39fc6e5e89f96e7b3ccc4 with OFFLINE state
 2011-10-11 05:05:29,187 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 e21b9e1545a28953aba0098fda5c9cd9 with OFFLINE state
 2011-10-11 05:05:29,195 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 5cd9f55eecd43d088bbd505f6795131f with OFFLINE state
 2011-10-11 05:05:29,229 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 db5f641452a70b09b85a92970e4198c7 with OFFLINE state
 2011-10-11 05:05:29,237 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 a7b20a653919e7f41bfb2ed349af7d21 with OFFLINE state
 2011-10-11 05:05:29,253 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Creating (or updating) unassigned node for 
 c9385619425f737eab1a6624d2e097a8 with OFFLINE state
 // we cleaned all zk nodes.
 2011-10-11 05:05:29,262 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Clean cluster startup. 
 Assigning userregions
 2011-10-11 05:05:29,262 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Deleting any existing unassigned nodes
 2011-10-11 05:05:29,367 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 9 region(s) 
 across 1 server(s), retainAssignment=true
 2011-10-11 05:05:29,369 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Timeout-on-RIT=9000
 2011-10-11 05:05:29,369 DEBUG 
 org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 9 region(s) 
 to C3S3,54366,1318323920153
 2011-10-11 05:05:29,369 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning done
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 771d63e9327383159553619a4f2dc74f with OFFLINE state
 2011-10-11 05:05:29,371 INFO org.apache.hadoop.hbase.master.HMaster: Master 
 has completed initialization
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 3cf860dd323fe6360f571aeafc129f95 with OFFLINE state
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 4065350214452a9d5c55243c734bef08 with OFFLINE state
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 4e81613f82a39fc6e5e89f96e7b3ccc4 with OFFLINE state
 2011-10-11 05:05:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 e21b9e1545a28953aba0098fda5c9cd9 with OFFLINE state
 2011-10-11 05:05:29,372 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: 
 master:58198-0x132f23a9a38 Async create of unassigned node for 
 

[jira] [Updated] (HBASE-4511) There is data loss when master failovers

2011-09-29 Thread gaojinchao (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/HBASE-4511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gaojinchao updated HBASE-4511:
--

Attachment: org.apache.hadoop.hbase.master.TestMasterFailover-output.rar

Some logs were provided by Ted.

 There is data loss when master failovers
 

 Key: HBASE-4511
 URL: https://issues.apache.org/jira/browse/HBASE-4511
 Project: HBase
  Issue Type: Bug
  Components: master
Affects Versions: 0.92.0
Reporter: gaojinchao
Priority: Critical
 Fix For: 0.92.0

 Attachments: 
 org.apache.hadoop.hbase.master.TestMasterFailover-output.rar


 It goes like this:
 The master crashed; at the same time, the RS hosting .META. was crashing but 
 had not yet exited.
 The master started up again and found all the RSs still registered as live.
 The master's verification of .META. failed because that RS was crashing.
 The master reassigned .META. without splitting the old server's HLog.
 So some .META. data was lost.
 The logs below are from the failing failover test case.
 //Here the test kills an RS.
 2011-09-28 19:54:45,694 INFO  [Thread-988] regionserver.HRegionServer(1443): 
 STOPPED: Killing for unit test
 2011-09-28 19:54:45,694 INFO  [Thread-988] master.TestMasterFailover(1007): 
 RS 192.168.2.102,54385,1317264874629 killed 
 //The RS did not actually exit.
 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 master.HMaster(458): Registering server found up in zk: 
 192.168.2.102,54385,1317264874629
 2011-09-28 19:54:51,763 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 master.ServerManager(232): Registering 
 server=192.168.2.102,54385,1317264874629
 2011-09-28 19:54:51,770 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKUtil(491): master:54557-0x132b31adbb30005 Unable to get data of 
 znode /hbase/unassigned/1028785192 because node does not exist (not an error)
 2011-09-28 19:54:51,771 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) 
 of data from znode /hbase/root-region-server and set watcher; 
 192.168.2.102,54383,131726487...
 //.META. verification failed and .META. was reassigned, so the regions 
 recorded in .META. were lost.
 2011-09-28 19:54:51,773 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
 address=192.168.2.102,54385,1317264874629; 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
 192.168.2.102,54385,1317264874629 not running, aborting
 2011-09-28 19:54:51,773 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(316): new .META. server: 
 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
 2011-09-28 19:54:52,274 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) 
 of data from znode /hbase/root-region-server and set watcher; 
 192.168.2.102,54383,131726487...
 2011-09-28 19:54:52,277 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
 address=192.168.2.102,54385,1317264874629; 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
 192.168.2.102,54385,1317264874629 not running, aborting
 2011-09-28 19:54:52,277 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(316): new .META. server: 
 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
 2011-09-28 19:54:52,778 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKUtil(1003): master:54557-0x132b31adbb30005 Retrieved 33 byte(s) 
 of data from znode /hbase/root-region-server and set watcher; 
 192.168.2.102,54383,131726487...
 2011-09-28 19:54:52,782 INFO  [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(476): Failed verification of .META.,,1 at 
 address=192.168.2.102,54385,1317264874629; 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: 
 org.apache.hadoop.hbase.regionserver.RegionServerStoppedException: Server 
 192.168.2.102,54385,1317264874629 not running, aborting
 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 catalog.CatalogTracker(316): new .META. server: 
 192.168.2.102,54385,1317264874629 isn't valid. Cached .META. server: null
 2011-09-28 19:54:52,782 DEBUG [Master:0;192.168.2.102,54557,1317264885720] 
 zookeeper.ZKAssign(264): master:54557-0x132b31adbb30005 Creating (or 
 updating) unassigned node for 1028785192 with OFFLINE state
 2011-09-28 19:54:52,825 DEBUG [Thread-988-EventThread] 
 zookeeper.ZooKeeperWatcher(233): master:54557-0x132b31adbb30005 Received 
 ZooKeeper Event,