[jira] [Assigned] (HBASE-5840) Open Region FAILED_OPEN doesn't clear the TaskMonitor Status, keeps showing the old status
[ https://issues.apache.org/jira/browse/HBASE-5840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5840: - Assignee: ramkrishna.s.vasudevan Open Region FAILED_OPEN doesn't clear the TaskMonitor Status, keeps showing the old status -- Key: HBASE-5840 URL: https://issues.apache.org/jira/browse/HBASE-5840 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.94.0 Reporter: Gopinathan A Assignee: ramkrishna.s.vasudevan Fix For: 0.96.0, 0.94.1 TaskMonitor Status will not be cleared in case Regions FAILED_OPEN. This will keeps showing old status. This will miss leads the user. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5809) Avoid move api to take the destination server same as the source server.
[ https://issues.apache.org/jira/browse/HBASE-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5809: - Assignee: rajeshbabu Avoid move api to take the destination server same as the source server. Key: HBASE-5809 URL: https://issues.apache.org/jira/browse/HBASE-5809 Project: HBase Issue Type: Improvement Affects Versions: 0.92.1 Reporter: ramkrishna.s.vasudevan Assignee: rajeshbabu Priority: Minor Labels: patch Fix For: 0.96.0 Attachments: HBASE-5809.patch In Move currently we take any destination specified and if the destination is same as the source we still do unassign and assign. Here we can have problems due to RegionAlreadyInTransitionException and thus hanging the region in RIT for long time. We can avoid this scenario by not allowing the move to happen in this scenario. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5545) region can't be opened for a long time. Because the creating File failed.
[ https://issues.apache.org/jira/browse/HBASE-5545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5545: - Assignee: ramkrishna.s.vasudevan (was: gaojinchao) region can't be opened for a long time. Because the creating File failed. - Key: HBASE-5545 URL: https://issues.apache.org/jira/browse/HBASE-5545 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 0.90.6 Reporter: gaojinchao Assignee: ramkrishna.s.vasudevan Fix For: 0.90.7, 0.92.2, 0.94.0 Attachments: HBASE-5545.patch, HBASE-5545.patch Scenario: 1. File is created 2. But while writing data, all datanodes might have crashed. So writing data will fail. 3. Now even if close is called in finally block, close also will fail and throw the Exception because writing data failed. 4. After this if RS try to create the same file again, then AlreadyBeingCreatedException will come. Suggestion to handle this scenario. --- 1. Check for the existence of the file, if exists delete the file and create new file. Here delete call for the file will not check whether the file is open or closed. Overwrite Option: 1. Overwrite option will be applicable if you are trying to overwrite a closed file. 2. If the file is not closed, then even with overwrite option Same AlreadyBeingCreatedException will be thrown. This is the expected behaviour to avoid the Multiple clients writing to same file. Region server logs: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/test1/12c01902324218d14b17a5880f24f64b/.tmp/.regioninfo for DFSClient_hb_rs_158-1-131-48,20020,1331107668635_1331107669061_-252463556_25 on client 158.1.132.19 because current leaseholder is trying to recreate file. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1570) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1440) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:1382) at org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:658) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:547) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1137) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1133) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1131) at org.apache.hadoop.ipc.Client.call(Client.java:961) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:245) at $Proxy6.create(Unknown Source) at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at $Proxy6.create(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.init(DFSClient.java:3643) at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:778) at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:364) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:630) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:611) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:518) at org.apache.hadoop.hbase.regionserver.HRegion.checkRegioninfoOnFilesystem(HRegion.java:424) at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:340) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2672) at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:2658) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:330) at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:116) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:158) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) [2012-03-07 20:51:45,858] [WARN ] [RS_OPEN_REGION-158-1-131-48,20020,1331107668635-23] [com.huawei.isap.ump.ha.client.RPCRetryAndSwitchInvoker 131] Retrying the method call: public abstract void
[jira] [Assigned] (HBASE-5816) Balancer and ServerShutdownHandler concurrently reassigning the same region
[ https://issues.apache.org/jira/browse/HBASE-5816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5816: - Assignee: ramkrishna.s.vasudevan Balancer and ServerShutdownHandler concurrently reassigning the same region --- Key: HBASE-5816 URL: https://issues.apache.org/jira/browse/HBASE-5816 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.6 Reporter: Maryann Xue Assignee: ramkrishna.s.vasudevan Priority: Critical Attachments: HBASE-5816.patch The first assign thread exits with success after updating the RegionState to PENDING_OPEN, while the second assign follows immediately into assign and fails the RegionState check in setOfflineInZooKeeper(). This causes the master to abort. In the below case, the two concurrent assigns occurred when AM tried to assign a region to a dying/dead RS, and meanwhile the ShutdownServerHandler tried to assign this region (from the region plan) spontaneously. 2012-04-17 05:44:57,648 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=hadoop05.sh.intel.com,60020,1334544902186, dest=xmlqa-clv16.sh.intel.com,60020,1334612497253 2012-04-17 05:44:57,648 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. (offlining) 2012-04-17 05:44:57,648 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=hadoop05.sh.intel.com,60020,1334544902186, load=(requests=0, regions=0, usedHeap=0, maxHeap=0) for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. 2012-04-17 05:44:57,666 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/fe38fe31caf40b6e607a3e6bbed6404b (region=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., server=hadoop05.sh.intel.com,60020,1334544902186, state=RS_ZK_REGION_CLOSING) 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. state=CLOSED, ts=1334612697672, server=hadoop05.sh.intel.com,60020,1334544902186 2012-04-17 05:52:58,984 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:6-0x236b912e9b3000e Creating (or updating) unassigned node for fe38fe31caf40b6e607a3e6bbed6404b with OFFLINE state 2012-04-17 05:52:59,096 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b.; plan=hri=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b., src=hadoop05.sh.intel.com,60020,1334544902186, dest=xmlqa-clv16.sh.intel.com,60020,1334612497253 2012-04-17 05:52:59,096 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to xmlqa-clv16.sh.intel.com,60020,1334612497253 2012-04-17 05:54:19,159 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. state=PENDING_OPEN, ts=1334613179096, server=xmlqa-clv16.sh.intel.com,60020,1334612497253 2012-04-17 05:54:59,033 WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of TABLE_ORDER_CUSTOMER,,1334017820846.fe38fe31caf40b6e607a3e6bbed6404b. to serverName=xmlqa-clv16.sh.intel.com,60020,1334612497253, load=(requests=0, regions=0, usedHeap=0, maxHeap=0), trying to assign elsewhere instead; retry=0 java.net.SocketTimeoutException: Call to /10.239.47.87:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 12 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.239.47.89:41302 remote=/10.239.47.87:60020] at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:805) at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:778) at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:283) at $Proxy7.openRegion(Unknown Source) at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:573) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1127) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:912) at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:892)
[jira] [Assigned] (HBASE-5782) Not all the regions are getting assigned after the log splitting.
[ https://issues.apache.org/jira/browse/HBASE-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5782: - Assignee: ramkrishna.s.vasudevan Not all the regions are getting assigned after the log splitting. - Key: HBASE-5782 URL: https://issues.apache.org/jira/browse/HBASE-5782 Project: HBase Issue Type: Bug Components: wal Affects Versions: 0.94.0 Reporter: Gopinathan A Assignee: ramkrishna.s.vasudevan Priority: Critical Fix For: 0.94.0 Create a table with 1000 splits, after the region assignemnt, kill the regionserver wich contains META table. Here few regions are missing after the log splitting and region assigment. HBCK report shows multiple region holes are got created. Same scenario was verified mulitple times in 0.92.1, no issues. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5651) [findbugs] Address wait/notify synchronization/inconsistency in sync
[ https://issues.apache.org/jira/browse/HBASE-5651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5651: - Assignee: ramkrishna.s.vasudevan [findbugs] Address wait/notify synchronization/inconsistency in sync Key: HBASE-5651 URL: https://issues.apache.org/jira/browse/HBASE-5651 Project: HBase Issue Type: Sub-task Components: scripts Reporter: Jonathan Hsieh Assignee: ramkrishna.s.vasudevan See https://builds.apache.org/job/PreCommit-HBASE-Build/1313//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html#Warnings_MT_CORRECTNESS fix classes IS,LI,MWM, NN, SWL, UG, UW -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5595) Fix NoSuchMethodException in 0.92 when running on local filesystem
[ https://issues.apache.org/jira/browse/HBASE-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5595: - Assignee: Uma Mahesh Thanks for the patch Uma. Added you to the contributor list. Fix NoSuchMethodException in 0.92 when running on local filesystem -- Key: HBASE-5595 URL: https://issues.apache.org/jira/browse/HBASE-5595 Project: HBase Issue Type: Bug Reporter: stack Assignee: Uma Mahesh Priority: Critical Fix For: 0.92.2 Attachments: HBASE-5595.patch Fix this ugly exception that shows when running 0.92.1 when on local filesystem: {code} 2012-03-16 10:54:48,351 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: getNumCurrentReplicas--HDFS-826 not available; hdfs_out=org.apache.hadoop.fs.FSDataOutputStream@301abf87 java.lang.NoSuchMethodException: org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.getNumCurrentReplicas() at java.lang.Class.getDeclaredMethod(Class.java:1937) at org.apache.hadoop.hbase.regionserver.wal.HLog.getGetNumCurrentReplicas(HLog.java:425) at org.apache.hadoop.hbase.regionserver.wal.HLog.init(HLog.java:408) at org.apache.hadoop.hbase.regionserver.wal.HLog.init(HLog.java:331) at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateHLog(HRegionServer.java:1229) at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1218) at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:937) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:648) at java.lang.Thread.run(Thread.java:680) {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5595) Fix NoSuchMethodException in 0.92 when running on local filesystem
[ https://issues.apache.org/jira/browse/HBASE-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5595: - Assignee: Uma Maheswara Rao G (was: Uma Mahesh) Fix NoSuchMethodException in 0.92 when running on local filesystem -- Key: HBASE-5595 URL: https://issues.apache.org/jira/browse/HBASE-5595 Project: HBase Issue Type: Bug Reporter: stack Assignee: Uma Maheswara Rao G Priority: Critical Fix For: 0.92.2 Attachments: HBASE-5595.patch Fix this ugly exception that shows when running 0.92.1 when on local filesystem: {code} 2012-03-16 10:54:48,351 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: getNumCurrentReplicas--HDFS-826 not available; hdfs_out=org.apache.hadoop.fs.FSDataOutputStream@301abf87 java.lang.NoSuchMethodException: org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.getNumCurrentReplicas() at java.lang.Class.getDeclaredMethod(Class.java:1937) at org.apache.hadoop.hbase.regionserver.wal.HLog.getGetNumCurrentReplicas(HLog.java:425) at org.apache.hadoop.hbase.regionserver.wal.HLog.init(HLog.java:408) at org.apache.hadoop.hbase.regionserver.wal.HLog.init(HLog.java:331) at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateHLog(HRegionServer.java:1229) at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1218) at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:937) at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:648) at java.lang.Thread.run(Thread.java:680) {code} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5531) Maven hadoop profile (version 23) needs to be updated with latest 23 snapshot
[ https://issues.apache.org/jira/browse/HBASE-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5531: - Assignee: ramkrishna.s.vasudevan Maven hadoop profile (version 23) needs to be updated with latest 23 snapshot - Key: HBASE-5531 URL: https://issues.apache.org/jira/browse/HBASE-5531 Project: HBase Issue Type: Bug Components: build Affects Versions: 0.92.2 Reporter: Laxman Assignee: ramkrishna.s.vasudevan Labels: build Fix For: 0.92.2, 0.94.0, 0.96.0 Attachments: HBASE-5531-trunk.patch, HBASE-5531.patch Current profile is still pointing to 0.23.1-SNAPSHOT. This is failing to build as 23.1 is already released and snapshot is not available anymore. We can update this to 0.23.2-SNAPSHOT. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5531) Maven hadoop profile (version 23) needs to be updated with latest 23 snapshot
[ https://issues.apache.org/jira/browse/HBASE-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5531: - Assignee: Laxman (was: ramkrishna.s.vasudevan) Maven hadoop profile (version 23) needs to be updated with latest 23 snapshot - Key: HBASE-5531 URL: https://issues.apache.org/jira/browse/HBASE-5531 Project: HBase Issue Type: Bug Components: build Affects Versions: 0.92.2 Reporter: Laxman Assignee: Laxman Labels: build Fix For: 0.92.2, 0.94.0, 0.96.0 Attachments: HBASE-5531-trunk.patch, HBASE-5531.patch Current profile is still pointing to 0.23.1-SNAPSHOT. This is failing to build as 23.1 is already released and snapshot is not available anymore. We can update this to 0.23.2-SNAPSHOT. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5510) Change in LB.randomAssignment(ListServerName servers) API
[ https://issues.apache.org/jira/browse/HBASE-5510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5510: - Assignee: ramkrishna.s.vasudevan Change in LB.randomAssignment(ListServerName servers) API --- Key: HBASE-5510 URL: https://issues.apache.org/jira/browse/HBASE-5510 Project: HBase Issue Type: Improvement Affects Versions: 0.92.0 Reporter: Anoop Sam John Assignee: ramkrishna.s.vasudevan In LB there is randomAssignment(ListServerName servers) API which will be used by AM to assign a region from a down RS. [This will be also used in other cases like call to assign() API from client] I feel it would be better to pass the HRegionInfo also into this method. When the LB making a choice for a region assignment, when one RS is down, it would be nice that the LB knows for which region it is doing this server selection. +Scenario+ While one RS down, we wanted the regions to get moved to other RSs but a set of regions stay together. We are having custom load balancer but with the current way of LB interface this is not possible. Another way is I can allow a random assignment of the regions at the RS down time. Later with a cluster balance I can balance the regions as I need. But this might make regions assign 1st to one RS and then again move to another. Also for some time period my business use case can not get satisfied. Also I have seen some issue in JIRA which speaks about making sure that Root and META regions always sit in some specific RSs. With the current LB API this wont be possible in future. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5323) Need to handle assertion error while splitting log through ServerShutDownHandler by shutting down the master
[ https://issues.apache.org/jira/browse/HBASE-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5323: - Assignee: ramkrishna.s.vasudevan Need to handle assertion error while splitting log through ServerShutDownHandler by shutting down the master Key: HBASE-5323 URL: https://issues.apache.org/jira/browse/HBASE-5323 Project: HBase Issue Type: Bug Affects Versions: 0.90.5 Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan Fix For: 0.94.0, 0.90.7 We know that while parsing the HLog we expect the proper length from HDFS. In WALReaderFSDataInputStream {code} assert(realLength = this.length); {code} We are trying to come out if the above condition is not satisfied. But if SSH.splitLog() gets this problem then it lands in the run method of EventHandler. This kills the SSH thread and so further assignment does not happen. If ROOT and META are to be assigned they cannot be. I think in this condition we abort the master by catching such exceptions. Please do suggest. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5120) Timeout monitor races with table disable handler
[ https://issues.apache.org/jira/browse/HBASE-5120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5120: - Assignee: ramkrishna.s.vasudevan Timeout monitor races with table disable handler Key: HBASE-5120 URL: https://issues.apache.org/jira/browse/HBASE-5120 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Zhihong Yu Assignee: ramkrishna.s.vasudevan Priority: Blocker Fix For: 0.94.0, 0.92.1 Attachments: HBASE-5120.patch, HBASE-5120_1.patch, HBASE-5120_2.patch, HBASE-5120_3.patch, HBASE-5120_4.patch, HBASE-5120_5.patch Here is what J-D described here: https://issues.apache.org/jira/browse/HBASE-5119?focusedCommentId=13179176page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13179176 I think I will retract from my statement that it used to be extremely racy and caused more troubles than it fixed, on my first test I got a stuck region in transition instead of being able to recover. The timeout was set to 2 minutes to be sure I hit it. First the region gets closed {quote} 2012-01-04 00:16:25,811 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r5s38,62023,1325635980913 for region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. {quote} 2 minutes later it times out: {quote} 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636185810, server=null 2012-01-04 00:18:30,026 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 2012-01-04 00:18:30,027 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. (offlining) {quote} 100ms later the master finally gets the event: {quote} 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r5s38,62023,1325635980913, region=1a4b111bcc228043e89f59c4c3f6a791, which is more than 15 seconds late 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 1a4b111bcc228043e89f59c4c3f6a791 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. 2012-01-04 00:18:30,129 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Deleting existing unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 that is in expected state RS_ZK_REGION_CLOSED 2012-01-04 00:18:30,166 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Successfully deleted unassigned node for region 1a4b111bcc228043e89f59c4c3f6a791 in expected state RS_ZK_REGION_CLOSED {quote} At this point everything is fine, the region was processed as closed. But wait, remember that line where it said it was going to force an unassign? {quote} 2012-01-04 00:18:30,322 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db03587 Creating unassigned node for 1a4b111bcc228043e89f59c4c3f6a791 in a CLOSING state 2012-01-04 00:18:30,328 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server null returned java.lang.NullPointerException: Passed server is null for 1a4b111bcc228043e89f59c4c3f6a791 {quote} Now the master is confused, it recreated the RIT znode but the region doesn't even exist anymore. It even tries to shut it down but is blocked by NPEs. Now this is what's going on. The late ZK notification that the znode was deleted (but it got recreated after): {quote} 2012-01-04 00:19:33,285 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: The znode of region test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. has been deleted. {quote} Then it prints this, and much later tries to unassign it again: {quote} 2012-01-04 00:19:46,607 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition; test1,089cd0c9,1325635015491.1a4b111bcc228043e89f59c4c3f6a791. state=PENDING_CLOSE, ts=1325636310328, server=null ... 2012-01-04 00:20:39,623 DEBUG org.apache.hadoop.hbase.master.handler.DeleteTableHandler: Waiting on region to clear regions in transition;
[jira] [Assigned] (HBASE-4988) MetaServer crash cause all splitting regionserver abort
[ https://issues.apache.org/jira/browse/HBASE-4988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-4988: - Assignee: chunhui shen MetaServer crash cause all splitting regionserver abort --- Key: HBASE-4988 URL: https://issues.apache.org/jira/browse/HBASE-4988 Project: HBase Issue Type: Bug Reporter: chunhui shen Assignee: chunhui shen Attachments: hbase-4988v1.patch If metaserver crash now, All the splitting regionserver will abort theirself. Becasue the code {code} this.journal.add(JournalEntry.PONR); MetaEditor.offlineParentInMeta(server.getCatalogTracker(), this.parent.getRegionInfo(), a.getRegionInfo(), b.getRegionInfo()); {code} If the JournalEntry is PONR, split's roll back will abort itselef. It is terrible in huge putting environment when metaserver crash -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5121) MajorCompaction may affect scan's correctness
[ https://issues.apache.org/jira/browse/HBASE-5121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5121: - Assignee: chunhui shen MajorCompaction may affect scan's correctness - Key: HBASE-5121 URL: https://issues.apache.org/jira/browse/HBASE-5121 Project: HBase Issue Type: Bug Components: regionserver Reporter: chunhui shen Assignee: chunhui shen Attachments: hbase-5121.patch In our test, there are two families' keyvalue for one row. But we could find a infrequent problem when doing scan's next if majorCompaction happens concurrently. In the client's two continuous doing scan.next(): 1.First time, scan's next returns the result where family A is null. 2.Second time, scan's next returns the result where family B is null. The two next()'s result have the same row. If there are more families, I think the scenario will be more strange... We find the reason is that storescanner.peek() is changed after majorCompaction if there are delete type KeyValue. This change causes the PriorityQueueKeyValueScanner of RegionScanner's heap is not sure to be sorted. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5097) RegionObserver implementation whose preScannerOpen and postScannerOpen Impl return null can stall the system initialization through NPE
[ https://issues.apache.org/jira/browse/HBASE-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5097: - Assignee: ramkrishna.s.vasudevan RegionObserver implementation whose preScannerOpen and postScannerOpen Impl return null can stall the system initialization through NPE --- Key: HBASE-5097 URL: https://issues.apache.org/jira/browse/HBASE-5097 Project: HBase Issue Type: Bug Components: coprocessors Reporter: ramkrishna.s.vasudevan Assignee: ramkrishna.s.vasudevan In HRegionServer.java openScanner() {code} r.prepareScanner(scan); RegionScanner s = null; if (r.getCoprocessorHost() != null) { s = r.getCoprocessorHost().preScannerOpen(scan); } if (s == null) { s = r.getScanner(scan); } if (r.getCoprocessorHost() != null) { s = r.getCoprocessorHost().postScannerOpen(scan, s); } {code} If we dont have implemention for postScannerOpen the RegionScanner is null and so throwing nullpointer {code} java.lang.NullPointerException at java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:881) at org.apache.hadoop.hbase.regionserver.HRegionServer.addScanner(HRegionServer.java:2282) at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:2272) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1326) {code} Making this defect as blocker.. Pls feel free to change the priority if am wrong. Also correct me if my way of trying out coprocessors without implementing postScannerOpen is wrong. Am just a learner. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-5086) Reopening a region on a RS can leave it in PENDING_OPEN
[ https://issues.apache.org/jira/browse/HBASE-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-5086: - Assignee: ramkrishna.s.vasudevan Reopening a region on a RS can leave it in PENDING_OPEN --- Key: HBASE-5086 URL: https://issues.apache.org/jira/browse/HBASE-5086 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: ramkrishna.s.vasudevan Fix For: 0.92.1 I got this twice during the same test. If the region servers are slow enough and you run an online alter, it's possible for the RS to change the znode status to CLOSED and have the master send an OPEN before the region server is able to remove the region from it's list of RITs. This is what the master sees: {quote} 011-12-21 22:24:09,498 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. (offlining) 2011-12-21 22:24:09,498 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db033f7 Creating unassigned node for 43123e2e3fc83ec25fe2a76b4f09077f in a CLOSING state 2011-12-21 22:24:09,524 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to sv4r25s44,62023,1324494325099 for region test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. 2011-12-21 22:24:15,656 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=sv4r25s44,62023,1324494325099, region=43123e2e3fc83ec25fe2a76b4f09077f 2011-12-21 22:24:15,656 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 43123e2e3fc83ec25fe2a76b4f09077f 2011-12-21 22:24:15,656 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. state=CLOSED, ts=1324506255629, server=sv4r25s44,62023,1324494325099 2011-12-21 22:24:15,656 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x134589d3db033f7 Creating (or updating) unassigned node for 43123e2e3fc83ec25fe2a76b4f09077f with OFFLINE state 2011-12-21 22:24:15,663 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Found an existing plan for test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. destination server is + sv4r25s44,62023,1324494325099 2011-12-21 22:24:15,663 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Using pre-existing plan for region test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f.; plan=hri=test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f., src=, dest=sv4r25s44,62023,1324494325099 2011-12-21 22:24:15,663 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. to sv4r25s44,62023,1324494325099 2011-12-21 22:24:15,664 ERROR org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment in: sv4r25s44,62023,1324494325099 due to org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: Received:OPEN for the region:test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. ,which we are already trying to CLOSE. {quote} After that the master abandons. And the region server: {quote} 2011-12-21 22:24:09,523 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received close region: test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. 2011-12-21 22:24:09,523 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. 2011-12-21 22:24:09,524 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f.: disabling compactions flushes 2011-12-21 22:24:09,524 INFO org.apache.hadoop.hbase.regionserver.HRegion: Running close preflush of test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f. 2011-12-21 22:24:09,524 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f., current region memstore size 40.5m 2011-12-21 22:24:09,524 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting test1,db6db6b4,1324501004642.43123e2e3fc83ec25fe2a76b4f09077f., commencing wait for mvcc, flushsize=42482936 2011-12-21 22:24:13,368 DEBUG org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://sv4r11s38:9100/hbase/test1/43123e2e3fc83ec25fe2a76b4f09077f/.tmp/87d6944c54c7417e9a34a9f9542bcb72 to
[jira] [Assigned] (HBASE-4951) master process can not be stopped when it is initializing
[ https://issues.apache.org/jira/browse/HBASE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-4951: - Assignee: ramkrishna.s.vasudevan master process can not be stopped when it is initializing - Key: HBASE-4951 URL: https://issues.apache.org/jira/browse/HBASE-4951 Project: HBase Issue Type: Bug Components: master Affects Versions: 0.90.3 Reporter: xufeng Assignee: ramkrishna.s.vasudevan Priority: Critical Fix For: 0.92.0, 0.90.5 It is easy to reproduce by following step: step1:start master process.(do not start regionserver process in the cluster). the master will wait the regionserver to check in: org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s) to checkin step2:stop the master by sh command bin/hbase master stop result:the master process will never die because catalogTracker.waitForRoot() method will block unitl the root region assigned. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-4729) Clash between region unassign and splitting kills the master
[ https://issues.apache.org/jira/browse/HBASE-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-4729: - Assignee: stack (was: ramkrishna.s.vasudevan) Assigning to your name Stack. Clash between region unassign and splitting kills the master Key: HBASE-4729 URL: https://issues.apache.org/jira/browse/HBASE-4729 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: stack Priority: Critical Fix For: 0.92.0, 0.94.0 Attachments: 4729-v2.txt, 4729-v3.txt, 4729-v4.txt, 4729-v5.txt, 4729-v6-092.txt, 4729-v6-trunk.txt, 4729.txt I was running an online alter while regions were splitting, and suddenly the master died and left my table half-altered (haven't restarted the master yet). What killed the master: {quote} 2011-11-02 17:06:44,428 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected ZK exception creating node CLOSING org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /hbase/unassigned/f7e1783e65ea8d621a4bc96ad310f101 at org.apache.zookeeper.KeeperException.create(KeeperException.java:110) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:459) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:441) at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndWatch(ZKUtil.java:769) at org.apache.hadoop.hbase.zookeeper.ZKAssign.createNodeClosing(ZKAssign.java:568) at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1722) at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1661) at org.apache.hadoop.hbase.master.BulkReOpen$1.run(BulkReOpen.java:69) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {quote} A znode was created because the region server was splitting the region 4 seconds before: {quote} 2011-11-02 17:06:40,704 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region TestTable,0012469153,1320253135043.f7e1783e65ea8d621a4bc96ad310f101. 2011-11-02 17:06:40,704 DEBUG org.apache.hadoop.hbase.regionserver.SplitTransaction: regionserver:62023-0x132f043bbde0710 Creating ephemeral node for f7e1783e65ea8d621a4bc96ad310f101 in SPLITTING state 2011-11-02 17:06:40,751 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde0710 Attempting to transition node f7e1783e65ea8d621a4bc96ad310f101 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLITTING ... 2011-11-02 17:06:44,061 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde0710 Successfully transitioned node f7e1783e65ea8d621a4bc96ad310f101 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLIT 2011-11-02 17:06:44,061 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the master to process the split for f7e1783e65ea8d621a4bc96ad310f101 {quote} Now that the master is dead the region server is spewing those last two lines like mad. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-4880) Region is on service before completing openRegionHanlder, may cause data loss
[ https://issues.apache.org/jira/browse/HBASE-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-4880: - Assignee: chunhui shen Region is on service before completing openRegionHanlder, may cause data loss - Key: HBASE-4880 URL: https://issues.apache.org/jira/browse/HBASE-4880 Project: HBase Issue Type: Bug Reporter: chunhui shen Assignee: chunhui shen Attachments: hbase-4880.patch OpenRegionHandler in regionserver is processed as the following steps: {code} 1.openregion()(Through it, closed = false, closing = false) 2.addToOnlineRegions(region) 3.update .meta. table 4.update ZK's node state to RS_ZK_REGION_OPEND {code} We can find that region is on service before Step 4. It means client could put data to this region after step 3. What will happen if step 4 is failed processing? It will execute OpenRegionHandler#cleanupFailedOpen which will do closing region, and master assign this region to another regionserver. If closing region is failed, the data which is put between step 3 and step 4 may loss, because the region has been opend on another regionserver and be put new data. Therefore, it may not be recoverd through replayRecoveredEdit() because the edit's LogSeqId is smaller than current region SeqId. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-4841) If I call split fast enough, while inserting, rows disappear.
[ https://issues.apache.org/jira/browse/HBASE-4841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-4841: - Assignee: ramkrishna.s.vasudevan If I call split fast enough, while inserting, rows disappear. -- Key: HBASE-4841 URL: https://issues.apache.org/jira/browse/HBASE-4841 Project: HBase Issue Type: Bug Reporter: Alex Newman Assignee: ramkrishna.s.vasudevan Priority: Critical Attachments: 1, log, log2 I'll attach a unit test for this. Basically if you call split, while inserting data you can get to the point to where the cluster becomes unstable, or rows will disappear. The unit test gives you some flexibility of: - How many rows - How wide the rows are - The frequency of the split. The default settings crash unit tests or cause the unit tests to fail on my laptop. On my macbook air, i could actually turn down the number of total rows, and the frequency of the splits which is surprising. I think this is because the macbook air has much better IO than my backup acer. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-4308) Race between RegionOpenedHandler and AssignmentManager
[ https://issues.apache.org/jira/browse/HBASE-4308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-4308: - Assignee: ramkrishna.s.vasudevan Race between RegionOpenedHandler and AssignmentManager -- Key: HBASE-4308 URL: https://issues.apache.org/jira/browse/HBASE-4308 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Todd Lipcon Assignee: ramkrishna.s.vasudevan Fix For: 0.92.0 When the master is processing a ZK event for REGION_OPENED, it calls delete() on the znode before it removes the node from RegionsInTransition. If the notification of that delete comes back into AssignmentManager before the region is removed from RIT, you see an error like: 2011-08-30 17:43:29,537 WARN [main-EventThread] master.AssignmentManager(861): Node deleted but still in RIT: .META.,,1.1028785192 state=OPEN, ts=1314751409532, server=todd-w510,55655,1314751396840 Not certain if it causes issues, but it's a concerning log message. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-4796) Race between SplitRegionHandlers for the same region kills the master
[ https://issues.apache.org/jira/browse/HBASE-4796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-4796: - Assignee: ramkrishna.s.vasudevan Race between SplitRegionHandlers for the same region kills the master - Key: HBASE-4796 URL: https://issues.apache.org/jira/browse/HBASE-4796 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: ramkrishna.s.vasudevan Fix For: 0.92.0, 0.94.0 I just saw that multiple SplitRegionHandlers can be created for the same region because of the RS tickling, but it becomes deadly when more than 1 are trying to delete the znode at the same time: {quote} 2011-11-16 02:25:28,778 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_SPLIT, server=sv4r7s38,62023,1321410237387, region=f80b6a904048a99ce88d61420b8906d1 2011-11-16 02:25:28,780 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_SPLIT, server=sv4r7s38,62023,1321410237387, region=f80b6a904048a99ce88d61420b8906d1 2011-11-16 02:25:28,796 DEBUG org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handling SPLIT event for f80b6a904048a99ce88d61420b8906d1; deleting node 2011-11-16 02:25:28,798 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x132f043bbde094b Deleting existing unassigned node for f80b6a904048a99ce88d61420b8906d1 that is in expected state RS_ZK_REGION_SPLIT 2011-11-16 02:25:28,804 DEBUG org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handling SPLIT event for f80b6a904048a99ce88d61420b8906d1; deleting node 2011-11-16 02:25:28,806 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x132f043bbde094b Deleting existing unassigned node for f80b6a904048a99ce88d61420b8906d1 that is in expected state RS_ZK_REGION_SPLIT 2011-11-16 02:25:28,821 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x132f043bbde094b Successfully deleted unassigned node for region f80b6a904048a99ce88d61420b8906d1 in expected state RS_ZK_REGION_SPLIT 2011-11-16 02:25:28,821 INFO org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handled SPLIT report); parent=TestTable,006304,1321409743253.f80b6a904048a99ce88d61420b8906d1. daughter a=TestTable,006304,1321410325564.e0f5d201683bcabe14426817224334b8.daughter b=TestTable,007054,1321410325564.1b82eeb5d230c47ccc51c08256134839. 2011-11-16 02:25:28,829 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Node /hbase/unassigned/f80b6a904048a99ce88d61420b8906d1 already deleted, and this is not a retry 2011-11-16 02:25:28,830 FATAL org.apache.hadoop.hbase.master.HMaster: Error deleting SPLIT node in ZK for transition ZK node (f80b6a904048a99ce88d61420b8906d1) org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/unassigned/f80b6a904048a99ce88d61420b8906d1 at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:107) at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:884) at org.apache.hadoop.hbase.zookeeper.ZKAssign.deleteNode(ZKAssign.java:506) at org.apache.hadoop.hbase.zookeeper.ZKAssign.deleteNode(ZKAssign.java:453) at org.apache.hadoop.hbase.master.handler.SplitRegionHandler.process(SplitRegionHandler.java:95) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:168) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {quote} Stack and I came up with the solution that we need just manage that exception because handleSplitReport is an in-memory thing. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-4729) Race between online altering and splitting kills the master
[ https://issues.apache.org/jira/browse/HBASE-4729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-4729: - Assignee: ramkrishna.s.vasudevan Race between online altering and splitting kills the master --- Key: HBASE-4729 URL: https://issues.apache.org/jira/browse/HBASE-4729 Project: HBase Issue Type: Bug Affects Versions: 0.92.0 Reporter: Jean-Daniel Cryans Assignee: ramkrishna.s.vasudevan Fix For: 0.92.0, 0.94.0 I was running an online alter while regions were splitting, and suddenly the master died and left my table half-altered (haven't restarted the master yet). What killed the master: {quote} 2011-11-02 17:06:44,428 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected ZK exception creating node CLOSING org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /hbase/unassigned/f7e1783e65ea8d621a4bc96ad310f101 at org.apache.zookeeper.KeeperException.create(KeeperException.java:110) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.createNonSequential(RecoverableZooKeeper.java:459) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.create(RecoverableZooKeeper.java:441) at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndWatch(ZKUtil.java:769) at org.apache.hadoop.hbase.zookeeper.ZKAssign.createNodeClosing(ZKAssign.java:568) at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1722) at org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1661) at org.apache.hadoop.hbase.master.BulkReOpen$1.run(BulkReOpen.java:69) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) {quote} A znode was created because the region server was splitting the region 4 seconds before: {quote} 2011-11-02 17:06:40,704 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of region TestTable,0012469153,1320253135043.f7e1783e65ea8d621a4bc96ad310f101. 2011-11-02 17:06:40,704 DEBUG org.apache.hadoop.hbase.regionserver.SplitTransaction: regionserver:62023-0x132f043bbde0710 Creating ephemeral node for f7e1783e65ea8d621a4bc96ad310f101 in SPLITTING state 2011-11-02 17:06:40,751 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde0710 Attempting to transition node f7e1783e65ea8d621a4bc96ad310f101 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLITTING ... 2011-11-02 17:06:44,061 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde0710 Successfully transitioned node f7e1783e65ea8d621a4bc96ad310f101 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLIT 2011-11-02 17:06:44,061 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the master to process the split for f7e1783e65ea8d621a4bc96ad310f101 {quote} Now that the master is dead the region server is spewing those last two lines like mad. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-4462) Properly treating SocketTimeoutException
[ https://issues.apache.org/jira/browse/HBASE-4462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-4462: - Assignee: ramkrishna.s.vasudevan Properly treating SocketTimeoutException Key: HBASE-4462 URL: https://issues.apache.org/jira/browse/HBASE-4462 Project: HBase Issue Type: Improvement Affects Versions: 0.90.4 Reporter: Jean-Daniel Cryans Assignee: ramkrishna.s.vasudevan Fix For: 0.90.5 SocketTimeoutException is currently treated like any IOE inside of HCM.getRegionServerWithRetries and I think this is a problem. This method should only do retries in cases where we are pretty sure the operation will complete, but with STE we already waited for (by default) 60 seconds and nothing happened. I found this while debugging Douglas Campbell's problem on the mailing list where it seemed like he was using the same scanner from multiple threads, but actually it was just the same client doing retries while the first run didn't even finish yet (that's another problem). You could see the first scanner, then up to two other handlers waiting for it to finish in order to run (because of the synchronization on RegionScanner). So what should we do? We could treat STE as a DoNotRetryException and let the client deal with it, or we could retry only once. There's also the option of having a different behavior for get/put/icv/scan, the issue with operations that modify a cell is that you don't know if the operation completed or not (same when a RS dies hard after completing let's say a Put but just before returning to the client). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (HBASE-4459) HbaseObjectWritable code is a byte, we will eventually run out of codes
[ https://issues.apache.org/jira/browse/HBASE-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ramkrishna.s.vasudevan reassigned HBASE-4459: - Assignee: ramkrishna.s.vasudevan HbaseObjectWritable code is a byte, we will eventually run out of codes --- Key: HBASE-4459 URL: https://issues.apache.org/jira/browse/HBASE-4459 Project: HBase Issue Type: Bug Components: io Reporter: Jonathan Gray Assignee: ramkrishna.s.vasudevan Priority: Critical Fix For: 0.92.0 There are about 90 classes/codes in HbaseObjectWritable currently and Byte.MAX_VALUE is 127. In addition, anyone wanting to add custom classes but not break compatibility might want to leave a gap before using codes and that's difficult in such limited space. Eventually we should get rid of this pattern that makes compatibility difficult (better client/server protocol handshake) but we should probably at least bump this to a short for 0.94. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira