Hi, I'm using HBase 0.20.5 and Hadoop 0.20.1. Some region servers are crashing, saying that a file cannot be found and that a lease has expired (log details below). I searched this mailing list for this exact problem but was not successful. These are the symptoms:
- Typically I see higher swap usage on these servers than on the rest (the ones that didn't crash).
- I don't see any "xceiver count exceeded" messages in the DataNode logs.
- Log snippets as well as configuration settings are below.
- We don't write to WALs.

I would greatly appreciate any hint in this regard.

Many thanks,
Martin

======================
These are some of my settings which I think might be relevant:

hdfs-site:
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>10000</value>
  </property>

hbase-site:
  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>50</value>
  </property>
  <property>
    <name>hbase.regionserver.global.memstore.upperLimit</name>
    <value>0.3</value>
  </property>
  <property>
    <name>hfile.block.cache.size</name>
    <value>0.3</value>
  </property>

ulimit -n 100000

LOGS:

Region Server:

2010-12-27 08:35:31,932 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting, commencing flushing stores
2010-12-27 08:35:32,426 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://ls-nn05.netw.domain01.net:50001/hbase/usertrails_all/1777872268/d/2971304353461036329, entries=22319, sequenceid=1717292326, memsize=16.1m, filesize=13.5m to usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
2010-12-27 08:35:32,426 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Caches flushed, doing commit now (which includes update scanners)
2010-12-27 08:35:32,426 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~16.1m for region usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406 in 494ms, sequence id=1717292326, compaction requested=false
2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region, no outstanding scanners on usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: No more row locks outstanding on region usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.Store: closed d
2010-12-27 08:35:32,427 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.HLog: closing hlog writer in hdfs://ls-nn05.netw.domain01.net:50001/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337
2010-12-27 08:35:32,497 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959 File does not exist. Holder DFSClient_52782511 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1328)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1319)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1247)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:434)
    at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
    at org.apache.hadoop.ipc.Client.call(Client.java:739)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
    at $Proxy1.addBlock(Unknown Source)
    at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy1.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2906)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2788)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
2010-12-27 08:35:32,497 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
2010-12-27 08:35:32,497 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959" - Aborting...
2010-12-27 08:35:32,499 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Close and delete failed
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959 File does not exist. Holder DFSClient_52782511 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1328)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1319)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1247)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:434)
    at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
    at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:626)
    at java.lang.Thread.run(Thread.java:619)
2010-12-27 08:35:32,500 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: telling master that region server is shutting down at: ls-hdfs24.netw.domain01.net,60020,1292875777337
2010-12-27 08:35:32,508 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server at: ls-hdfs24.netw.domain01.net,60020,1292875777337
2010-12-27 08:35:32,509 INFO org.apache.zookeeper.ZooKeeper: Closing session: 0x2d05625f9b0008
2010-12-27 08:35:32,509 INFO org.apache.zookeeper.ClientCnxn: Closing ClientCnxn for session: 0x2d05625f9b0008
2010-12-27 08:35:32,512 INFO org.apache.zookeeper.ClientCnxn: Disconnecting ClientCnxn for session: 0x2d05625f9b0008
2010-12-27 08:35:32,512 INFO org.apache.zookeeper.ZooKeeper: Session: 0x2d05625f9b0008 closed
2010-12-27 08:35:32,512 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with ZooKeeper
2010-12-27 08:35:32,512 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2010-12-27 08:35:32,614 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/10.32.20.70:60020 exiting
2010-12-27 08:35:49,448 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -4947199215968218273 lease expired
2010-12-27 08:35:49,538 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 3282292946771297386 lease expired
2010-12-27 08:35:49,582 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 7882165461457343907 lease expired
2010-12-27 08:35:49,582 INFO org.apache.hadoop.hbase.Leases: regionserver/10.32.20.70:60020.leaseChecker closing leases
2010-12-27 08:35:49,582 INFO org.apache.hadoop.hbase.Leases: regionserver/10.32.20.70:60020.leaseChecker closed leases
2010-12-27 08:35:49,607 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
2010-12-27 08:35:49,607 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete

Namenode:

2010-12-27 08:35:32,427 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 50001, call addBlock(/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959, DFSClient_52782511) from 10.32.20.70:44967: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959 File does not exist. Holder DFSClient_52782511 does not have any open files.
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959 File does not exist. Holder DFSClient_52782511 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1328)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1319)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1247)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:434)
    at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

Datanode: Not too unusual. Some entries like the following:

2010-12-27 08:35:52,519 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.32.20.70:50010, storageID=DS-2137188581-173.192.229.228-50010-1273698157328, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:168)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
    at java.io.DataInputStream.read(DataInputStream.java:132)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:261)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
    at java.lang.Thread.run(Thread.java:619)
2010-12-27 09:35:52,728 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-4708329089501359494_30620223 received exception java.net.SocketException: Connection reset
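In case it helps, this is roughly how the swap and xceiver observations above were checked on each host. The DataNode log path is just an example of our layout, and the grep pattern is only an assumption about the usual "exceeds the limit of concurrent xcievers" wording; adjust both for your install.

```shell
# Swap usage on this host (the crashed region servers show higher numbers here):
awk '/^Swap(Total|Free):/ {print $1, $2, $3}' /proc/meminfo

# Open-file limit for the current user (we configured 100000):
ulimit -n

# Look for the xceiver-limit message in the DataNode log.
# Log path is an example; pattern assumes the usual 0.20-era message text.
# This prints nothing on our nodes.
DN_LOG=/var/log/hadoop/hadoop-hadoop-datanode-$(hostname).log
[ -f "$DN_LOG" ] && grep -i "exceeds the limit of concurrent xcievers" "$DN_LOG"
```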