Hi,
  I'm using HBase 0.20.5 and Hadoop 0.20.1. Some region servers are crashing, 
saying that a file cannot be found and that a lease has expired (log detail 
below). I searched this mailing list for the exact problem but was not 
successful. These are the symptoms:

- Typically I see higher swap usage on these servers than on the ones that 
didn't crash.
- I don't see any xceiver-count-exceeded messages in the DataNode logs.
- Please find log snippets as well as configuration settings below.
- We don't write to WALs (we skip them on the client side; rough sketch below).
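
For reference, this is roughly how we skip the WAL on the client side (just an 
illustrative sketch; the row key, qualifier and value below are placeholders, 
the table and family names are the ones from the logs):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class NoWalPutExample {
  public static void main(String[] args) throws Exception {
    // Picks up hbase-site.xml from the classpath
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable table = new HTable(conf, "usertrails_all");   // table name from the logs
    Put put = new Put(Bytes.toBytes("some-row-key"));    // placeholder row key
    // Family "d" as in the logs; qualifier and value are placeholders
    put.add(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes("v"));
    put.setWriteToWAL(false);                            // skip the write-ahead log for this Put
    table.put(put);
    table.flushCommits();                                // flush the client-side write buffer
  }
}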

I would greatly appreciate any hints on this.

Many thanks,
   Martin

======================

These are some of my settings that I think might be relevant:

hdfs-site:
 <property>
   <name>dfs.datanode.max.xcievers</name>
   <value>10000</value>
 </property>

hbase-site:

 <property>
   <name>hbase.regionserver.handler.count</name>
   <value>50</value>
 </property>

 <property>
   <name>hbase.regionserver.global.memstore.upperLimit</name>
   <value>0.3</value>
 </property>
 <property>
   <name>hfile.block.cache.size</name>
   <value>0.3</value>
 </property>


ulimit -n
100000


LOGS:  

Region Server:

2010-12-27 08:35:31,932 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting, commencing flushing stores
2010-12-27 08:35:32,426 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://ls-nn05.netw.domain01.net:50001/hbase/usertrails_all/1777872268/d/2971304353461036329, entries=22319, sequenceid=1717292326, memsize=16.1m, filesize=13.5m to usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
2010-12-27 08:35:32,426 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Caches flushed, doing commit now (which includes update scanners)
2010-12-27 08:35:32,426 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~16.1m for region usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406 in 494ms, sequence id=1717292326, compaction requested=false
2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region, no outstanding scanners on usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: No more row locks outstanding on region usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.Store: closed d
2010-12-27 08:35:32,427 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.HLog: closing hlog writer in hdfs://ls-nn05.netw.domain01.net:50001/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337
2010-12-27 08:35:32,497 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959 File does not exist. Holder DFSClient_52782511 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1328)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1319)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1247)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:434)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
        at org.apache.hadoop.ipc.Client.call(Client.java:739)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy1.addBlock(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy1.addBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2906)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2788)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)

2010-12-27 08:35:32,497 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
2010-12-27 08:35:32,497 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959" - Aborting...
2010-12-27 08:35:32,499 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Close and delete failed
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959 File does not exist. Holder DFSClient_52782511 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1328)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1319)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1247)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:434)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
        at org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:626)
        at java.lang.Thread.run(Thread.java:619)
2010-12-27 08:35:32,500 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: telling master that region server is shutting down at: ls-hdfs24.netw.domain01.net,60020,1292875777337
2010-12-27 08:35:32,508 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server at: ls-hdfs24.netw.domain01.net,60020,1292875777337
2010-12-27 08:35:32,509 INFO org.apache.zookeeper.ZooKeeper: Closing session: 0x2d05625f9b0008
2010-12-27 08:35:32,509 INFO org.apache.zookeeper.ClientCnxn: Closing ClientCnxn for session: 0x2d05625f9b0008
2010-12-27 08:35:32,512 INFO org.apache.zookeeper.ClientCnxn: Disconnecting ClientCnxn for session: 0x2d05625f9b0008
2010-12-27 08:35:32,512 INFO org.apache.zookeeper.ZooKeeper: Session: 0x2d05625f9b0008 closed
2010-12-27 08:35:32,512 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with ZooKeeper
2010-12-27 08:35:32,512 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2010-12-27 08:35:32,614 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver/10.32.20.70:60020 exiting
2010-12-27 08:35:49,448 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner -4947199215968218273 lease expired
2010-12-27 08:35:49,538 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 3282292946771297386 lease expired
2010-12-27 08:35:49,582 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 7882165461457343907 lease expired
2010-12-27 08:35:49,582 INFO org.apache.hadoop.hbase.Leases: regionserver/10.32.20.70:60020.leaseChecker closing leases
2010-12-27 08:35:49,582 INFO org.apache.hadoop.hbase.Leases: regionserver/10.32.20.70:60020.leaseChecker closed leases
2010-12-27 08:35:49,607 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
2010-12-27 08:35:49,607 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete




Namenode:


2010-12-27 08:35:32,427 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 50001, call addBlock(/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959, DFSClient_52782511) from 10.32.20.70:44967: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959 File does not exist. Holder DFSClient_52782511 does not have any open files.
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959 File does not exist. Holder DFSClient_52782511 does not have any open files.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1328)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1319)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1247)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:434)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
        
        
        
Datanode:

Nothing too unusual; just some entries like the following:

2010-12-27 08:35:52,519 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.32.20.70:50010, storageID=DS-2137188581-173.192.229.228-50010-1273698157328, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketException: Connection reset
        at java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:261)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
        at java.lang.Thread.run(Thread.java:619)
2010-12-27 09:35:52,728 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-4708329089501359494_30620223 received exception java.net.SocketException: Connection reset

