On average, each regionserver has about 3,000 sockets in CLOSE_WAIT, while the three problematic regionservers have about 30k. We set the open files limit to 130k, so things keep working for now, but that looks like a workaround rather than a fix.
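For keeping an eye on this, here is a minimal sketch of how a JVM's fd usage can be watched from inside the process itself. This is not HBase code; the class name FdWatch is made up, and it assumes a Sun/OpenJDK JVM on Unix, where the platform OperatingSystemMXBean implements com.sun.management.UnixOperatingSystemMXBean:

====
import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdWatch {
    public static void main(String[] args) throws InterruptedException {
        // On Sun/OpenJDK JVMs on Unix the platform MXBean can be cast to
        // UnixOperatingSystemMXBean, which exposes fd counters.
        UnixOperatingSystemMXBean os =
                (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        while (true) {
            long open = os.getOpenFileDescriptorCount();
            long max = os.getMaxFileDescriptorCount(); // the ulimit, e.g. 130k
            System.out.printf("open fds: %d / %d (%.1f%%)%n",
                    open, max, 100.0 * open / max);
            Thread.sleep(60000L); // sample once a minute
        }
    }
}
====

The same numbers can of course be read externally from /proc/<pid>/fd, but an in-process counter is easy to hook into existing metrics. For the CLOSE_WAIT mechanism itself, see the sketch below the quoted message.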
On Mon, Aug 22, 2011 at 6:33 PM, Xu-Feng Mao <[email protected]> wrote:

> Hi,
>
> We are running the cdh3u0 hbase/hadoop suite on 28 nodes. Since last
> Friday, three regionservers have had their open fd and CLOSE_WAIT counts
> keep increasing.
>
> It looks like whenever lines such as
>
> ====
> 2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region STable,EStore_box_hwi1QZ4IiEVuJN6_AypqG8MUwRo=,1309931789925.3182d1f48a244bad2e5c97eea0cc9240. has too many store files; delaying flush up to 90000ms
> 2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region STable,EStore_box__dKxQS8qkWqX1XWYIPGIrw4SqSo=,1310033448349.6b480a865e39225016e0815dc336ecf2. has too many store files; delaying flush up to 90000ms
> ====
>
> increase, the number of open fds and CLOSE_WAIT sockets increases
> accordingly.
>
> We're not sure whether this is an fd leak under some unexpected
> circumstance or exceptional code path.
>
> With netstat -lntp, we found lots of connections like
>
> ====
> Proto Recv-Q Send-Q Local Address          Foreign Address        State       PID/Program name
> tcp       65      0 10.150.161.64:23241    10.150.161.64:50010    CLOSE_WAIT  27748/java
> ====
>
> These connections stay in that state. It looks as if, for some connections
> to hdfs, the datanode has sent its FIN but the regionserver is still
> blocked on the receive queue, so the fds and CLOSE_WAIT sockets are
> probably leaked.
>
> We also see some logs like
>
> ====
> 2011-08-22 18:19:07,320 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.150.161.73:50010, add to deadNodes and continue
> java.io.IOException: Got error in response to OP_READ_BLOCK self=/10.150.161.64:55229, remote=/10.150.161.73:50010 for file /hbase/S3Table/d0d5004792ec47e02665d1f0947be6b6/file/8279698872781984241 for block 2791681537571770744_132142063
>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
>         at java.io.DataInputStream.read(DataInputStream.java:132)
>         at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1276)
>         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
>         at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
>         at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
>         at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:927)
>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733)
>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769)
>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714)
>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> ====
>
> These errors are much less frequent than the "too many store files" WARNs,
> so they may not be the cause of the excess fds, but are they dangerous to
> the whole cluster?
>
> Thanks and regards,
>
> Mao Xu-Feng
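To make the CLOSE_WAIT theory above concrete: CLOSE_WAIT means the remote side (here the datanode) has sent its FIN and the local TCP stack has acknowledged it, but the local application has not yet called close(), so the fd stays allocated until the process closes it or exits. Here is a self-contained toy, not HBase/HDFS code (the class name CloseWaitDemo is made up), that reproduces the state on localhost:

====
import java.net.ServerSocket;
import java.net.Socket;

public class CloseWaitDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the datanode: accept one connection, then close it,
        // which sends a FIN to the peer.
        ServerSocket server = new ServerSocket(0);

        // Stand-in for the regionserver: connect and never call close().
        Socket leaked = new Socket("127.0.0.1", server.getLocalPort());
        server.accept().close(); // FIN arrives at 'leaked'

        // As long as 'leaked' stays referenced and unclosed, netstat -antp
        // shows this connection in CLOSE_WAIT and its fd stays open.
        while (true) {
            Thread.sleep(60000L);
            // Touch the socket so the GC cannot finalize-and-close it.
            System.out.println("still holding local port " + leaked.getLocalPort());
        }
    }
}
====

If the regionserver's DFSClient keeps read sockets referenced without closing them after the datanode hangs up, each one behaves like 'leaked' above, which would match the pattern you describe.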
