On average, each regionserver has about 3,000 sockets in CLOSE_WAIT, while the three problematic regionservers have about 30k. We set the open files limit to 130k, so things keep working for now, but that looks like a workaround rather than a fix.
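For keeping an eye on this, here is a minimal sketch of how a JVM's fd usage can be watched from inside the process itself. This is not HBase code; the class name FdWatch is made up, and it assumes a Sun/OpenJDK JVM on Unix, where the platform OperatingSystemMXBean implements com.sun.management.UnixOperatingSystemMXBean:

====
import java.lang.management.ManagementFactory;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdWatch {
    public static void main(String[] args) throws InterruptedException {
        // On Sun/OpenJDK JVMs on Unix the platform MXBean can be cast to
        // UnixOperatingSystemMXBean, which exposes fd counters.
        UnixOperatingSystemMXBean os =
                (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        while (true) {
            long open = os.getOpenFileDescriptorCount();
            long max = os.getMaxFileDescriptorCount(); // the ulimit, e.g. 130k
            System.out.printf("open fds: %d / %d (%.1f%%)%n",
                    open, max, 100.0 * open / max);
            Thread.sleep(60000L); // sample once a minute
        }
    }
}
====

The same numbers can of course be read externally from /proc/<pid>/fd, but an in-process counter is easy to hook into existing metrics. For the CLOSE_WAIT mechanism itself, see the sketch below the quoted message.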
On Mon, Aug 22, 2011 at 6:33 PM, Xu-Feng Mao <[email protected]> wrote:

> Hi,
>
> We are running the cdh3u0 hbase/hadoop suite on 28 nodes. Since last
> Friday, three regionservers have had their open fd and CLOSE_WAIT counts
> keep increasing.
>
> It looks like whenever lines such as
>
> ====
> 2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region STable,EStore_box_hwi1QZ4IiEVuJN6_AypqG8MUwRo=,1309931789925.3182d1f48a244bad2e5c97eea0cc9240. has too many store files; delaying flush up to 90000ms
> 2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region STable,EStore_box__dKxQS8qkWqX1XWYIPGIrw4SqSo=,1310033448349.6b480a865e39225016e0815dc336ecf2. has too many store files; delaying flush up to 90000ms
> ====
>
> increase, the number of open fds and CLOSE_WAIT sockets increases
> accordingly.
>
> We're not sure whether this is an fd leak under some unexpected
> circumstance or exceptional code path.
>
> With netstat -lntp, we found lots of connections like
>
> ====
> Proto Recv-Q Send-Q Local Address          Foreign Address        State       PID/Program name
> tcp       65      0 10.150.161.64:23241    10.150.161.64:50010    CLOSE_WAIT  27748/java
> ====
>
> These connections stay in that state. It looks as if, for some connections
> to hdfs, the datanode has sent its FIN but the regionserver is still
> blocked on the receive queue, so the fds and CLOSE_WAIT sockets are
> probably leaked.
>
> We also see some logs like
>
> ====
> 2011-08-22 18:19:07,320 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect to /10.150.161.73:50010, add to deadNodes and continue
> java.io.IOException: Got error in response to OP_READ_BLOCK self=/10.150.161.64:55229, remote=/10.150.161.73:50010 for file /hbase/S3Table/d0d5004792ec47e02665d1f0947be6b6/file/8279698872781984241 for block 2791681537571770744_132142063
>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
>         at java.io.DataInputStream.read(DataInputStream.java:132)
>         at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1276)
>         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
>         at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
>         at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
>         at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:927)
>         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733)
>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769)
>         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714)
>         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> ====
>
> These errors are much less frequent than the "too many store files" WARNs,
> so they may not be the cause of the excess fds, but are they dangerous to
> the whole cluster?
>
> Thanks and regards,
>
> Mao Xu-Feng
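To make the CLOSE_WAIT theory above concrete: CLOSE_WAIT means the remote side (here the datanode) has sent its FIN and the local TCP stack has acknowledged it, but the local application has not yet called close(), so the fd stays allocated until the process closes it or exits. Here is a self-contained toy, not HBase/HDFS code (the class name CloseWaitDemo is made up), that reproduces the state on localhost:

====
import java.net.ServerSocket;
import java.net.Socket;

public class CloseWaitDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the datanode: accept one connection, then close it,
        // which sends a FIN to the peer.
        ServerSocket server = new ServerSocket(0);

        // Stand-in for the regionserver: connect and never call close().
        Socket leaked = new Socket("127.0.0.1", server.getLocalPort());
        server.accept().close(); // FIN arrives at 'leaked'

        // As long as 'leaked' stays referenced and unclosed, netstat -antp
        // shows this connection in CLOSE_WAIT and its fd stays open.
        while (true) {
            Thread.sleep(60000L);
            // Touch the socket so the GC cannot finalize-and-close it.
            System.out.println("still holding local port " + leaked.getLocalPort());
        }
    }
}
====

If the regionserver's DFSClient keeps read sockets referenced without closing them after the datanode hangs up, each one behaves like 'leaked' above, which would match the pattern you describe.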
