> We are running cdh3u0 hbase/hadoop suites on 28 nodes.
For your information, CDHU1 does contain this:

    Author: Eli Collins <[email protected]>
    Date:   Tue Jul 5 16:02:22 2011 -0700

        HDFS-1836. Thousand of CLOSE_WAIT socket.
        Reason: Bug
        Author: Bharath Mundlapudi
        Ref: CDH-3200

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

----- Original Message -----
> From: Xu-Feng Mao <[email protected]>
> To: [email protected]; [email protected]
> Cc:
> Sent: Monday, August 22, 2011 4:58 AM
> Subject: Re: The number of fd and CLOSE_WAIT keep increasing.
>
> On average, we have about 3000 CLOSE_WAIT sockets, while on the three
> problematic regionservers we have about 30k. We set the open files limit
> to 130k, so it works for now, but that is more a workaround than a fix.
>
> On Mon, Aug 22, 2011 at 6:33 PM, Xu-Feng Mao <[email protected]> wrote:
>
>> Hi,
>>
>> We are running the cdh3u0 hbase/hadoop suite on 28 nodes. Since last
>> Friday, three regionservers have had their open fd and CLOSE_WAIT
>> counts increasing steadily.
>>
>> It looks like whenever lines such as
>>
>> ====
>> 2011-08-22 18:19:01,815 WARN
>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region
>> STable,EStore_box_hwi1QZ4IiEVuJN6_AypqG8MUwRo=,1309931789925.3182d1f48a244bad2e5c97eea0cc9240.
>> has too many store files; delaying flush up to 90000ms
>> 2011-08-22 18:19:01,815 WARN
>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region
>> STable,EStore_box__dKxQS8qkWqX1XWYIPGIrw4SqSo=,1310033448349.6b480a865e39225016e0815dc336ecf2.
>> has too many store files; delaying flush up to 90000ms
>> ====
>>
>> appear, the number of open fds and CLOSE_WAIT sockets increases
>> accordingly.
>>
>> We're not sure whether this is an fd leak on some unexpected or
>> exceptional code path.
>>
>> With netstat -ntp, we found lots of connections like
>>
>> ====
>> Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
>> tcp       65      0 10.150.161.64:23241     10.150.161.64:50010     CLOSE_WAIT  27748/java
>> ====
>>
>> and they stay in that state. It looks as if, for some connections to
>> HDFS, the datanode has sent its FIN but the regionserver is never
>> draining the receive queue or closing its end, so the fds and
>> CLOSE_WAIT sockets are probably leaked.
>>
>> We also see some logs like
>>
>> ====
>> 2011-08-22 18:19:07,320 INFO org.apache.hadoop.hdfs.DFSClient: Failed to
>> connect to /10.150.161.73:50010, add to deadNodes and continue
>> java.io.IOException: Got error in response to OP_READ_BLOCK
>> self=/10.150.161.64:55229, remote=/10.150.161.73:50010 for file
>> /hbase/S3Table/d0d5004792ec47e02665d1f0947be6b6/file/8279698872781984241
>> for block 2791681537571770744_132142063
>>     at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
>>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
>>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
>>     at java.io.DataInputStream.read(DataInputStream.java:132)
>>     at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
>>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
>>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
>>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1276)
>>     at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
>>     at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
>>     at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
>>     at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
>>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:927)
>>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733)
>>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769)
>>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714)
>>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
>> ====
>>
>> These errors are much less frequent than the "too many store files"
>> WARNs, so they may not be the cause of the fd growth, but is this
>> dangerous to the whole cluster?
>>
>> Thanks and regards,
>>
>> Mao Xu-Feng
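
A follow-up note for anyone chasing a similar leak: the easiest way to watch a regionserver's fd count over time is to list its /proc/<pid>/fd directory (Linux-only). A minimal sketch; `fd_count` is a hypothetical helper, and it demonstrates on the shell's own PID (`$$`) only so it runs anywhere — against a real cluster you would pass the regionserver's PID, e.g. the 27748 from the netstat output above:

```shell
#!/bin/sh
# Count open file descriptors for a process by listing /proc/<pid>/fd.
# fd_count is a hypothetical helper for illustration; on a regionserver
# host you would pass its PID (27748 in the netstat output above).
fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Demonstrate on this shell's own PID, which always has at least
# stdin, stdout and stderr open.
fd_count "$$"
```

Running this in a loop (e.g. every minute, logging the count) makes it easy to correlate fd growth with the "too many store files" WARNs in the regionserver log.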
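
Similarly, the per-process CLOSE_WAIT counts quoted above can be tallied by piping netstat through awk. A self-contained sketch: `sample_netstat` is a stand-in here-doc mirroring the `netstat -ntp` format shown earlier, so the example runs without a live system:

```shell
#!/bin/sh
# Tally CLOSE_WAIT sockets per owning process from `netstat -ntp`-style
# output. sample_netstat is a stand-in for piping live `netstat -ntp`
# output; the lines below mirror the format quoted in the thread.
sample_netstat() {
cat <<'EOF'
tcp       65      0 10.150.161.64:23241     10.150.161.64:50010     CLOSE_WAIT  27748/java
tcp       65      0 10.150.161.64:23242     10.150.161.64:50010     CLOSE_WAIT  27748/java
tcp        0      0 10.150.161.64:60020     10.150.161.65:41234     ESTABLISHED 27748/java
EOF
}

# Column 6 is the TCP state, column 7 is PID/Program name.
sample_netstat | awk '$6 == "CLOSE_WAIT" { n[$7]++ }
                      END { for (p in n) print p, n[p] }'
# Prints: 27748/java 2
```

Against a live host, replace `sample_netstat` with `netstat -ntp 2>/dev/null`; a sudden jump in one process's count is the signal the thread describes.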
