Yes. Since this is a minor version update, you should be all set by replacing the packages and restarting the nodes with the same configuration.
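For the package swap itself, a minimal sketch under stated assumptions (the hostnames, package names, and service name below are hypothetical, not taken from this thread; with DRYRUN=1 the script only prints the commands it would run):

```shell
# Hedged sketch of a rolling minor-version package swap: upgrade the CDH
# packages on each node, then restart the daemon with the unchanged config.
# Once on 0.90.3 (cdh3u1), graceful_stop.sh can also move regions off a
# node before its restart. Hostnames/package/service names are assumptions.
DRYRUN=${DRYRUN:-1}
NODES="rs01 rs02 rs03"   # hypothetical region server hosts

run() {
  if [ "$DRYRUN" = "1" ]; then
    echo "$*"            # dry run: print the command instead of executing it
  else
    "$@"                 # real run: execute it
  fi
}

for host in $NODES; do
  run ssh "$host" "yum -y update hadoop-0.20 hadoop-hbase"
  run ssh "$host" "service hadoop-hbase-regionserver restart"
done
```

Run with DRYRUN=0 only after checking the printed commands against your actual package and service names.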
No additional procedure should generally be required when updating between dot versions, because compatibility is maintained :)

On 23-Aug-2011, at 10:26 AM, Xu-Feng Mao wrote:

> Thanks Andy!
>
> cdh3u1 is based on hbase 0.90.3, which has some nice admin scripts, like
> graceful_stop.sh.
> Is it easy to upgrade hbase from cdh3u0 to cdh3u1? I guess we can simply
> replace the package
> with our own configuration, right?
>
> Thanks and regards,
>
> Mao Xu-Feng
>
> On Tue, Aug 23, 2011 at 5:10 AM, Andrew Purtell <[email protected]> wrote:
>
>>> We are running cdh3u0 hbase/hadoop suites on 28 nodes.
>>
>> For your information, CDHU1 does contain this:
>>
>> Author: Eli Collins <[email protected]>
>> Date: Tue Jul 5 16:02:22 2011 -0700
>>
>> HDFS-1836. Thousand of CLOSE_WAIT socket.
>>
>> Reason: Bug
>> Author: Bharath Mundlapudi
>> Ref: CDH-3200
>>
>> Best regards,
>>
>> - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>> ----- Original Message -----
>>> From: Xu-Feng Mao <[email protected]>
>>> To: [email protected]; [email protected]
>>> Cc:
>>> Sent: Monday, August 22, 2011 4:58 AM
>>> Subject: Re: The number of fd and CLOSE_WAIT keep increasing.
>>>
>>> On average, we have about 3000 CLOSE_WAIT, while on the three problematic
>>> regionservers we have about 30k CLOSE_WAIT.
>>> We set the open files limit to 130k, so it works OK for now, but that
>>> doesn't seem healthy.
>>>
>>> On Mon, Aug 22, 2011 at 6:33 PM, Xu-Feng Mao <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> We are running cdh3u0 hbase/hadoop suites on 28 nodes. Since last
>>>> Friday, three regionservers have had their opened fd and CLOSE_WAIT
>>>> counts increasing.
>>>>
>>>> It looks like when lines such as
>>>>
>>>> ====
>>>> 2011-08-22 18:19:01,815 WARN
>>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region
>>>> STable,EStore_box_hwi1QZ4IiEVuJN6_AypqG8MUwRo=,1309931789925.3182d1f48a244bad2e5c97eea0cc9240.
>>>> has too many store files; delaying flush up to 90000ms
>>>> 2011-08-22 18:19:01,815 WARN
>>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region
>>>> STable,EStore_box__dKxQS8qkWqX1XWYIPGIrw4SqSo=,1310033448349.6b480a865e39225016e0815dc336ecf2.
>>>> has too many store files; delaying flush up to 90000ms
>>>> ====
>>>>
>>>> increase, the number of opened fds and CLOSE_WAIT sockets increases
>>>> accordingly.
>>>>
>>>> We're not sure if it's some kind of fd leak under an unexpected
>>>> circumstance or exceptional path.
>>>>
>>>> By netstat -lntp, we found that there are lots of connections like
>>>>
>>>> ====
>>>> Proto Recv-Q Send-Q Local Address        Foreign Address      State      PID/Program name
>>>> tcp       65      0 10.150.161.64:23241  10.150.161.64:50010  CLOSE_WAIT 27748/java
>>>> ====
>>>>
>>>> The connections stay in this state. It seems like some connections to
>>>> hdfs are in a state where the hdfs datanode has sent FIN, but the
>>>> regionservers are blocked on the recv queue, so the fds and CLOSE_WAIT
>>>> sockets are probably leaked.
>>>>
>>>> We also see some logs like
>>>>
>>>> ====
>>>> 2011-08-22 18:19:07,320 INFO org.apache.hadoop.hdfs.DFSClient: Failed to
>>>> connect to /10.150.161.73:50010, add to deadNodes and continue
>>>> java.io.IOException: Got error in response to OP_READ_BLOCK
>>>> self=/10.150.161.64:55229, remote=/10.150.161.73:50010 for file
>>>> /hbase/S3Table/d0d5004792ec47e02665d1f0947be6b6/file/8279698872781984241
>>>> for block 2791681537571770744_132142063
>>>>     at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
>>>>     at java.io.DataInputStream.read(DataInputStream.java:132)
>>>>     at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
>>>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
>>>>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
>>>>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
>>>>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1276)
>>>>     at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
>>>>     at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
>>>>     at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
>>>>     at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
>>>>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:927)
>>>>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733)
>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769)
>>>>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714)
>>>>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
>>>> ====
>>>>
>>>> The number of these is much smaller than the number of "too many store
>>>> files" WARNs, so this might not be the cause of the excess fds, but is
>>>> it dangerous to the whole cluster?
>>>>
>>>> Thanks and regards,
>>>>
>>>> Mao Xu-Feng
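To quantify a leak like the one reported above, a hedged sketch that tallies CLOSE_WAIT sockets per owning PID from `netstat -ntp`-style output; a captured sample line from this thread is piped in so the snippet is self-contained, but on a live node you would pipe the real netstat output instead (and `ls /proc/<pid>/fd | wc -l` gives the matching open-fd count for a suspect regionserver):

```shell
# Tally CLOSE_WAIT sockets per owning PID. In `netstat -ntp` output,
# column 6 is the TCP state and column 7 is "PID/Program name".
count_close_wait() {
  awk '$6 == "CLOSE_WAIT" { split($7, p, "/"); n[p[1]]++ }
       END { for (pid in n) print pid, n[pid] }'
}

# Self-contained demo on a sample line quoted in this thread;
# live use would be: netstat -ntp 2>/dev/null | count_close_wait
sample='tcp 65 0 10.150.161.64:23241 10.150.161.64:50010 CLOSE_WAIT 27748/java'
printf '%s\n' "$sample" | count_close_wait   # -> 27748 1
```

Watching these counts over time (e.g. under `watch`) shows whether the CLOSE_WAIT total is merely high or actually unbounded.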
