> We are running cdh3u0 hbase/hadoop suites on 28 nodes.
For your information, CDHU1 does contain this:

    Author: Eli Collins <[email protected]>
    Date:   Tue Jul 5 16:02:22 2011 -0700

        HDFS-1836. Thousand of CLOSE_WAIT socket.
        Reason: Bug
        Author: Bharath Mundlapudi
        Ref: CDH-3200

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

----- Original Message -----
> From: Xu-Feng Mao <[email protected]>
> To: [email protected]; [email protected]
> Cc:
> Sent: Monday, August 22, 2011 4:58 AM
> Subject: Re: The number of fd and CLOSE_WAIT keep increasing.
>
> On average, we have about 3000 CLOSE_WAIT sockets, while on the three
> problematic regionservers we have about 30k. We set the open files limit
> to 130k, so it works for now, but that is more a workaround than a fix.
>
> On Mon, Aug 22, 2011 at 6:33 PM, Xu-Feng Mao <[email protected]> wrote:
>
>> Hi,
>>
>> We are running the cdh3u0 hbase/hadoop suite on 28 nodes. Since last
>> Friday, three regionservers have had their open fd and CLOSE_WAIT
>> counts increasing steadily.
>>
>> It looks like whenever lines such as
>>
>> ====
>> 2011-08-22 18:19:01,815 WARN
>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region
>> STable,EStore_box_hwi1QZ4IiEVuJN6_AypqG8MUwRo=,1309931789925.3182d1f48a244bad2e5c97eea0cc9240.
>> has too many store files; delaying flush up to 90000ms
>> 2011-08-22 18:19:01,815 WARN
>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region
>> STable,EStore_box__dKxQS8qkWqX1XWYIPGIrw4SqSo=,1310033448349.6b480a865e39225016e0815dc336ecf2.
>> has too many store files; delaying flush up to 90000ms
>> ====
>>
>> appear, the number of open fds and CLOSE_WAIT sockets increases
>> accordingly.
>>
>> We're not sure whether this is an fd leak on some unexpected or
>> exceptional code path.
>>
>> With netstat -ntp, we found lots of connections like
>>
>> ====
>> Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
>> tcp       65      0 10.150.161.64:23241     10.150.161.64:50010     CLOSE_WAIT  27748/java
>> ====
>>
>> and they stay in that state. It looks as if, for some connections to
>> HDFS, the datanode has sent its FIN but the regionserver is never
>> draining the receive queue or closing its end, so the fds and
>> CLOSE_WAIT sockets are probably leaked.
>>
>> We also see some logs like
>>
>> ====
>> 2011-08-22 18:19:07,320 INFO org.apache.hadoop.hdfs.DFSClient: Failed to
>> connect to /10.150.161.73:50010, add to deadNodes and continue
>> java.io.IOException: Got error in response to OP_READ_BLOCK
>> self=/10.150.161.64:55229, remote=/10.150.161.73:50010 for file
>> /hbase/S3Table/d0d5004792ec47e02665d1f0947be6b6/file/8279698872781984241
>> for block 2791681537571770744_132142063
>>     at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
>>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
>>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
>>     at java.io.DataInputStream.read(DataInputStream.java:132)
>>     at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
>>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
>>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
>>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1276)
>>     at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
>>     at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
>>     at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
>>     at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
>>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:927)
>>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733)
>>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769)
>>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714)
>>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
>> ====
>>
>> These errors are much less frequent than the "too many store files"
>> WARNs, so they may not be the cause of the fd growth, but is this
>> dangerous to the whole cluster?
>>
>> Thanks and regards,
>>
>> Mao Xu-Feng
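
A follow-up note for anyone chasing a similar leak: the easiest way to watch a regionserver's fd count over time is to list its /proc/<pid>/fd directory (Linux-only). A minimal sketch; `fd_count` is a hypothetical helper, and it demonstrates on the shell's own PID (`$$`) only so it runs anywhere — against a real cluster you would pass the regionserver's PID, e.g. the 27748 from the netstat output above:

```shell
#!/bin/sh
# Count open file descriptors for a process by listing /proc/<pid>/fd.
# fd_count is a hypothetical helper for illustration; on a regionserver
# host you would pass its PID (27748 in the netstat output above).
fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Demonstrate on this shell's own PID, which always has at least
# stdin, stdout and stderr open.
fd_count "$$"
```

Running this in a loop (e.g. every minute, logging the count) makes it easy to correlate fd growth with the "too many store files" WARNs in the regionserver log.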
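
Similarly, the per-process CLOSE_WAIT counts quoted above can be tallied by piping netstat through awk. A self-contained sketch: `sample_netstat` is a stand-in here-doc mirroring the `netstat -ntp` format shown earlier, so the example runs without a live system:

```shell
#!/bin/sh
# Tally CLOSE_WAIT sockets per owning process from `netstat -ntp`-style
# output. sample_netstat is a stand-in for piping live `netstat -ntp`
# output; the lines below mirror the format quoted in the thread.
sample_netstat() {
cat <<'EOF'
tcp       65      0 10.150.161.64:23241     10.150.161.64:50010     CLOSE_WAIT  27748/java
tcp       65      0 10.150.161.64:23242     10.150.161.64:50010     CLOSE_WAIT  27748/java
tcp        0      0 10.150.161.64:60020     10.150.161.65:41234     ESTABLISHED 27748/java
EOF
}

# Column 6 is the TCP state, column 7 is PID/Program name.
sample_netstat | awk '$6 == "CLOSE_WAIT" { n[$7]++ }
                      END { for (p in n) print p, n[p] }'
# Prints: 27748/java 2
```

Against a live host, replace `sample_netstat` with `netstat -ntp 2>/dev/null`; a sudden jump in one process's count is the signal the thread describes.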
