Thanks Andy! cdh3u1 is based on hbase 0.90.3, which has some nice admin scripts like graceful_stop.sh. Is it easy to upgrade hbase from cdh3u0 to cdh3u1? I guess we can simply swap in the new packages and keep our own configuration, right?
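
By the way, since graceful_stop.sh came up: this is roughly how we expect to drain and restart a single regionserver with it (just a sketch; the hostname is a made-up placeholder and the flags may differ in 0.90.3, so check the usage text in bin/graceful_stop.sh first):

====
# Drain regions off one regionserver, restart it, then move them back.
# region-node-07.example.com is a placeholder hostname.
$HBASE_HOME/bin/graceful_stop.sh --restart --reload region-node-07.example.com
# --restart  start the regionserver again after the graceful stop
# --reload   move the offloaded regions back onto it once it is up
====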
Thanks and regards,

Mao Xu-Feng

On Tue, Aug 23, 2011 at 5:10 AM, Andrew Purtell <[email protected]> wrote:

> > We are running cdh3u0 hbase/hadoop suites on 28 nodes.
>
> For your information, CDHU1 does contain this:
>
> Author: Eli Collins <[email protected]>
> Date: Tue Jul 5 16:02:22 2011 -0700
>
> HDFS-1836. Thousand of CLOSE_WAIT socket.
>
> Reason: Bug
> Author: Bharath Mundlapudi
> Ref: CDH-3200
>
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
> ----- Original Message -----
> > From: Xu-Feng Mao <[email protected]>
> > To: [email protected]; [email protected]
> > Cc:
> > Sent: Monday, August 22, 2011 4:58 AM
> > Subject: Re: The number of fd and CLOSE_WAIT keep increasing.
> >
> > On average, we have about 3000 CLOSE_WAIT, while on the three problematic
> > regionservers, we have about 30k CLOSE_WAIT.
> > We set the open-files limit to 130k, so things work for now, but that
> > hardly seems healthy.
> >
> > On Mon, Aug 22, 2011 at 6:33 PM, Xu-Feng Mao <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> We are running cdh3u0 hbase/hadoop suites on 28 nodes. Since last Friday,
> >> three regionservers have had their open fd and CLOSE_WAIT counts
> >> steadily increasing.
> >>
> >> It looks like whenever log lines such as
> >>
> >> ====
> >> 2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
> >> Region STable,EStore_box_hwi1QZ4IiEVuJN6_AypqG8MUwRo=,1309931789925.3182d1f48a244bad2e5c97eea0cc9240.
> >> has too many store files; delaying flush up to 90000ms
> >> 2011-08-22 18:19:01,815 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher:
> >> Region STable,EStore_box__dKxQS8qkWqX1XWYIPGIrw4SqSo=,1310033448349.6b480a865e39225016e0815dc336ecf2.
> >> has too many store files; delaying flush up to 90000ms
> >> ====
> >>
> >> increase, the number of open fds and CLOSE_WAIT sockets increases
> >> accordingly.
> >>
> >> We're not sure whether this is an fd leak on some unexpected or
> >> exceptional code path.
> >>
> >> With netstat -lntp, we found lots of connections like
> >>
> >> ====
> >> Proto Recv-Q Send-Q Local Address        Foreign Address      State       PID/Program name
> >> tcp       65      0 10.150.161.64:23241  10.150.161.64:50010  CLOSE_WAIT  27748/java
> >> ====
> >>
> >> These connections stay in that state. It looks as if, on some connections
> >> to hdfs, the datanode has sent FIN but the regionserver is still blocked
> >> on the receive queue, so the fds and CLOSE_WAIT sockets are probably
> >> leaked.
> >>
> >> We also see some logs like
> >>
> >> ====
> >> 2011-08-22 18:19:07,320 INFO org.apache.hadoop.hdfs.DFSClient: Failed to
> >> connect to /10.150.161.73:50010, add to deadNodes and continue
> >> java.io.IOException: Got error in response to OP_READ_BLOCK
> >> self=/10.150.161.64:55229, remote=/10.150.161.73:50010 for file
> >> /hbase/S3Table/d0d5004792ec47e02665d1f0947be6b6/file/8279698872781984241
> >> for block 2791681537571770744_132142063
> >>     at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
> >>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
> >>     at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
> >>     at java.io.DataInputStream.read(DataInputStream.java:132)
> >>     at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
> >>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> >>     at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> >>     at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
> >>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
> >>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
> >>     at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1276)
> >>     at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
> >>     at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
> >>     at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
> >>     at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
> >>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:927)
> >>     at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733)
> >>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769)
> >>     at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714)
> >>     at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> >> ====
> >>
> >> The number of these is much smaller than the number of "too many store
> >> files" WARNs, so this might not be the cause of the excess fds, but is
> >> it dangerous to the whole cluster?
> >>
> >> Thanks and regards,
> >>
> >> Mao Xu-Feng
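
For anyone else chasing a leak like this: a quick way to watch it from the regionserver side is to break the CLOSE_WAIT counts down by datanode. A rough sketch with standard tools (pid 27748 is the regionserver java process from the netstat output quoted above; substitute your own pid):

====
# CLOSE_WAIT sockets held by the regionserver, grouped by the datanode
# (foreign address) they point at.
netstat -antp 2>/dev/null | awk '$6 == "CLOSE_WAIT" && $7 ~ /^27748\// {print $5}' \
  | sort | uniq -c | sort -rn | head

# Total open fds for the process, to compare against the 130k ulimit.
ls /proc/27748/fd | wc -l
====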
