This sounds like it could be the dreaded 'xciever count' issue. Threads are
your resources here. See:

http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A5

Let me know if you see anything like that.

On Tue, Nov 9, 2010 at 2:22 PM, Seraph Imalia <[email protected]> wrote:
> Hi Ryan,
>
> Thanks for replying so soon.
>
> Whatever it was, it has stopped happening, so I am breathing normally
> again and it is not so urgent anymore. I need to try to figure out what
> caused this, though. I get the feeling it is server-resource related -
> almost like something using the HDD or CPU heavily. atop did not show
> anything unusual, but the one regionserver/datanode was sluggish while I
> was debugging the problem. It has stopped being sluggish, and it seems
> too much of a coincidence that it was sluggish at the same time hbase
> gave those errors. Also, the mention of codec and compression in the
> logs makes me think it is related to CPU rather than HDD. Syslog and
> kernel logs also reveal nothing unusual. Any ideas on how to figure out
> what happened?
>
> Logs in hadoop seem normal. Both datanodes are showing the following:
>
> 2010-11-10 00:06:48,510 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.2.97:50010, dest: /192.168.2.97:36783, bytes: 15480, op: HDFS_READ, cliID: DFSClient_1620748290, srvID: DS-1090448426-192.168.2.97-50010-1282311128239, blockid: blk_3714134476848125077_129818
> 2010-11-10 00:06:48,621 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.2.97:50010, dest: /192.168.2.97:36784, bytes: 516, op: HDFS_READ, cliID: DFSClient_1620748290, srvID: DS-1090448426-192.168.2.97-50010-1282311128239, blockid: blk_3714134476848125077_129818
> 2010-11-10 00:06:48,688 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.2.97:50010, dest: /192.168.2.97:36785, bytes: 516, op: HDFS_READ, cliID: DFSClient_1620748290, srvID: DS-1090448426-192.168.2.97-50010-1282311128239, blockid: blk_3714134476848125077_129818
> 2010-11-10 00:06:48,791 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.2.97:50010, dest: /192.168.2.97:36786, bytes: 516, op: HDFS_READ, cliID: DFSClient_1620748290, srvID: DS-1090448426-192.168.2.97-50010-1282311128239, blockid: blk_3714134476848125077_129818
> 2010-11-10 00:06:48,940 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.2.97:50010, dest: /192.168.2.97:36787, bytes: 516, op: HDFS_READ, cliID: DFSClient_1620748290, srvID: DS-1090448426-192.168.2.97-50010-1282311128239, blockid: blk_3714134476848125077_129818
> 2010-11-10 00:06:49,039 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.2.97:50010, dest: /192.168.2.97:36788, bytes: 516, op: HDFS_READ, cliID: DFSClient_1620748290, srvID: DS-1090448426-192.168.2.97-50010-1282311128239, blockid: blk_3714134476848125077_129818
> 2010-11-10 00:06:49,110 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.2.97:50010, dest: /192.168.2.97:36789, bytes: 516, op: HDFS_READ, cliID: DFSClient_1620748290, srvID: DS-1090448426-192.168.2.97-50010-1282311128239, blockid: blk_3714134476848125077_129818
> 2010-11-10 00:06:49,204 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.2.97:50010, dest: /192.168.2.97:36790, bytes: 516, op: HDFS_READ, cliID: DFSClient_1620748290, srvID: DS-1090448426-192.168.2.97-50010-1282311128239, blockid: blk_3714134476848125077_129818
> 2010-11-10 00:06:49,291 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.2.97:50010, dest: /192.168.2.97:36791, bytes: 516, op: HDFS_READ, cliID: DFSClient_1620748290, srvID: DS-1090448426-192.168.2.97-50010-1282311128239, blockid: blk_3714134476848125077_129818
> 2010-11-10 00:06:49,375 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.2.97:50010, dest: /192.168.2.97:36792, bytes: 1548, op: HDFS_READ, cliID: DFSClient_1620748290, srvID: DS-1090448426-192.168.2.97-50010-1282311128239, blockid: blk_3714134476848125077_129818
> 2010-11-10 00:06:49,449 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.2.97:50010, dest: /192.168.2.97:36793, bytes: 516, op: HDFS_READ, cliID: DFSClient_1620748290, srvID: DS-1090448426-192.168.2.97-50010-1282311128239, blockid: blk_3714134476848125077_129818
> 2010-11-10 00:06:49,555 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.2.97:50010, dest: /192.168.2.97:36794, bytes: 516, op:
>
> Namenode looks like this:
>
> 2010-11-10 00:03:17,947 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 192.168.2.90
> 2010-11-10 00:05:47,774 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase dst=null perm=null
> 2010-11-10 00:05:47,775 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/-ROOT- dst=null perm=null
> 2010-11-10 00:05:47,775 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/.META. dst=null perm=null
> 2010-11-10 00:05:47,776 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/ChannelUIDTable dst=null perm=null
> 2010-11-10 00:05:47,777 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/UrlIndex dst=null perm=null
> 2010-11-10 00:05:47,820 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/UrlIndex-hostCount dst=null perm=null
> 2010-11-10 00:05:47,820 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/UrlIndex-indexHost dst=null perm=null
> 2010-11-10 00:05:47,864 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/UrlIndex-indexUrlUID dst=null perm=null
> 2010-11-10 00:08:17,953 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 192.168.2.90
> 2010-11-10 00:10:43,052 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase dst=null perm=null
> 2010-11-10 00:10:43,053 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/-ROOT- dst=null perm=null
> 2010-11-10 00:10:43,054 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/.META. dst=null perm=null
> 2010-11-10 00:10:43,054 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/ChannelUIDTable dst=null perm=null
> 2010-11-10 00:10:43,056 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/UrlIndex dst=null perm=null
> 2010-11-10 00:10:43,100 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/UrlIndex-hostCount dst=null perm=null
> 2010-11-10 00:10:43,101 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/UrlIndex-indexHost dst=null perm=null
> 2010-11-10 00:10:43,143 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=root,root ip=/192.168.2.97 cmd=listStatus src=/hbase/UrlIndex-indexUrlUID dst=null perm=null
> 2010-11-10 00:13:17,960 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 192.168.2.90
>
> Regards,
> Seraph
>
> On 2010/11/10 12:08 AM, "Ryan Rawson" <[email protected]> wrote:
>
>> Looks like you are running into HDFS issues, can you check the
>> datanode logs for errors?
>>
>> -ryan
>>
>> On Tue, Nov 9, 2010 at 2:06 PM, Seraph Imalia <[email protected]> wrote:
>>> Hi,
>>>
>>> Some more info: That same region server just showed the following in
>>> the logs too - hope this explains it?
>>>
>>> Regards,
>>> Seraph
>>>
>>> 649681515:java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/192.168.2.97:50010]
>>>   at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
>>>   at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>>>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:1848)
>>>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1922)
>>>   at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:46)
>>>   at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:101)
>>>   at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:88)
>>>   at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:81)
>>>   at org.apache.hadoop.io.compress.BlockDecompressorStream.rawReadInt(BlockDecompressorStream.java:121)
>>>   at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:66)
>>>   at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
>>>   at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>>   at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>   at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:100)
>>>   at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1018)
>>>   at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:966)
>>>   at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1159)
>>>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:58)
>>>   at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:79)
>>>   at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:236)
>>>   at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:106)
>>>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.nextInternal(HRegion.java:1915)
>>>   at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.next(HRegion.java:1879)
>>>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2500)
>>>   at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:2486)
>>>   at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1733)
>>>   at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:657)
>>>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
>>>
>>> 2010-11-10 00:03:57,903 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=66.45012MB (69678000), Free=341.48737MB (358075472), Max=407.9375MB (427753472), Counts: Blocks=2147, Access=42032, Hit=39143, Miss=2889, Evictions=0, Evicted=0, Ratios: Hit Ratio=93.12666654586792%, Miss Ratio=6.8733349442481995%, Evicted/Run=NaN
>>> 2010-11-10 00:04:57,903 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=69.27812MB (72643376), Free=338.65936MB (355110096), Max=407.9375MB (427753472), Counts: Blocks=2192, Access=43926, Hit=40999, Miss=2927, Evictions=0, Evicted=0, Ratios: Hit Ratio=93.33652257919312%, Miss Ratio=6.663479655981064%, Evicted/Run=NaN
>>>
>>> On 2010/11/09 11:59 PM, "Seraph Imalia" <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> One of our region servers keeps doing the following - it only started
>>>> doing this 40 minutes ago. Our clients are able to get data from
>>>> hBase, but after a short while, threads lock up and they start
>>>> waiting indefinitely for data to be returned. What is wrong? What do
>>>> we do? I am desperate, please help as quick as you can.
>>>>
>>>> Regards,
>>>> Seraph
>>>>
>>>> 2010-11-09 23:49:59,102 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:49:59,159 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:49:59,224 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:49:59,226 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:00,269 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:00,730 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:01,157 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:06,916 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:06,917 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:06,917 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:06,918 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:09,106 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:09,106 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:18,271 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:20,924 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:23,151 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:33,792 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:33,793 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:44,161 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:52,489 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:50:57,903 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=25.640144MB (26885640), Free=382.29736MB (400867832), Max=407.9375MB (427753472), Counts: Blocks=1493, Access=31181, Hit=28954, Miss=2227, Evictions=0, Evicted=0, Ratios: Hit Ratio=92.85783171653748%, Miss Ratio=7.142169773578644%, Evicted/Run=NaN
>>>> 2010-11-09 23:50:57,996 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:51:31,922 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:51:31,923 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:51:31,924 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:51:57,903 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=28.028427MB (29389936), Free=379.90906MB (398363536), Max=407.9375MB (427753472), Counts: Blocks=1531, Access=31277, Hit=29008, Miss=2269, Evictions=0, Evicted=0, Ratios: Hit Ratio=92.74546504020691%, Miss Ratio=7.254531979560852%, Evicted/Run=NaN
>>>> 2010-11-09 23:52:57,903 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=31.233871MB (32751088), Free=376.7036MB (395002384), Max=407.9375MB (427753472), Counts: Blocks=1582, Access=31483, Hit=29168, Miss=2315, Evictions=0, Evicted=0, Ratios: Hit Ratio=92.64682531356812%, Miss Ratio=7.353174686431885%, Evicted/Run=NaN
>>>> 2010-11-09 23:53:57,903 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=34.532898MB (36210368), Free=373.4046MB (391543104), Max=407.9375MB (427753472), Counts: Blocks=1635, Access=31612, Hit=29246, Miss=2366, Evictions=0, Evicted=0, Ratios: Hit Ratio=92.5154983997345%, Miss Ratio=7.484499365091324%, Evicted/Run=NaN
>>>> 2010-11-09 23:54:21,831 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:54:57,903 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=37.375MB (39190528), Free=370.5625MB (388562944), Max=407.9375MB (427753472), Counts: Blocks=1681, Access=31761, Hit=29344, Miss=2417, Evictions=0, Evicted=0, Ratios: Hit Ratio=92.39003658294678%, Miss Ratio=7.609961926937103%, Evicted/Run=NaN
>>>> 2010-11-09 23:55:45,289 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:55:45,289 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:55:48,079 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
>>>> 2010-11-09 23:55:57,903 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=40.266388MB (42222368), Free=367.6711MB (385531104), Max=407.9375MB (427753472), Counts: Blocks=1728, Access=33834, Hit=31364, Miss=2470, Evictions=0, Evicted=0, Ratios: Hit Ratio=92.69965291023254%, Miss Ratio=7.300348579883575%, Evicted/Run=NaN
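For anyone finding this thread later: the 'xciever count' fix referenced at the top is normally applied in each datanode's hdfs-site.xml. The property name really is misspelled that way in Hadoop of this era. The value below is a commonly recommended setting for HBase clusters, not a value taken from this thread:

```xml
<!-- hdfs-site.xml on every datanode; requires a datanode restart.
     The default of 256 is far too low for an HBase workload. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```

It usually needs to be paired with a raised open-file limit for the user running the datanode (check with `ulimit -n`); a rough way to see how close you are to the limit is to count the datanode's DataXceiver threads, e.g. `jstack <datanode-pid> | grep -c DataXceiver`.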
