I can't reproduce it; I need to dig into the logs and code.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
Sent: Friday, May 13, 2011 1:08 AM
To: [email protected]
Subject: Re: Hmaster has some warn logs.
Seems the master had lots of problems talking to that node... can you
repro? If you jstack, are all the handlers filled?

J-D

2011/5/12 Gaojinchao <[email protected]>:
> Thank you for the reply.
>
> It seems the master had some problem, but I am not sure what's up.
> I am not familiar with the RPC code and need to keep digging.
>
> // the master connection was OK
> 2011-05-04 04:47:06,267 INFO org.apache.hadoop.hbase.master.CatalogJanitor:
> Scanned 378 catalog row(s) and gc'd 9 unreferenced parent region(s)
> 2011-05-04 04:52:05,833 INFO org.apache.hadoop.hbase.master.LoadBalancer:
> Skipping load balancing. servers=3 regions=371 average=123.666664
> mostloaded=124 leastloaded=124
>
> // the connection had some problem
> 2011-05-04 04:53:05,858 WARN org.apache.hadoop.hbase.master.CatalogJanitor:
> Failed scan of catalog table
> java.net.SocketTimeoutException: Call to /10.85.129.177:60020 failed on
> socket timeout exception: java.net.SocketTimeoutException: 60000 millis
> timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.85.129.177:42736
> remote=/10.85.129.177:60020]
>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>         at $Proxy6.getRegionInfo(Unknown Source)
>
> // this one also failed, hours later
> 2011-05-04 09:52:38,425 WARN org.apache.hadoop.hbase.master.CatalogJanitor:
> Failed scan of catalog table
> java.net.SocketTimeoutException: Call to /10.85.129.177:60020 failed on
> socket timeout exception: java.net.SocketTimeoutException: 60000 millis
> timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/10.85.129.177:50964
> remote=linux-88.site/10.85.129.177:60020]
>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>         at $Proxy6.getRegionInfo(Unknown Source)
>
> // the region server did nothing between 04:47 and 04:52
> 2011-05-04 04:43:08,888 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
> Roll /hbase/.logs/linux-88.site,60020,1304408517081/linux-88.site%3A60020.1304451798487,
> entries=1473, filesize=124253101. New hlog
> /hbase/.logs/linux-88.site,60020,1304408517081/linux-88.site%3A60020.1304455388698
> 2011-05-04 04:47:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115,
> accesses=526799, hits=37164, hitRatio=7.05%%, cachingAccesses=37279,
> cachingHits=37164, cachingHitsRatio=99.69%%, evictions=0, evicted=0,
> evictedPerRun=NaN
> 2011-05-04 04:52:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115,
> accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393,
> cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0,
> evictedPerRun=NaN
> 2011-05-04 04:57:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115,
> accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393,
> cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0,
> evictedPerRun=NaN
> 2011-05-04 05:02:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
> LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115,
> accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393,
> cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0,
> evictedPerRun=NaN
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
> Sent: Tuesday, May 10, 2011 6:17 AM
> To: [email protected]
> Subject: Re: Hmaster has some warn logs.
>
> Anything closer to 05:13:05 in the region server? That's an hour and a
> half before.
>
> J-D
>
> On Thu, May 5, 2011 at 6:03 PM, Gaojinchao <[email protected]> wrote:
>> Thanks for your reply.
>>
>> I got it. But the master and region server are on the same machine, and
>> the cluster is idle.
>>
>> In my scenario:
>> 1/ Before 2011-05-04 03:42, bulk data was being written and some IPC
>> handlers were blocked:
>>
>> 2011-05-04 03:42:39,155 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> Blocking updates for 'IPC Server handler 1 on 60020' on region
>> test1,2010010100#8613880051110#0000,1304451639350.dfbdc1cf590bfffb50ace4c7d116c524.:
>> memstore size 128.1m is >= than blocking 128.0m size
>> 2011-05-04 03:42:39,194 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> Blocking updates for 'IPC Server handler 6 on 60020' on region
>> test1,2010010100#8613880051110#0000,1304451639350.dfbdc1cf590bfffb50ace4c7d116c524.:
>> memstore size 128.4m is >= than blocking 128.0m size
>>
>> 2/ By 2011-05-04 05:13 the hbase client had exited and the cluster was idle:
>>
>> 2011-05-04 04:47:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
>> LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115,
>> accesses=526799, hits=37164, hitRatio=7.05%%, cachingAccesses=37279,
>> cachingHits=37164, cachingHitsRatio=99.69%%, evictions=0, evicted=0,
>> evictedPerRun=NaN
>> 2011-05-04 04:52:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
>> LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115,
>> accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393,
>> cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0,
>> evictedPerRun=NaN
>> 2011-05-04 04:57:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
>> LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115,
>> accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393,
>> cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0,
>> evictedPerRun=NaN
>> 2011-05-04 05:02:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
>> LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115,
>> accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393,
>> cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0,
>> evictedPerRun=NaN
>> 2011-05-04 05:07:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
>> LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115,
>> accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393,
>> cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0,
>> evictedPerRun=NaN
>> 2011-05-04 05:12:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
>> LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115,
>> accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393,
>> cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0,
>> evictedPerRun=NaN
>> 2011-05-04 05:17:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
>> LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115,
>> accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393,
>> cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0,
>> evictedPerRun=NaN
>> 2011-05-04 05:22:03,372 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
>> LRU Stats: total=2.56 MB, free=196.61 MB, max=199.18 MB, blocks=115,
>> accesses=526913, hits=37278, hitRatio=7.07%%, cachingAccesses=37393,
>> cachingHits=37278, cachingHitsRatio=99.69%%, evictions=0, evicted=0,
>> evictedPerRun=NaN
>>
>> 3/ The master had some warning logs, but the region server showed nothing abnormal.
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Jean-Daniel Cryans
>> Sent: Friday, May 6, 2011 1:27 AM
>> To: [email protected]
>> Subject: Re: Hmaster has some warn logs.
>>
>> The master was trying to talk to the region server at 10.85.129.177
>> but it took more than 60 seconds to get an answer.
>>
>> J-D
>>
>> On Thu, May 5, 2011 at 5:40 AM, Gaojinchao <[email protected]> wrote:
>>> In our test cluster there are a lot of socket-timeout logs.
>>> I tried to dig into them, but found nothing.
>>> Does anyone have a clue? Thanks.
>>>
>>> HBase version: 0.90.2
>>>
>>> Logs:
>>> 2011-05-04 05:13:05,856 WARN org.apache.hadoop.hbase.master.CatalogJanitor:
>>> Failed scan of catalog table
>>> java.net.SocketTimeoutException: Call to /10.85.129.177:60020 failed on
>>> socket timeout exception: java.net.SocketTimeoutException: 60000 millis
>>> timeout while waiting for channel to be ready for read. ch :
>>> java.nio.channels.SocketChannel[connected local=/10.85.129.177:54980
>>> remote=/10.85.129.177:60020]
>>>         at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:802)
>>>         at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
>>>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>>>         at $Proxy6.getRegionInfo(Unknown Source)
>>>         at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyRegionLocation(CatalogTracker.java:424)
>>>         at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:272)
>>>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:331)
>>>         at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnectionDefault(CatalogTracker.java:364)
>>>         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:255)
>>>         at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:237)
>>>         at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:116)
>>>         at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:85)
>>>         at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>>> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while
>>> waiting for channel to be ready for read. ch :
>>> java.nio.channels.SocketChannel[connected local=/10.85.129.177:54980
>>> remote=/10.85.129.177:60020]
>>>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
>>>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>>         at java.io.FilterInputStream.read(FilterInputStream.java:116)
>>>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:299)
>>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>         at java.io.DataInputStream.readInt(DataInputStream.java:370)
>>>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
>>>         at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
>>
>
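
J-D's "if you jstack, are all the handlers filled?" question above can be answered mechanically from a thread dump. A rough sketch of that check (the parsing code and the sample dump are illustrative assumptions; only the "IPC Server handler N on 60020" thread names come from the logs in this thread):

```python
import re

# Matches HotSpot thread-dump header lines for HBase RPC handler threads,
# e.g.: "IPC Server handler 1 on 60020" daemon prio=10 tid=... nid=...
HANDLER_RE = re.compile(r'^"(IPC Server handler \d+ on \d+)"')

def handler_states(jstack_dump: str) -> dict:
    """Map each IPC handler thread name to its java.lang.Thread.State."""
    states = {}
    current = None
    for line in jstack_dump.splitlines():
        m = HANDLER_RE.match(line)
        if m:
            current = m.group(1)
        elif current and "java.lang.Thread.State:" in line:
            # e.g. "   java.lang.Thread.State: WAITING (parking)"
            states[current] = line.split("java.lang.Thread.State:")[1].split()[0]
            current = None
    return states

# Tiny made-up dump with one idle and one busy handler, for illustration.
SAMPLE_DUMP = '''\
"IPC Server handler 0 on 60020" daemon prio=10 tid=0x1 nid=0x1 waiting on condition
   java.lang.Thread.State: WAITING (parking)

"IPC Server handler 1 on 60020" daemon prio=10 tid=0x2 nid=0x2 runnable
   java.lang.Thread.State: RUNNABLE
'''

if __name__ == "__main__":
    states = handler_states(SAMPLE_DUMP)
    busy = sum(1 for s in states.values() if s != "WAITING")
    # If every handler is non-WAITING, the pool is saturated ("filled")
    # and new RPCs such as the master's getRegionInfo calls will queue
    # until they hit the 60000 ms socket timeout seen in the logs.
    print(f"{busy}/{len(states)} handlers busy")  # -> 1/2 handlers busy
```

In practice one would feed this the output of `jstack <regionserver-pid>` taken while the CatalogJanitor warnings are occurring.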
