Hi Otis, thanks for the reply. The servers are identical in terms of hardware and JVM. Right now I cannot afford to restart any of my machines, since they are in the production environment :D. I will give it a shot on some other clusters some time later.
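In the meantime, a rough way to compare servers without installing anything is to parse the region server metrics dump (quoted below) and derive a few numbers from it. The following is only a minimal Python sketch: the helper names `parse_metrics` and `hit_ratio` are hypothetical, the sample values are copied from the dump below, and the assumption that each reported ratio is hits / (hits + misses) truncated to a whole percent is mine, although it does reproduce the reported 94, 99 and 86.

```python
def parse_metrics(dump):
    """Parse a 'key=value, key=value' metrics dump into a dict of floats."""
    metrics = {}
    for pair in dump.split(","):
        pair = pair.strip()
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        key, value = pair.split("=", 1)
        metrics[key.strip()] = float(value)
    return metrics


def hit_ratio(hits, misses):
    """Hits as a percentage of all lookups, truncated to a whole number."""
    total = hits + misses
    return int(100 * hits / total) if total else 0


# Values copied from the region server metrics in the quoted message.
sample = (
    "usedHeap=10217, maxHeap=15872, "
    "blockCacheHitCount=36328173832, blockCacheMissCount=2133761468, "
    "metaBlockCacheHitCount=23066398881, metaBlockCacheMissCount=219363, "
    "dataBlockCacheHitCount=13261774951, dataBlockCacheMissCount=2133542105"
)

m = parse_metrics(sample)
overall = hit_ratio(m["blockCacheHitCount"], m["blockCacheMissCount"])
meta = hit_ratio(m["metaBlockCacheHitCount"], m["metaBlockCacheMissCount"])
data = hit_ratio(m["dataBlockCacheHitCount"], m["dataBlockCacheMissCount"])
heap_used_pct = int(100 * m["usedHeap"] / m["maxHeap"])

print("block cache hit ratio:", overall)
print("meta block hit ratio:", meta)
print("data block hit ratio:", data)
print("heap used (%):", heap_used_pct)
```

Running the same script against the metrics line of a high-load server and a normal one would give a quick side-by-side view of cache hit ratios and heap usage.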

On Wed, Sep 26, 2012 at 11:50 AM, Otis Gospodnetic <[email protected]> wrote:
> Hi,
>
> And the servers are identical in terms of hardware, JVM, etc.?
> What does your monitoring tool show is different on problematic machines, other than the load? How's JVM heap and GC looking? Have a look at SPM for HBase (see sig). You could install the agent/client on the problematic servers + a couple non-problematic servers and visually compare a bunch of different HBase/JVM/OS metrics.
>
> HTH
> Otis
> --
> Search Analytics - http://sematext.com/search-analytics/index.html
> Performance Monitoring - http://sematext.com/spm/index.html
>
> On Tue, Sep 25, 2012 at 11:45 PM, Yusup Ashrap <[email protected]> wrote:
> > Hi all, I am new to HBase. I have a cluster of 30+ nodes running in the production environment.
> > The problem is that several nodes of the cluster suffer from occasional high loads (30-60), two or three times a day, and it happens every day.
> > It's kind of an emergency situation for me now, and I don't have much time to dive into HBase to figure out what's wrong with my cluster.
> > I hope someone can help me point out what is or could be wrong with my cluster, or suggest steps for finding the problem. Thanks.
> >
> > *hbase version*: 0.90.2, r
> > *region server metrics*:
> > request=3256.1, readRequest=26313.0, readDataSize=2173227.0,
> > failedReadRequest=0.0, writeRequest=6253.0, writeDataSize=4273037.0,
> > failedWriteRequest=0.0, readResponseTime=29935.0, writeResponseTime=9298.0,
> > memStoreHitCount=0.0, memStoreMissCount=16413.0, rpcRequestCount=31877.0,
> > rpcRequestTime=790.0, aliveHandlerNum=200.0, aliveReaderNum=10.0,
> > handlerQueueSize=0.0, regions=338, stores=426, storefiles=354,
> > storefileSize=108349, storefileIndexSize=1116, memstoreSize=1550,
> > compactionQueueSize=0, flushQueueSize=0, usedHeap=10217, maxHeap=15872,
> > blockCacheSize=5122994416, blockCacheFree=1534205200,
> > blockCacheCount=71481, blockCacheHitCount=36328173832,
> > metaBlockCacheHitCount=23066398881, dataBlockCacheHitCount=13261774951,
> > blockCacheMissCount=2133761468, metaBlockCacheMissCount=219363,
> > dataBlockCacheMissCount=2133542105, blockCacheEvictedCount=1634548925,
> > blockCacheHitRatio=94, metaBlockCacheHitRatio=99,
> > dataBlockCacheHitRatio=86, blockCacheHitCachingRatio=95,
> > hdfsBlocksLocalityIndex=86
> >
> > Here is the log from one of my region servers (load avg 30+):
> >
> > 2012-09-26 11:04:50,185 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://hdfs_xxx:9516/hbase/table_a/232d346a258d160bc61618bafe91f047/info/865605993795928725, entries=79472, sequenceid=11417919090, memsize=21.7m, filesize=1.2m
> > 2012-09-26 11:04:50,187 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~21.7m for region table_a,10396613269\x01NLKMLMHFGFOOO\x01NLKMLMHFGFLFM\x01AntispamPunisher\x01011\x01000000,1345016652050.232d346a258d160bc61618bafe91f047. in 752ms, sequenceid=11417919090, compaction requested=true
> > 2012-09-26 11:04:50,187 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for table_a,10396613269\x01NLKMLMHFGFOOO\x01NLKMLMHFGFLFM\x01AntispamPunisher\x01011\x01000000,1345016652050.232d346a258d160bc61618bafe91f047. because regionserver60020.cacheFlusher; priority=16, compaction queue size=19
> > 2012-09-26 11:05:01,431 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 634.93 MB of total=5.27 GB
> > 2012-09-26 11:05:01,505 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=634.96 MB, total=4.65 GB, single=1.43 GB, multi=3.79 GB, memory=0 KB
> > 2012-09-26 11:05:01,544 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> > 2012-09-26 11:05:21,256 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 634.92 MB of total=5.27 GB
> > 2012-09-26 11:05:21,333 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=634.94 MB, total=4.65 GB, single=1.43 GB, multi=3.79 GB, memory=0 KB
> > 2012-09-26 11:05:40,451 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 634.93 MB of total=5.27 GB
> > 2012-09-26 11:05:40,515 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=634.99 MB, total=4.65 GB, single=1.42 GB, multi=3.8 GB, memory=0 KB
> > 2012-09-26 11:05:59,978 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 634.93 MB of total=5.27 GB
> > 2012-09-26 11:06:00,051 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=634.99 MB, total=4.65 GB, single=1.42 GB, multi=3.8 GB, memory=0 KB
> > 2012-09-26 11:06:07,836 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
> > 2012-09-26 11:06:18,443 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 634.91 MB of total=5.27 GB
> > 2012-09-26 11:06:18,505 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=634.93 MB, total=4.65 GB, single=1.43 GB, multi=3.79 GB, memory=0 KB
> > 2012-09-26 11:06:38,674 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction started; Attempting to free 634.89 MB of total=5.27 GB
> > 2012-09-26 11:06:38,739 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction completed; freed=634.93 MB, total=4.65 GB, single=1.43 GB, multi=3.79 GB, memory=0 KB
> > 2012-09-26 11:06:50,616 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200
> > 2012-09-26 11:06:50,630 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/my226021.cm4,60020,1333724247105/my226021.cm4%3A60020.1348628673307, entries=81412, filesize=63758937. New hlog /hbase/.logs/my226021.cm4,60020,1333724247105/my226021.cm4%3A60020.1348628810604
> > 2012-09-26 11:06:50,659 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=33, maxlogs=32; forcing flush of 45 regions(s):
> > 016138b1500faade07bc92694f222dd0, 064d4407646b317e7cfd4c35efa0ecfd,
> > 07a185ab3fd56c75ee51dadc75911d55, 0842a4e65aea6d65c1d3ce4425ab8ef0,
> > 0cfafa661d1819b4e81a4729c89df010, 1c435cddad229f4bf8283f4642cb5f0d,
> > 293f2e6c4925f65e370015784a1392a9, 363e160b42f4025a58f074260f1161ce,
> > 37e28e04133dfd3a74a6acb7ee5a45f8, 38766d1c77e373fb476fa9854443531c,
> > 3f4a868b7233ec6fe8be88e3c2ca73ba, 45cf01b2f8744b033087281308862859,
> > 468933ece8913bd7e06c2a6cbd9dc5e6, 480cc150655848e55217b3c63c2ea749,
> > 4c4c7beee4eeb8e3d4d1e689aa16ae39, 5045e2289406a2ebe6fda7d15cd64e4d,
> > 519a99c62386004dc6fb22308c6b0ac0, 578a6c0d02b8cc0807e60b452c1e5add,
> > 5c130a1e57b5dc85b40bc7dd10b4aa9b, 6259605302af1b5ef3accd2c34aeecd7,
> > 685370c4349d78fd04efbe071183cc0e, 69080c7e1d14fefccc5e106e955e405d,
> > 6d2d97fcef2886ce2936cbf611648de7, 855e3f6a7eaa27b6a6952a33c06f47c9,
> > 875b26219f934896756b3f80fd7e7a75, 8fd673d4475a0294845a8681253432ff,
> > 9993f31115e8e59450f833771b03a95e, a0cb8dd41fa75f46fc9917c53f7b3b92,
> > a6b0b04c13f22dd20b5fd5c3a0db5270, a8d401355719a12d9f7d2881a4be9fc8,
> > ad6625d0afe0199eb7ba05c07f0a87a3, b845d168b3ab52fe5ae9e2e4366098fc,
> > b9f1139ea5a2e9b512a3e1a173212d70, bec996064a36cc306af18f771965679c,
> > c744a27faf49953ea5aa17708e4aefdd, cab712596aeaa41fc6465696ed4ad74a,
> > cf83ec10220cd516ba55a0b2a5c2f711, d00c2663772b4f453a4c946032de66df,
> > d7e596ea520902f4ed35ce28449a0593, d9058055eab307828c4bb41fab664ab5,
> > e43b26752173becfdffbbc75c70f849e, ecc4925b7791123cc452a39c542cbf9f,
> > efbb1004f1b68c98350fbe38b343a308, f1288916c8b877ccf6cac252e1af0a18,
> > fc5446f5a44ca115ba63a21dca420c64
> > 2012-09-26 11:06:50,659 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for table_a,10904703985\x01NLKMHOJIOHOOO\x01NLKMHOJIOHHIM\x01PriceDispPunisher\x01031\x01000000,1345202033439.016138b1500faade07bc92694f222dd0., current region memstore size 24.9m
> > 2012-09-26 11:06:50,659 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting, commencing flushing stores
> > 2012-09-26 11:06:51,389 INFO org.apache.hadoop.hbase.regionserver.StoreFile: Bloom added to HFile (hdfs://hdfs_xxx:9516/hbase/table_a/016138b1500faade07bc92694f222dd0/.tmp/4855940649973018735): 13.4k, 5703/11406 (50%)
> > 2012-09-26 11:06:52,055 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://hdfs_xxx:9516/hbase/table_a/016138b1500faade07bc92694f222dd0/.tmp/4855940649973018735 to hdfs://hdfs_xxx:9516/hbase/table_a/016138b1500faade07bc92694f222dd0/info/6325203002828183467
> > 2012-09-26 11:06:52,070 INFO org.apache.hadoop.hbase.regionserver.StoreFile$Reader: Loaded row bloom filter metadata for hdfs://hdfs_xxx:9516/hbase/table_a/016138b1500faade07bc92694f222dd0/info/6325203002828183467
> > 2012-09-26 11:06:52,070 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://hdfs_xxx:9516/hbase/table_a/016138b1500faade07bc92694f222dd0/info/6325203002828183467, entries=91248, sequenceid=11417991467, memsize=24.9m, filesize=1.4m
> > 2012-09-26 11:06:52,071 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~24.9m for region table_a,10904703985\x01NLKMHOJIOHOOO\x01NLKMHOJIOHHIM\x01PriceDispPunisher\x01031\x01000000,1345202033439.016138b1500faade07bc92694f222dd0. in 1412ms, sequenceid=11417991467, compaction requested=true
> > 2012-09-26 11:06:52,071 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for table_a,10904703985\x01NLKMHOJIOHOOO\x01NLKMHOJIOHHIM\x01PriceDispPunisher\x01031\x01000000,1345202033439.016138b1500faade07bc92694f222dd0. because regionserver60020.cacheFlusher; priority=17, compaction queue size=20
> > 2012-09-26 11:06:52,071 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for table_a,14095325123\x01NLKLKJFHIOOOO\x01NLKLKJFHIOHII\x01AntispamPunisher\x01001\x01000000,1345106676203.064d4407646b317e7cfd4c35efa0ecfd., current region memstore size 18.0m
> > 2012-09-26 11:06:52,071 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Finished snapshotting, commencing flushing stores
> > 2012-09-26 11:06:52,599 INFO org.apache.hadoop.hbase.regionserver.StoreFile: Bloom added to HFile (hdfs://hdfs_xxx:9516/hbase/table_a/064d4407646b317e7cfd4c35efa0ecfd/.tmp/3436108128903667477): 9.7k, 4124/8248 (50%)
> > 2012-09-26 11:06:52,622 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://hdfs_xxx:9516/hbase/table_a/064d4407646b317e7cfd4c35efa0ecfd/.tmp/3436108128903667477 to hdfs://hdfs_xxx:9516/hbase/table_a/064d4407646b317e7cfd4c35efa0ecfd/info/2027379172809216464
> > 2012-09-26 11:06:52,637 INFO org.apache.hadoop.hbase.regionserver.StoreFile$Reader: Loaded row bloom filter metadata for hdfs://hdfs_xxx:9516/hbase/table_a/064d4407646b317e7cfd4c35efa0ecfd/info/2027379172809216464
> > 2012-09-26 11:06:52,637 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://hdfs_xxx:9516/hbase/table_a/064d4407646b317e7cfd4c35efa0ecfd/info/2027379172809216464, entries=65984, sequenceid=11417992318, memsize=18.0m, filesize=1.0m

--
*Best Regards*
*===================*
*Yusup Ashrap*
*cell:18611205204*
*--------------------------------------*
*do or don't, there is no try.*
*--------------------------------------*
