That is not a 0.94 vs 1.0 issue - it's a BucketCache vs LRUCache issue. It seems that BucketCache.freeBlock is expensive (and a source of contention). That is the issue. Please open a JIRA.
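To illustrate the shape of the problem, here is a toy model (not the actual HBase code; the class names, map layout and block count below are made up for illustration). On region close, each store file is closed with evict-on-close, so the block cache has to drop every cached block of that file one at a time, and with BucketCache each of those frees goes back through the allocator under synchronization, where it competes with concurrent reads and cache insertions. For a region with a lot of cached blocks that serialized path can dominate the close time:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of per-block eviction on store-file close; NOT the real BucketCache. */
public class EvictOnCloseSketch {

  /** Stand-in for the bucket allocator: every free takes the same lock. */
  static final class ToyAllocator {
    private long freed;
    synchronized void freeBlock(long offset) {  // contended with concurrent reads/caching
      freed++;                                  // the real freeBlock also updates bucket metadata
    }
  }

  static final class ToyBucketCache {
    private final Map<String, Long> backingMap = new ConcurrentHashMap<>(); // blockKey -> offset
    private final ToyAllocator allocator = new ToyAllocator();

    void cacheBlock(String hfileName, int blockIndex, long offset) {
      backingMap.put(hfileName + "#" + blockIndex, offset);
    }

    /** Evict every cached block of one HFile: one freeBlock call per block. */
    int evictBlocksByHfileName(String hfileName) {
      int evicted = 0;
      for (Map.Entry<String, Long> e : backingMap.entrySet()) {
        if (e.getKey().startsWith(hfileName + "#")) {
          allocator.freeBlock(e.getValue());
          backingMap.remove(e.getKey());
          evicted++;
        }
      }
      return evicted;
    }
  }

  public static void main(String[] args) {
    ToyBucketCache cache = new ToyBucketCache();
    int blocks = 2_000_000;  // arbitrary large number of cached blocks for one store file
    for (int i = 0; i < blocks; i++) {
      cache.cacheBlock("storefile-1", i, i * 65536L);
    }
    long start = System.nanoTime();
    int evicted = cache.evictBlocksByHfileName("storefile-1");
    System.out.printf("evicted %d blocks in %d ms%n",
        evicted, (System.nanoTime() - start) / 1_000_000);
  }
}

With the LRU cache the per-block free is a cheap in-heap map removal; with BucketCache each free has to return space to the allocator, which is where the contention Randy is seeing would come from.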
-Vlad

On Thu, Oct 15, 2015 at 11:29 AM, Randy Fox <[email protected]> wrote:

> Here is the region server log distilled down to events pertinent to the
> region move. Not much being logged while the Store is being closed.
> This region was about 1.4G compressed, which I consider small.
>
> 2015-10-15 08:34:28,510 INFO org.apache.hadoop.hbase.regionserver.RSRpcServices: Close dad6c71ed395df19220ef1056a110086, moving to hb17.prod1.connexity.net,60020,1444425924021
> 2015-10-15 08:34:28,510 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
> 2015-10-15 08:34:28,511 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Closing Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.: disabling compactions & flushes
> 2015-10-15 08:34:28,511 INFO org.apache.hadoop.hbase.regionserver.HRegion: Running close preflush of Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
> 2015-10-15 08:34:28,511 INFO org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086., current region memstore size 44.89 MB, and 1/1 column families' memstores are being flushed.
> 2015-10-15 08:34:28,511 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: Couldn't find oldest seqNum for the region we are about to flush: [dad6c71ed395df19220ef1056a110086]
> 2015-10-15 08:34:29,137 INFO org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed, sequenceid=70839856845, memsize=44.9 M, hasBloomFilter=false, into tmp file hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/.tmp/3600b47839a945de9733cf17581458e0
> 2015-10-15 08:34:29,144 DEBUG org.apache.hadoop.hbase.regionserver.HRegionFileSystem: Committing store file hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/.tmp/3600b47839a945de9733cf17581458e0 as hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/L/3600b47839a945de9733cf17581458e0
> 2015-10-15 08:34:29,150 INFO org.apache.hadoop.hbase.regionserver.HStore: Added hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/L/3600b47839a945de9733cf17581458e0, entries=190131, sequenceid=70839856845, filesize=3.2 M
> 2015-10-15 08:34:29,151 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~44.89 MB/47066832, currentsize=22.39 KB/22928 for region Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086. in 640ms, sequenceid=70839856845, compaction requested=false
> 2015-10-15 08:34:29,152 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Updates disabled for region Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
> 2015-10-15 08:34:29,152 INFO org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086., current region memstore size 22.39 KB, and 1/1 column families' memstores are being flushed.
> 2015-10-15 08:34:29,152 WARN org.apache.hadoop.hbase.regionserver.wal.FSHLog: Couldn't find oldest seqNum for the region we are about to flush: [dad6c71ed395df19220ef1056a110086]
> 2015-10-15 08:34:29,225 INFO org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed, sequenceid=70839856876, memsize=22.4 K, hasBloomFilter=false, into tmp file hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/.tmp/a8a74549c7a44174b105cbed23cfaea1
> 2015-10-15 08:34:29,231 DEBUG org.apache.hadoop.hbase.regionserver.HRegionFileSystem: Committing store file hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/.tmp/a8a74549c7a44174b105cbed23cfaea1 as hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/L/a8a74549c7a44174b105cbed23cfaea1
> 2015-10-15 08:34:29,279 INFO org.apache.hadoop.hbase.regionserver.HStore: Added hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/L/a8a74549c7a44174b105cbed23cfaea1, entries=128, sequenceid=70839856876, filesize=2.6 K
> 2015-10-15 08:34:29,280 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~22.39 KB/22928, currentsize=0 B/0 for region Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086. in 128ms, sequenceid=70839856876, compaction requested=true
>
> 2015-10-15 08:37:02,945 INFO org.apache.hadoop.hbase.regionserver.HStore: Closed L
> 2015-10-15 08:37:02,954 DEBUG org.apache.hadoop.hbase.wal.WALSplitter: Wrote region seqId=hdfs://woz/hbase/data/default/Wildfire_graph3/dad6c71ed395df19220ef1056a110086/recovered.edits/70839856879.seqid to file, newSeqId=70839856879, maxSeqId=70838935131
> 2015-10-15 08:37:02,955 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
> 2015-10-15 08:37:02,955 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Adding moved region record: dad6c71ed395df19220ef1056a110086 to hb17.prod1.connexity.net,60020,1444425924021 as of 70838935131
> 2015-10-15 08:37:02,956 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x45049a3c1b80a8b, quorum=ccm2.prod1.connexity.net:2181,ccm1.prod1.connexity.net:2181,db2.prod1.connexity.net:2181,sql2.prod1.connexity.net:2181,ccm3.prod1.connexity.net:2181, baseZNode=/hbase Transitioning dad6c71ed395df19220ef1056a110086 from M_ZK_REGION_CLOSING to RS_ZK_REGION_CLOSED
> 2015-10-15 08:37:02,967 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x45049a3c1b80a8b, quorum=ccm2.prod1.connexity.net:2181,ccm1.prod1.connexity.net:2181,db2.prod1.connexity.net:2181,sql2.prod1.connexity.net:2181,ccm3.prod1.connexity.net:2181, baseZNode=/hbase Transitioned node dad6c71ed395df19220ef1056a110086 from M_ZK_REGION_CLOSING to RS_ZK_REGION_CLOSED
> 2015-10-15 08:37:02,967 DEBUG org.apache.hadoop.hbase.coordination.ZkCloseRegionCoordination: Set closed state in zk for Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086. on hb18.prod1.connexity.net,60020,1444427168636
> 2015-10-15 08:37:02,967 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Closed Wildfire_graph3,\x00)"y\x1B\xF0\x13-,1434402364692.dad6c71ed395df19220ef1056a110086.
> ~
>
>
> On 10/15/15, 11:05 AM, "Esteban Gutierrez" <[email protected]> wrote:
>
> >Hi Randy,
> >
> >As Ted suggested, a couple of stack traces would be helpful, but if you could
> >attach the HMaster and a couple of RS logs where you see this problem, that
> >would be great. Also, do you have large regions that might need some time
> >to flush? Have you looked into any potential problem in HDFS?
> >
> >thanks,
> >esteban.
> >
> >--
> >Cloudera, Inc.
> >
> >
> >On Thu, Oct 15, 2015 at 10:39 AM, Ted Yu <[email protected]> wrote:
> >
> >> Can you give a bit more detail on why block eviction was the cause of the
> >> slow region movement?
> >>
> >> Did you happen to take stack traces?
> >>
> >> Thanks
> >>
> >> > On Oct 15, 2015, at 10:32 AM, Randy Fox <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >
> >> > We just upgraded from 0.94 to 1.0.0 and have noticed that region moves
> >> are super slow (on the order of minutes), whereas previously they were in
> >> the seconds range. After looking at the code, I think the time is spent
> >> waiting for the blocks to be evicted from the block cache.
> >> >
> >> > I wanted to verify that this theory is correct and see if there is
> >> anything that can be done to speed up the moves.
> >> >
> >> > This is particularly painful as we are trying to get our configs tuned to
> >> the new SW and need to do rolling restarts, which is taking almost 24 hours
> >> on our cluster. We also do our own manual rebalancing of regions across
> >> RSs, and that task is also now painful.
> >> >
> >> >
> >> > Thanks,
> >> >
> >> > Randy
> >> >
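On Randy's question above about speeding up the moves: if eviction on close is confirmed to be the bottleneck, one knob worth looking at is the evict-blocks-on-close setting. Below is a hedged sketch only - the config key and the HColumnDescriptor/Admin calls are from memory of the 1.0 API and should be verified against your build, and disabling eviction leaves dead blocks in the cache until they age out, so treat it as an experiment rather than a fix.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class EvictOnCloseTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Server-side switch read by CacheConfig on each region server; goes in
    // hbase-site.xml and needs an RS restart to take effect (key from memory,
    // please verify): hbase.rs.evictblocksonclose = false

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      TableName table = TableName.valueOf("Wildfire_graph3");
      // Per-family override of the same behavior; fetch the existing descriptor
      // so the other family attributes are preserved.
      HTableDescriptor htd = admin.getTableDescriptor(table);
      HColumnDescriptor cf = htd.getFamily(Bytes.toBytes("L"));
      cf.setEvictBlocksOnClose(false);
      admin.modifyColumn(table, cf);  // schema change; apply with care in prod
    }
  }
}

Whether this actually bypasses the slow freeBlock path in 1.0.0, or merely defers it, is exactly what the JIRA should determine.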
