After a major compaction, the references for the above-mentioned regions were freed; the merge_region command then succeeded and the regions got merged. Hmmm.
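For reference, the fix that emerged here (don't poll the region count forever after an async merge; time out, run a recovery step such as a major compaction, and retry) can be sketched as a small helper. This is a minimal sketch, not the actual application code: the HBase calls are abstracted behind suppliers/runnables so the logic runs without a cluster, and the names `awaitMerge`, `regionCount`, and `recover` are illustrative, not real HBase APIs.

```java
// Sketch of the polling fix discussed in this thread: after an async
// HBaseAdmin.mergeRegions() call, poll the region count with a bound
// instead of forever, and run a recovery action (e.g. trigger a major
// compaction and re-issue the merge) if the count doesn't drop.
// All names here are illustrative; the HBase calls are abstracted out.
import java.util.function.IntSupplier;

public final class MergeWatcher {

    /**
     * Polls regionCount until it drops below expectedBefore or maxPolls
     * is reached. Returns true if the merge was observed. If half the
     * polls elapse with no change, runs the recovery action once
     * (e.g. admin.majorCompact(table) followed by retrying the merge).
     */
    public static boolean awaitMerge(IntSupplier regionCount,
                                     int expectedBefore,
                                     int maxPolls,
                                     Runnable recover) {
        boolean recovered = false;
        for (int i = 0; i < maxPolls; i++) {
            if (regionCount.getAsInt() < expectedBefore) {
                return true; // merge completed: region count dropped
            }
            if (!recovered && i >= maxPolls / 2) {
                recover.run(); // e.g. major_compact 'TABLE_NAME', then retry
                recovered = true;
            }
        }
        return false; // give up instead of hanging forever
    }

    public static void main(String[] args) {
        // Simulated cluster: the region count drops from 10 to 9 on the
        // sixth poll, i.e. the merge eventually completes.
        int[] polls = {0};
        boolean merged = awaitMerge(
            () -> polls[0]++ < 5 ? 10 : 9,
            10, 20,
            () -> System.out.println("recovery: major compaction + retry merge"));
        System.out.println("merged=" + merged); // prints merged=true
    }
}
```

The key difference from the polling loop described later in the thread is the bound on iterations and the single recovery attempt, so a region stuck with a merge qualifier can't hang the application all night.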
Regards,
Shahab

On Fri, Nov 14, 2014 at 2:08 PM, Shahab Yunus <[email protected]> wrote:

> Digging deeper into the code, I came across this (this is from
> CatalogJanitor#cleanMergeRegion):
>
>     ...
>     HFileArchiver.archiveRegion(this.services.getConfiguration(), fs, regionA);
>     HFileArchiver.archiveRegion(this.services.getConfiguration(), fs, regionB);
>     MetaEditor.deleteMergeQualifiers(server.getCatalogTracker(), mergedRegion);
>     return true;
>
> Do you think it is OK, if we face this issue, to forcibly archive and
> clean the regions?
>
> Regards,
> Shahab
>
> On Fri, Nov 14, 2014 at 1:10 PM, Shahab Yunus <[email protected]> wrote:
>
>> Yesterday, I believe.
>>
>> Regards,
>> Shahab
>>
>> On Fri, Nov 14, 2014 at 1:07 PM, Ted Yu <[email protected]> wrote:
>>
>>> Shahab:
>>> When was the last time compaction was run on this table?
>>>
>>> Cheers
>>>
>>> On Fri, Nov 14, 2014 at 9:58 AM, Shahab Yunus <[email protected]> wrote:
>>>
>>>> I see. Thanks.
>>>>
>>>> And if the region indeed has references, can we somehow forcibly
>>>> remove them? Is this even possible (if not advisable)? Basically,
>>>> what I am asking is: suppose we do hit this scenario and we know it
>>>> is OK to go ahead and merge. What steps can we follow after
>>>> detecting such unwanted references?
>>>>
>>>> Regards,
>>>> Shahab
>>>>
>>>> On Fri, Nov 14, 2014 at 12:50 PM, Ted Yu <[email protected]> wrote:
>>>>
>>>>> For automated detection of such a scenario, you can reference the
>>>>> code in CatalogJanitor#cleanMergeRegion():
>>>>>
>>>>>     regionFs = HRegionFileSystem.openRegionFromFileSystem(
>>>>>         this.services.getConfiguration(), fs, tabledir, mergedRegion, true);
>>>>>     ...
>>>>>
>>>>> Then regionFs.hasReferences(htd) would tell you whether the
>>>>> underlying region has reference files.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Fri, Nov 14, 2014 at 9:39 AM, Shahab Yunus <[email protected]> wrote:
>>>>>
>>>>>> No.
>>>>>> Not that I can recall, but I can check.
>>>>>>
>>>>>> From a resolution perspective, is there any way we can resolve
>>>>>> this? More importantly, is there any way we can automate the
>>>>>> resolution if we run into such issues in the future? 'Cleaning the
>>>>>> qualifier', that is.
>>>>>>
>>>>>> Regards,
>>>>>> Shahab
>>>>>>
>>>>>> On Fri, Nov 14, 2014 at 12:12 PM, Ted Yu <[email protected]> wrote:
>>>>>>
>>>>>>> One possibility was that region 7373f75181c71eb5061a6673cee15931
>>>>>>> was involved in some HBase snapshot.
>>>>>>>
>>>>>>> Was the underlying table snapshotted in the recent past?
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Fri, Nov 14, 2014 at 9:05 AM, Shahab Yunus <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks again.
>>>>>>>>
>>>>>>>> But I have been polling for a while and it still doesn't merge.
>>>>>>>> I mean this particular region example that I sent you: I have
>>>>>>>> been trying to merge it since yesterday. I ran the polling-based
>>>>>>>> code all night and had to kill it. Then in the morning, I tried
>>>>>>>> manual merging through the hbase shell and it still doesn't
>>>>>>>> merge. Note that the current polling logic does not try to call
>>>>>>>> merge again; it just checks the region size.
>>>>>>>>
>>>>>>>> So how do we clean it then? Or actually make it merge? Also, is
>>>>>>>> this something expected (a region keeping a reference)? How can
>>>>>>>> we avoid it?
>>>>>>>>
>>>>>>>> Note that this is not limited to this table only. We are seeing
>>>>>>>> this in other regions of other tables as well. Are we merging
>>>>>>>> too fast?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Shahab
>>>>>>>>
>>>>>>>> On Fri, Nov 14, 2014 at 11:58 AM, Ted Yu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Polling as you described is fine.
>>>>>>>>>
>>>>>>>>> catalogJanitor.cleanMergeQualifier() is called by
>>>>>>>>> DispatchMergingRegionHandler.
>>>>>>>>>
>>>>>>>>> If the clean was successful, you would see the following:
>>>>>>>>>
>>>>>>>>>     LOG.debug("Deleting region " + regionA.getRegionNameAsString() + " and "
>>>>>>>>>         + regionB.getRegionNameAsString()
>>>>>>>>>         + " from fs because merged region no longer holds references");
>>>>>>>>>
>>>>>>>>> Assuming the following log was not in your master log:
>>>>>>>>>
>>>>>>>>>     LOG.error("Merged region " + region.getRegionNameAsString()
>>>>>>>>>         + " has only one merge qualifier in META.");
>>>>>>>>>
>>>>>>>>> it would be the case that 7373f75181c71eb5061a6673cee15931 still
>>>>>>>>> had a reference file.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> On Fri, Nov 14, 2014 at 8:35 AM, Shahab Yunus <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Ted.
>>>>>>>>>>
>>>>>>>>>> The log bit is below at the end of the email. This is the merge
>>>>>>>>>> command that I issued just now through the hbase shell.
>>>>>>>>>> forcible was false, but it behaves similarly if forcible is
>>>>>>>>>> true too. This is from the master log. Indeed, the region
>>>>>>>>>> merging was skipped! What does this mean? Data seems to be
>>>>>>>>>> intact for this table.
>>>>>>>>>>
>>>>>>>>>> Just to give you some background: this table was first merged
>>>>>>>>>> by the automated Java application. What we are doing is merging
>>>>>>>>>> regions programmatically. As the HBaseAdmin.mergeRegions call
>>>>>>>>>> is async, we poll for the number of regions getting lowered
>>>>>>>>>> after this merge call. The application hangs and continues
>>>>>>>>>> polling forever, as the previous merge didn't happen.
>>>>>>>>>>
>>>>>>>>>> In this poll loop, we get the number of regions by a fresh call
>>>>>>>>>> to HBaseAdmin.getTableRegions(tableName).size().
>>>>>>>>>>
>>>>>>>>>> What are these merge qualifiers, and what are we doing wrong,
>>>>>>>>>> or what should we do?
>>>>>>>>>>
>>>>>>>>>> In the polling loop, can we somehow retry the merge? But how
>>>>>>>>>> can we know that we need to call merge again, as it works for
>>>>>>>>>> some regions? Is the table meta corrupted for some reason by
>>>>>>>>>> the above logic?
>>>>>>>>>>
>>>>>>>>>> Thanks a lot.
>>>>>>>>>>
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> 2014-11-14 11:25:02,643 INFO org.apache.zookeeper.ZooKeeper: Session: 0x348c7017707236b closed
>>>>>>>>>> 2014-11-14 11:25:02,643 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
>>>>>>>>>> 2014-11-14 11:25:02,645 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=ip-1010019.ec2.internal:2181,ip-1010017.ec2.internal:2181,ip-1010018.ec2.internal:2181 sessionTimeout=60000 watcher=catalogtracker-on-hconnection-0x47d865f2, quorum=ip-1010019.ec2.internal:2181,ip-1010017.ec2.internal:2181,ip-1010018.ec2.internal:2181, baseZNode=/hbase
>>>>>>>>>> 2014-11-14 11:25:02,645 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=catalogtracker-on-hconnection-0x47d865f2 connecting to ZooKeeper ensemble=ip-1010019.ec2.internal:2181,ip-1010017.ec2.internal:2181,ip-1010018.ec2.internal:2181
>>>>>>>>>> 2014-11-14 11:25:02,645 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-1010018.ec2.internal/1010019:2181. Will not attempt to authenticate using SASL (unknown error)
>>>>>>>>>> 2014-11-14 11:25:02,646 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to ip-1010018.ec2.internal/1010019:2181, initiating session
>>>>>>>>>> 2014-11-14 11:25:02,648 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server ip-1010018.ec2.internal/1010019:2181, sessionid = 0x348c7017707236c, negotiated timeout = 60000
>>>>>>>>>> 2014-11-14 11:25:02,703 INFO org.apache.zookeeper.ZooKeeper: Session: 0x348c7017707236c closed
>>>>>>>>>> 2014-11-14 11:25:02,703 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
>>>>>>>>>> 2014-11-14 11:25:30,713 INFO org.apache.hadoop.hbase.master.handler.DispatchMergingRegionHandler: Skip merging regions TABLE_NAME,,1415915112497.7373f75181c71eb5061a6673cee15931., TABLE_NAME,\x02\xFA\xF0\x80\x00\x00\x01I\xAA\xD5\x87\xA8\x19\x99\x99\x99\x99\x99\x99\x90,1415910559217.43f4d3685d113d3ae18eea9f189de096., because region 7373f75181c71eb5061a6673cee15931 has merge qualifier
>>>>>>>>>> 2014-11-14 11:25:41,383 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=ip-1010019.ec2.internal:2181,ip-1010017.ec2.internal:2181,ip-1010018.ec2.internal:2181 sessionTimeout=60000 watcher=catalogtracker-on-hconnection-0x47d865f2, quorum=ip-1010019.ec2.internal:2181,ip-1010017.ec2.internal:2181,ip-1010018.ec2.internal:2181, baseZNode=/hbase
>>>>>>>>>> 2014-11-14 11:25:41,384 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=catalogtracker-on-hconnection-0x47d865f2 connecting to ZooKeeper ensemble=ip-1010019.ec2.internal:2181,ip-1010017.ec2.internal:2181,ip-1010018.ec2.internal:2181
>>>>>>>>>> 2014-11-14 11:25:41,384 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-1010018.ec2.internal/1010019:2181. Will not attempt to authenticate using SASL (unknown error)
>>>>>>>>>> 2014-11-14 11:25:41,386 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to ip-1010018.ec2.internal/1010019:2181, initiating session
>>>>>>>>>> 2014-11-14 11:25:41,389 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server ip-1010018.ec2.internal/1010019:2181, sessionid = 0x348c7017707236e, negotiated timeout = 60000
>>>>>>>>>> 2014-11-14 11:25:41,397 INFO org.apache.zookeeper.ZooKeeper: Session: 0x348c7017707236e closed
>>>>>>>>>> 2014-11-14 11:25:41,398 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
>>>>>>>>>> ------------------------------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Shahab
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 14, 2014 at 10:56 AM, Ted Yu <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Looking at DispatchMergingRegionHandler, it does some checks
>>>>>>>>>>> before initiating the merge, e.g.:
>>>>>>>>>>>
>>>>>>>>>>>     LOG.info("Skip merging regions " + region_a.getRegionNameAsString()
>>>>>>>>>>>         + ", " + region_b.getRegionNameAsString() + ", because region "
>>>>>>>>>>>         + (regionAHasMergeQualifier ? region_a.getEncodedName()
>>>>>>>>>>>             : region_b.getEncodedName()) + " has merge qualifier");
>>>>>>>>>>>
>>>>>>>>>>> Can you take a look at the master log around the time the
>>>>>>>>>>> merge request was issued to see if you can get some clue?
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 14, 2014 at 6:41 AM, Shahab Yunus <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> The documentation of the online merge tool (merge_region)
>>>>>>>>>>>> states that if we forcibly merge regions (by setting the 3rd
>>>>>>>>>>>> argument to true), it can create overlapping regions. If this
>>>>>>>>>>>> happens, will it render the region or table unusable, or is
>>>>>>>>>>>> it just a performance hit? I mean, how big of a deal is it?
>>>>>>>>>>>>
>>>>>>>>>>>> Actually, we are merging regions using the programmatic API
>>>>>>>>>>>> for this and setting this flag ('forcible') to false. But for
>>>>>>>>>>>> some tables (we haven't figured out a pattern yet; data is
>>>>>>>>>>>> still accessible), the merge of regions does not happen at
>>>>>>>>>>>> all. Afterwards we tried with this flag = true, and it still
>>>>>>>>>>>> doesn't merge them.
>>>>>>>>>>>>
>>>>>>>>>>>> CDH 5.1.0
>>>>>>>>>>>> (HBase is 0.98.1-cdh5.1.0)
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Shahab
