Hi Dougan,

I've been working on this for a while (CRUNCH-212 <https://issues.apache.org/jira/browse/CRUNCH-212>, not finished yet). It seems that LoadIncrementalHFiles does not require the input HFiles to be named with random GUIDs, as long as each file sits in a directory named after its column family. I have only tried this in a unit test, not on a real cluster, so I may be wrong.

On Fri, Jul 26, 2013 at 1:23 AM, Dougan,Brian <[email protected]> wrote:

> Quick question. In my attempt at an HFile PathTarget implementation
> that extended from FileTargetImpl, I ran into an issue with nested files
> and wanted to see what everyone's thoughts were.
>
> First off, I'm using the latest 0.7.0-SNAPSHOT with CDH4.2.1
> Hadoop/HBase versions (though this shouldn't matter for this question).
>
> So, HFiles require a structure something like this
>
> - Destination path
>   - {columnFamily}
>     - {randomGUID1} (corresponding to an HBase region)
>     - {randomGUID2} (corresponding to an HBase region)
>     - Etc…
>
> So you get files in the working path like this
>
> - workingpath/columnFamily1/guid1
> - workingpath/columnFamily1/guid2
> - workingpath/columnFamily1/…
>
> FileTargetImpl allows consumers to override the getSourcePattern and
> getDestFile to help with this, so the source pattern is something like this
>
> - Path(workingPath, "[^_]*/*")
>
> And the destination file is something like
>
> - Path(destination / src.getParent.getName, src.getName)
>
> The issue is that FileTargetImpl doesn't create any nested folders before
> trying to do the file rename (except for the top-level root server). So
> for instance, it may try to do something like copying from
>
> - workingPath/columnFamily1/guid1
>
> To
>
> - destinationPath/columnFamily1/guid1
>
> But only destination path exists, not the nested columnFamily folder.
> This makes the rename silently fail and results in missing data in the
> destination path (the rename method actually returns a boolean that should
> probably also be validated to alert on failures).
>
> So, my question is, should we look at getting an enhancement to
> FileTargetImpl that would build any parent directories required (might also
> make sense to make sure it's a folder under destination path) or is the
> expectation for FileTargetImpl that it's only supposed to be used by
> internal Crunch targets, so copying functionality (and adding this
> enhancement) would be a task for anyone wanting to develop a new PathTarget?
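
For reference, here is a minimal sketch of the behavior being asked for: map a nested source file to destination/{columnFamily}/{file}, check that the result stays under the destination path, and create the missing parent directory before moving. This is not Crunch code. The class and method names (NestedMoveSketch, destFor, moveWithParents) are made up for illustration, and it uses java.nio.file as a stand-in for Hadoop's FileSystem so it runs without a cluster. With Hadoop, the corresponding steps would presumably be FileSystem.mkdirs on the parent and checking the boolean returned by FileSystem.rename.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; not part of Crunch or Hadoop.
class NestedMoveSketch {

    // Maps working/{family}/{file} to destination/{family}/{file}.
    static Path destFor(Path destination, Path src) {
        return destination
                .resolve(src.getParent().getFileName())
                .resolve(src.getFileName());
    }

    // Creates any missing parent directories, then moves the file.
    static void moveWithParents(Path destination, Path src) throws IOException {
        Path root = destination.normalize();
        Path dest = destFor(root, src).normalize();

        // Guard: the target must stay under the destination path.
        if (!dest.startsWith(root)) {
            throw new IOException("Target escapes destination: " + dest);
        }

        // The step reported missing in the thread: build the nested folder first.
        Files.createDirectories(dest.getParent());

        // Files.move throws on failure. Hadoop's FileSystem.rename returns a
        // boolean instead, which would have to be checked explicitly.
        Files.move(src, dest);
    }

    public static void main(String[] args) throws IOException {
        Path working = Files.createTempDirectory("workingPath");
        Path destination = Files.createTempDirectory("destinationPath");

        Path src = working.resolve("columnFamily1").resolve("guid1");
        Files.createDirectories(src.getParent());
        Files.write(src, new byte[] {1, 2, 3});

        moveWithParents(destination, src);
        System.out.println(
                Files.exists(destination.resolve("columnFamily1").resolve("guid1")));
    }
}
```

Failing loudly when the move does not succeed is what would prevent the missing-data case described above.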
