Agreed -- this is actually something that I had meant to do quite a while back when the FileNamingScheme interface was introduced, but I never got around to it.
The idea around the FileNamingScheme is that a custom output structure can be given, and the FileTargetImpl should respect the structure, creating sub-directories where needed. Going further, the idea was that it would be possible to link partitioning information with a FileNamingScheme to create a fanout based on some information in the partitions (which is probably exactly what is needed for doing the HBase file writing as well).

- Gabriel

On 25 Jul 2013, at 20:03, Josh Wills <[email protected]> wrote:

> I think that having Crunch handle it makes sense -- what do we do for Trevni
> targets right now? Don't they also create nested subdirectories?
>
> J
>
>
> On Thu, Jul 25, 2013 at 10:23 AM, Dougan,Brian <[email protected]>
> wrote:
> Quick question. In my attempt at an HFile PathTarget implementation that
> extended from FileTargetImpl, I ran into an issue with nested files and
> wanted to see what everyone's thoughts were.
>
> First off, I'm using the latest 0.7.0-SNAPSHOT with CDH4.2.1 Hadoop/HBase
> versions (though this shouldn't matter for this question).
>
> So, HFiles require a structure something like this:
>   Destination path
>     {columnFamily}
>       {randomGUID1} (corresponding to an HBase region)
>       {randomGUID2} (corresponding to an HBase region)
>       Etc…
> So you get files in the working path like this:
>   workingpath/columnFamily1/guid1
>   workingpath/columnFamily1/guid2
>   workingpath/columnFamily1/…
> FileTargetImpl allows consumers to override the getSourcePattern and
> getDestFile to help with this, so the source pattern is something like this:
>   Path(workingPath, "[^_]*/*")
> And the destination file is something like:
>   Path(destination / src.getParent.getName, src.getName)
> The issue is that FileTargetImpl doesn't create any nested folders before
> trying to do the file rename (except for the top-level root server).
> So for instance, it may try to do something like copying from
>   workingPath/columnFamily1/guid1
> to
>   destinationPath/columnFamily1/guid1
> But only the destination path exists, not the nested columnFamily folder. This
> makes the rename silently fail and results in missing data in the destination
> path (the rename method actually returns a boolean that should probably also
> be validated to alert on failures).
>
> So, my question is, should we look at getting an enhancement to
> FileTargetImpl that would build any parent directories required (it might also
> make sense to make sure it's a folder under the destination path), or is the
> expectation for FileTargetImpl that it's only supposed to be used by internal
> Crunch targets, so copying functionality (and adding this enhancement) would
> be a task for anyone wanting to develop a new PathTarget?
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
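
For reference, here is a minimal sketch of the behaviour proposed in this thread: create any missing parent directories before the rename, and treat a false return from the rename as an error. It is not Crunch code. It uses java.io.File as a stand-in, because its mkdirs() and renameTo() report failure through a boolean in the same way the thread describes for the rename call; the names NestedRename, moveWithParents and destFor are hypothetical. In FileTargetImpl the corresponding calls would presumably be Hadoop's FileSystem.mkdirs(Path) and FileSystem.rename(Path, Path), which is an assumption about where the fix would go rather than a description of the actual patch.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical illustration only; java.io.File stands in for Hadoop's FileSystem.
final class NestedRename {

    /** Maps workingPath/columnFamily/guid to destinationRoot/columnFamily/guid. */
    static File destFor(File destinationRoot, File src) {
        File familyDir = new File(destinationRoot, src.getParentFile().getName());
        return new File(familyDir, src.getName());
    }

    /**
     * Moves src to dest, creating dest's parent directories first and
     * throwing if either step fails instead of failing silently.
     */
    static void moveWithParents(File src, File dest) throws IOException {
        File parent = dest.getParentFile();
        // mkdirs() also returns false when the directory already exists,
        // so only fail if the directory is still missing afterwards.
        if (parent != null && !parent.mkdirs() && !parent.isDirectory()) {
            throw new IOException("Could not create parent directory " + parent);
        }
        // renameTo() signals failure through its return value, not an exception.
        if (!src.renameTo(dest)) {
            throw new IOException("Rename failed: " + src + " -> " + dest);
        }
    }
}
```

With this in place, moving workingPath/columnFamily1/guid1 to destFor(destinationPath, src) creates destinationPath/columnFamily1 on demand, and a failed rename surfaces as an IOException rather than as missing data.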
