Re: FileTargetImpl and nested outputs...

Gabriel Reid Thu, 25 Jul 2013 20:51:10 -0700

Agreed -- this is actually something that I had meant to do quite a while back 
when the FileNamingScheme interface was introduced, but I never got around to 
it.


The idea around the FileNamingScheme is that a custom output structure can be 
given, and the FileTargetImpl should respect the structure, creating 
sub-directories where needed. Going further, the idea was that it would be 
possible to link partitioning information with a FileNamingScheme to create a 
fanout based on some information in the partitions (which is probably exactly 
what is needed for doing the HBase file writing as well).

- Gabriel

On 25 Jul 2013, at 20:03, Josh Wills <[email protected]> wrote:

> I think that having Crunch handle it makes sense-- what do we do for Trevni 
> targets right now? Don't they also create nested subdirectories?
> 
> J
> 
> 
> On Thu, Jul 25, 2013 at 10:23 AM, Dougan,Brian <[email protected]> 
> wrote:
> Quick question.  In my attempt at an Hfile PathTarget implementation that 
> extended from FileTargetImpl, I ran into an issue with nested files and 
> wanted to see what everyone's thoughts were.  
> 
> First off, I'm using the latest 0.7.0-SNAPSHOT with CDH4.2.1 Hadoop/HBase 
> versions (though this shouldn't matter for this question).
> 
> So, Hfiles require a structure something like this:
>   Destination path
>     {columnFamily}
>       {randomGUID1} (corresponding to an HBase region)
>       {randomGUID2} (corresponding to an HBase region)
>       Etc…
> So you get files in the working path like this:
>   workingpath/columnFamily1/guid1
>   workingpath/columnFamily1/guid2
>   workingpath/columnFamily1/…
> FileTargetImpl allows consumers to override the getSourcePattern and 
> getDestFile to help with this, so the source pattern is something like this:
>   Path(workingPath, "[^_]*/*")
> And the destination file is something like:
>   Path(destination / src.getParent.getName, src.getName)
> The issue is that FileTargetImpl doesn't create any nested folders before 
> trying to do the file rename (except for the top-level root directory).  So 
> for instance, it may try to do something like copying from
>   workingPath/columnFamily1/guid1
> To 
>   destinationPath/columnFamily1/guid1
> But only destination path exists, not the nested columnFamily folder.  This 
> makes the rename silently fail and results in missing data in the destination 
> path (the rename method actually returns a boolean that should probably also 
> be validated to alert on failures). 
> 
> So, my question is, should we look at getting an enhancement to 
> FileTargetImpl that would build any parent directories required (might also 
> make sense to make sure it's a folder under destination path) or is the 
> expectation for FileTargetImpl that it's only supposed to be used by internal 
> Crunch targets, so copying functionality (and adding this enhancement) would 
> be a task for anyone wanting to develop a new PathTarget?
> 
> 
> 
> -- 
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
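
The enhancement discussed in this thread (create missing parent directories, check that they sit under the destination path, and validate the boolean returned by the rename) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Crunch implementation: the class NestedRename and its methods destFile and moveInto are hypothetical names, and java.io.File stands in for Hadoop's FileSystem and Path so the example runs without a cluster. java.io.File.renameTo, like the rename described above, reports failure through a false return value rather than an exception.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper, not part of Crunch: java.io.File stands in for
// Hadoop's FileSystem/Path so the sketch runs without a cluster.
class NestedRename {

    // Maps working/<family>/<name> to destination/<family>/<name>, like the
    // getDestFile override described in the thread.
    static File destFile(File destination, File src) {
        File family = new File(destination, src.getParentFile().getName());
        return new File(family, src.getName());
    }

    // Creates missing parent directories, then renames, and throws instead of
    // ignoring a false return value from mkdirs or renameTo.
    static void moveInto(File destination, File src) throws IOException {
        File dest = destFile(destination, src);
        File parent = dest.getParentFile();
        // Guard: the computed parent must stay under the destination path.
        if (!parent.getCanonicalFile().toPath()
                .startsWith(destination.getCanonicalFile().toPath())) {
            throw new IOException(parent + " is outside " + destination);
        }
        if (!parent.isDirectory() && !parent.mkdirs()) {
            throw new IOException("Could not create directory " + parent);
        }
        if (!src.renameTo(dest)) {
            throw new IOException("Could not rename " + src + " to " + dest);
        }
    }

    public static void main(String[] args) throws IOException {
        // Build working/columnFamily1/guid1 and an empty destination directory.
        File root = Files.createTempDirectory("nested-rename").toFile();
        File src = new File(root, "working/columnFamily1/guid1");
        File destination = new File(root, "destination");
        if (!src.getParentFile().mkdirs() || !src.createNewFile() || !destination.mkdir()) {
            throw new IOException("Could not set up demo directories under " + root);
        }
        moveInto(destination, src);
        System.out.println(new File(destination, "columnFamily1/guid1").isFile());
    }
}
```

In Crunch itself the corresponding calls would presumably be Hadoop's FileSystem.mkdirs and FileSystem.rename, both of which also return a boolean that can be checked in the same way.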
