They create multiple directories; however, the folder structure does not need to be preserved, so flattening the files is acceptable in that case.

On Thu, Jul 25, 2013 at 1:03 PM, Josh Wills wrote:

> I think that having Crunch handle it makes sense-- what do we do for
> Trevni targets right now? Don't they also create nested subdirectories?
>
> J
>
>
> On Thu, Jul 25, 2013 at 10:23 AM, Dougan,Brian wrote:
>
>> Quick question. In my attempt at an Hfile PathTarget implementation
>> that extended from FileTargetImpl, I ran into an issue with nested files
>> and wanted to see what everyone's thoughts were.
>>
>> First off, I'm using the latest 0.7.0-SNAPSHOT with CDH4.2.1
>> Hadoop/HBase versions (though this shouldn't matter for this question).
>>
>> So, Hfiles require a structure something like this
>>
>> - Destination path
>>   - {columnFamily}
>>     - {randomGUID1} (corresponding to an HBase region)
>>     - {randomGUID2} (corresponding to an HBase region)
>>     - Etc…
>>
>> So you get files in the working path like this
>>
>> - workingpath/columnFamily1/guid1
>> - workingpath/columnFamily1/guid2
>> - workingpath/columnFamily1/…
>>
>> FileTargetImpl allows consumers to override getSourcePattern and
>> getDestFile to help with this, so the source pattern is something like this
>>
>> - Path(workingPath, "[^_]*/*")
>>
>> And the destination file is something like
>>
>> - Path(destination / src.getParent.getName, src.getName)
>>
>> The issue is that FileTargetImpl doesn't create any nested folders before
>> trying to do the file rename (except for the top-level root server). So
>> for instance, it may try to do something like copying from
>>
>> - workingPath/columnFamily1/guid1
>>
>> To
>>
>> - destinationPath/columnFamily1/guid1
>>
>> But only the destination path exists, not the nested columnFamily folder.
>> This makes the rename silently fail and results in missing data in the
>> destination path (the rename method actually returns a boolean that should
>> probably also be validated to alert on failures).
>>
>> So, my question is, should we look at getting an enhancement to
>> FileTargetImpl that would build any parent directories required (it might
>> also make sense to make sure it's a folder under the destination path), or
>> is the expectation for FileTargetImpl that it's only supposed to be used by
>> internal Crunch targets, so copying the functionality (and adding this
>> enhancement) would be a task for anyone wanting to develop a new PathTarget?
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
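
For reference, the enhancement proposed in the quoted message (create any missing parent directories before the rename, confirm the destination stays under the destination path, and check the rename's boolean result) could be sketched roughly as below. This is only an illustration, not Crunch's FileTargetImpl code: it uses java.io.File on the local filesystem as a stand-in, and the class name NestedRename and the method moveWithParents are hypothetical. Hadoop's FileSystem has mkdirs and rename methods that also return a boolean, so the same checks should carry over, but that part is an assumption and is not shown here.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper; a local-filesystem stand-in for the Hadoop case.
class NestedRename {

    /**
     * Moves src to dest, creating dest's parent directories first and
     * throwing if any step fails instead of silently dropping data.
     */
    static void moveWithParents(File src, File destRoot, File dest) throws IOException {
        // Guard: refuse destinations that fall outside the destination root.
        String rootPrefix = destRoot.getCanonicalPath() + File.separator;
        if (!dest.getCanonicalPath().startsWith(rootPrefix)) {
            throw new IOException("Destination " + dest + " is outside " + destRoot);
        }

        // mkdirs() returns false when the directory already exists,
        // so only treat it as a failure if the directory is really missing.
        File parent = dest.getParentFile();
        if (!parent.isDirectory() && !parent.mkdirs()) {
            throw new IOException("Could not create directory " + parent);
        }

        // renameTo() reports failure through its return value, not an exception.
        if (!src.renameTo(dest)) {
            throw new IOException("Rename failed: " + src + " -> " + dest);
        }
    }

    public static void main(String[] args) throws IOException {
        File working = Files.createTempDirectory("workingPath").toFile();
        File destination = Files.createTempDirectory("destinationPath").toFile();

        // Lay out workingPath/columnFamily1/guid1 as in the example above.
        File src = new File(new File(working, "columnFamily1"), "guid1");
        if (!src.getParentFile().mkdirs()) {
            throw new IOException("Could not create " + src.getParentFile());
        }
        Files.write(src.toPath(), "data".getBytes());

        // Mirrors Path(destination / src.getParent.getName, src.getName).
        File dest = new File(new File(destination, src.getParentFile().getName()), src.getName());

        moveWithParents(src, destination, dest);
        System.out.println("moved: " + dest.isFile());
    }
}
```

Throwing on a false return is one possible choice; logging and collecting the failed paths would also surface the problem without aborting the whole move.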
