Quick question. While attempting an HFile PathTarget implementation that
extends FileTargetImpl, I ran into an issue with nested files and wanted
to see what everyone's thoughts were.
First off, I'm using the latest 0.7.0-SNAPSHOT with CDH4.2.1 Hadoop/HBase
versions (though this shouldn't matter for this question).
So, HFiles require a directory structure something like this
* Destination path
  * {columnFamily}
    * {randomGUID1} (corresponding to an HBase region)
    * {randomGUID2} (corresponding to an HBase region)
    * Etc…
So you get files in the working path like this
* workingpath/columnFamily1/guid1
* workingpath/columnFamily1/guid2
* workingpath/columnFamily1/…
FileTargetImpl allows subclasses to override getSourcePattern and
getDestFile to help with this, so the source pattern is something like this
* Path(workingPath, "[^_]*/*")
And the destination file is something like
* Path(destination / src.getParent.getName, src.getName)
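To make the destination mapping concrete, here is a minimal sketch of the same idea. It uses java.nio.file.Path rather than Hadoop's Path so that it runs without a cluster, and destFor is a hypothetical helper name, not the actual getDestFile signature in Crunch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class HFileDestMapping {
    // Hypothetical helper (not a Crunch API): keep the column family
    // directory, i.e. the source file's parent, under the destination root.
    static Path destFor(Path destination, Path src) {
        Path family = src.getParent().getFileName();
        return destination.resolve(family).resolve(src.getFileName());
    }

    public static void main(String[] args) {
        Path src = Paths.get("workingPath", "columnFamily1", "guid1");
        // Maps to destinationPath / columnFamily1 / guid1.
        System.out.println(destFor(Paths.get("destinationPath"), src));
    }
}
```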
The issue is that FileTargetImpl doesn't create any nested folders before
trying to do the file rename (except for the top-level root folder). So, for
instance, it may try to do something like moving a file from
* workingPath/columnFamily1/guid1
To
* destinationPath/columnFamily1/guid1
But only the destination path exists, not the nested columnFamily folder. This
makes the rename fail silently and results in missing data in the destination
path (the rename method actually returns a boolean, which should probably also
be checked so that failures are reported).
So, my question is: should we look at enhancing FileTargetImpl so that it
builds any required parent directories (it might also make sense to verify
that the result is a folder under the destination path)? Or is the expectation
that FileTargetImpl is only supposed to be used by internal Crunch targets, so
that copying its functionality (and adding this enhancement) would be a task
for anyone wanting to develop a new PathTarget?
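For discussion, here is a rough sketch of the enhancement I have in mind. It is written against java.nio.file so that it runs standalone, and moveInto is a hypothetical helper, not existing Crunch code. With Hadoop's FileSystem the equivalent calls would be mkdirs(Path) and rename(Path, Path), both of which return a boolean that would have to be checked explicitly, whereas Files.move throws on failure.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class SafeMove {
    // Hypothetical helper (not part of Crunch): verify that the target stays
    // under the destination root, create missing parents, then move the file.
    static void moveInto(Path destinationRoot, Path src, Path dest) throws IOException {
        Path root = destinationRoot.toAbsolutePath().normalize();
        Path target = dest.toAbsolutePath().normalize();
        if (!target.startsWith(root)) {
            throw new IOException("Refusing to write outside " + root + ": " + target);
        }
        // Build e.g. destinationPath/columnFamily1 before the move.
        Files.createDirectories(target.getParent());
        // Throws if the move fails, so a failure cannot go unnoticed.
        Files.move(src, target);
    }

    public static void main(String[] args) throws IOException {
        Path work = Files.createTempDirectory("workingPath");
        Path dest = Files.createTempDirectory("destinationPath");
        Path src = Files.createDirectories(work.resolve("columnFamily1")).resolve("guid1");
        Files.write(src, new byte[] {1, 2, 3});
        Path target = dest.resolve("columnFamily1").resolve("guid1");
        moveInto(dest, src, target);
        System.out.println("moved: " + Files.exists(target));
    }
}
```

The containment check is there only to cover the "make sure it's a folder under destination path" idea; whether that belongs in FileTargetImpl itself is part of the question.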