Quick question.  While attempting an HFile PathTarget implementation that 
extends FileTargetImpl, I ran into an issue with nested files and wanted 
to hear everyone's thoughts.

First off, I'm using the latest 0.7.0-SNAPSHOT with CDH4.2.1 Hadoop/HBase 
versions (though this shouldn't matter for this question).

So, HFiles require a directory structure something like this:

  *   Destination path
     *   {columnFamily}
        *   {randomGUID1} (corresponding to an HBase region)
        *   {randomGUID2} (corresponding to an HBase region)
        *   Etc…

So you get files in the working path like this

  *   workingpath/columnFamily1/guid1
  *   workingpath/columnFamily1/guid2
  *   workingpath/columnFamily1/…

FileTargetImpl allows subclasses to override getSourcePattern and 
getDestFile to help with this, so the source pattern is something like this

  *   Path(workingPath, "[^_]*/*")

And the destination file is something like

  *   Path(destination / src.getParent.getName, src.getName)
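
As a rough illustration of that mapping, here is a minimal sketch.  It uses 
java.nio.file.Path as a stand-in for Hadoop's Path so it runs without any 
Hadoop jars, and the class and method names (HFileLayout, destFile) are 
made up for the example rather than taken from Crunch.

```java
import java.nio.file.Path;

/** Sketch only: java.nio.file.Path stands in for org.apache.hadoop.fs.Path. */
final class HFileLayout {

  /**
   * Hypothetical helper: maps workingPath/{columnFamily}/{guid} to
   * destination/{columnFamily}/{guid}, keeping one level of nesting.
   */
  static Path destFile(Path destination, Path src) {
    Path columnFamily = src.getParent().getFileName(); // e.g. columnFamily1
    Path fileName = src.getFileName();                 // e.g. guid1
    return destination.resolve(columnFamily).resolve(fileName);
  }
}
```

Only the last two path components of the source are kept, which is why 
the column family folder has to exist under the destination before the 
file can be moved into it.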

The issue is that FileTargetImpl doesn't create any nested folders before 
trying to do the file rename (except for the top-level destination 
directory).  So, for instance, it may try to do something like moving from

  *   workingPath/columnFamily1/guid1

To

  *   destinationPath/columnFamily1/guid1

But only destinationPath exists, not the nested columnFamily1 folder.  The 
rename therefore fails silently, and data goes missing from the destination 
path (the rename method actually returns a boolean, which should probably 
also be checked so that failures are reported).
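
A minimal sketch of the kind of fix being suggested, under some 
assumptions: java.io.File stands in for Hadoop's FileSystem so the example 
runs locally (File.mkdirs and File.renameTo report failure through a 
boolean, much like FileSystem.mkdirs and FileSystem.rename), and the 
NestedRename class and moveIntoPlace method are hypothetical names, not 
part of Crunch.

```java
import java.io.File;
import java.io.IOException;

/** Sketch only: java.io.File stands in for Hadoop's FileSystem. */
final class NestedRename {

  /** Hypothetical helper: moves src to dst, creating dst's parent first. */
  static void moveIntoPlace(File src, File dst) throws IOException {
    File parent = dst.getParentFile();
    // Create the nested folder (e.g. destinationPath/columnFamily1) if missing.
    if (parent != null && !parent.isDirectory() && !parent.mkdirs()) {
      throw new IOException("Could not create directory " + parent);
    }
    // Check the boolean result instead of ignoring it, so a failed move
    // is reported rather than silently dropping the file.
    if (!src.renameTo(dst)) {
      throw new IOException("Rename failed: " + src + " -> " + dst);
    }
  }
}
```

The same two steps (make the parent, then check the result of the 
rename) would presumably translate directly to the FileSystem calls.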

So, my question is: should we look at getting an enhancement to FileTargetImpl 
that would build any parent directories required (it might also make sense to 
make sure each one is a folder under the destination path), or is the 
expectation that FileTargetImpl is only supposed to be used by internal Crunch 
targets, so copying its functionality (and adding this enhancement) would be a 
task for anyone wanting to develop a new PathTarget?
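
For the "folder under destination path" check mentioned above, one 
possible shape is sketched below.  Again java.nio.file.Path is used as a 
stand-in for Hadoop's Path, and DestinationGuard / isUnder are 
hypothetical names chosen for the example.

```java
import java.nio.file.Path;

/** Sketch only: java.nio.file.Path stands in for org.apache.hadoop.fs.Path. */
final class DestinationGuard {

  /**
   * Hypothetical check: true only if candidate resolves to a location
   * strictly below destination once "." and ".." segments are collapsed.
   */
  static boolean isUnder(Path destination, Path candidate) {
    Path root = destination.toAbsolutePath().normalize();
    Path target = candidate.toAbsolutePath().normalize();
    return target.startsWith(root) && !target.equals(root);
  }
}
```

A check like this could run before any parent directories are created, 
so an overridden getDestFile that points outside the destination is 
rejected early.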
