RE: Output file prefix

David Ortiz Fri, 13 Nov 2015 10:05:48 -0800

Thanks.  I’ll take a look at that!

From: Josh Wills [mailto:[email protected]]
Sent: Friday, November 13, 2015 1:04 PM
To: [email protected]
Subject: Re: Output file prefix


Yeah, I think that might work. You would create a FileNamingScheme that would 
allow you to specify different prefixes for the FileTargets of your different 
PCollections. I don't see any example code for how to use it for that purpose, 
just this one test Gabriel wrote:

https://github.com/apache/crunch/blob/master/crunch-core/src/test/java/org/apache/crunch/io/SequentialFileNamingSchemeTest.java
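
To make the idea a bit more concrete: a per-collection scheme would 
essentially give each FileTarget its own prefix when part-file names are 
generated. What follows is only a plain-Java sketch of that naming logic. It 
does not implement Crunch's FileNamingScheme interface, and the class name, 
the helper, and the "<prefix>-part-r-NNNNN" pattern are all assumptions made 
for illustration.

```java
import java.util.Locale;

public class PrefixedNames {

    // Hypothetical helper: builds "<prefix>-part-<m|r>-<5-digit index>".
    // "m" marks map-only output and "r" marks reduce output, following the
    // usual Hadoop part-file convention (assumed here, not taken from Crunch).
    public static String outputName(String prefix, boolean mapOnly, int index) {
        return String.format(Locale.ROOT, "%s-part-%s-%05d",
                prefix, mapOnly ? "m" : "r", index);
    }

    public static void main(String[] args) {
        System.out.println(outputName("clicks", false, 0));  // clicks-part-r-00000
        System.out.println(outputName("orders", true, 12));  // orders-part-m-00012
    }
}
```

With a distinct prefix per PCollection, the names stay unique even if all of 
the files end up in one directory or bucket.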

On Fri, Nov 13, 2015 at 10:00 AM, Josh Wills 
<[email protected]<mailto:[email protected]>> wrote:
Although...hrm. I wonder if FileNamingScheme would work for this purpose? Did 
you look at that?

On Fri, Nov 13, 2015 at 9:58 AM, Josh Wills 
<[email protected]<mailto:[email protected]>> wrote:
I see; they all need to end up in the same bucket in S3 w/different names. Then 
yes, the options you describe sound about right.

On Fri, Nov 13, 2015 at 9:49 AM, David Ortiz 
<[email protected]<mailto:[email protected]>> wrote:
Hey,

     The reason I was looking for this is that, whether I write them to 
different directories or to the same directory, I have to distcp them all to 
the same S3 bucket for downstream processing to function properly, so I need to 
make sure that the file names don’t overlap.  So to get this to work, it sounds 
like my options would be the following:

• Have the client move the files to a common directory with the names I want 
using FileSystem calls.

• Write a shell script that Oozie calls to do the same thing as the 
previous option, but with dfs calls.

• Write an additional Crunch job, which will load the output from the 
previous four jobs and union the results.
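
The first option could look roughly like the sketch below. It uses 
java.nio.file on a local file system as a stand-in, not the Hadoop FileSystem 
API; on HDFS the same steps would go through Hadoop's FileSystem calls 
instead. The class name, the moveWithPrefix helper, the "part-*" glob, and the 
"<prefix>-<name>" pattern are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CollectOutputs {

    // Hypothetical helper: moves every regular file named "part-*" from jobDir
    // into commonDir as "<prefix>-<original name>" and returns how many moved.
    public static int moveWithPrefix(Path jobDir, Path commonDir, String prefix)
            throws IOException {
        Files.createDirectories(commonDir);

        // List first, then move, so the directory is not modified mid-iteration.
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(jobDir, "part-*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        }

        for (Path part : parts) {
            Path target = commonDir.resolve(prefix + "-" + part.getFileName());
            // No REPLACE_EXISTING: a name collision throws instead of overwriting.
            Files.move(part, target);
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("crunch-outputs");
        Path common = root.resolve("common");
        for (String job : new String[] {"jobA", "jobB"}) {
            Path dir = Files.createDirectories(root.resolve(job));
            Files.write(dir.resolve("part-r-00000"), job.getBytes());
            System.out.println(job + ": moved " + moveWithPrefix(dir, common, job));
        }
    }
}
```

Both jobs write a part-r-00000 file, but after the move they sit side by side 
in the common directory as jobA-part-r-00000 and jobB-part-r-00000.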

Does that sound about right?

Thanks,
     Dave

From: Josh Wills [mailto:[email protected]<mailto:[email protected]>]
Sent: Friday, November 13, 2015 12:41 PM
To: [email protected]<mailto:[email protected]>
Subject: Re: Output file prefix

Hey David,

There isn't a way to muck w/the file output prefix on a per-collection basis. 
Would something like a PathPerKeyTarget work for this situation, where you 
would have four keys for the different output directories and could sort of 
union together the PTable<String, Whatever> instances that you needed to create 
on a particular run?

J

On Fri, Nov 13, 2015 at 7:36 AM, David Ortiz 
<[email protected]<mailto:[email protected]>> wrote:
Hey everyone,

     I thought I remembered seeing something in the docs about being able to 
set a prefix for output files from a collection, but I am having trouble 
finding it now.  Does that exist?

    I am trying to break up a large job, which had four parallel threads of 
execution on different data sets that all fed one output set, into four 
separate jobs.  That would make it easier to rerun only one of the input sets 
in the event something goes wrong, and a per-collection prefix would make it a 
lot easier to deal with getting the output all into one directory.

Thanks,
     Dave

This email is intended only for the use of the individual(s) to whom it is 
addressed. If you have received this communication in error, please immediately 
notify the sender and delete the original email.


