Yeah, I think that might work. You would create a FileNamingScheme that would allow you to specify different prefixes for the FileTargets of your different PCollections. I don't see any example code for how to use it for that purpose, just this one test Gabriel wrote:
https://github.com/apache/crunch/blob/master/crunch-core/src/test/java/org/apache/crunch/io/SequentialFileNamingSchemeTest.java

On Fri, Nov 13, 2015 at 10:00 AM, Josh Wills <[email protected]> wrote:

> Although...hrm. I wonder if FileNamingScheme would work for this purpose?
> Did you look at that?
>
> On Fri, Nov 13, 2015 at 9:58 AM, Josh Wills <[email protected]> wrote:
>
>> I see; they all need to end up in the same bucket in S3 w/different
>> names. Then yes, the options you describe sound about right.
>>
>> On Fri, Nov 13, 2015 at 9:49 AM, David Ortiz <[email protected]>
>> wrote:
>>
>>> Hey,
>>>
>>> The reason I was looking for this is because whether I write them
>>> to different directories, or the same directories, I have to distcp them
>>> all to the same s3 bucket for downstream processing to function properly,
>>> so I need to make sure that the file names don’t overlap. So to get this
>>> to work, it sounds like my options would be the following:
>>>
>>> · Have the client move the files to a common directory with
>>> names I want using FileSystem calls
>>>
>>> · Write a shell script that Oozie calls to do the same thing as
>>> the previous option, but with dfs calls.
>>>
>>> · Write an additional crunch job, which will load the output
>>> from the previous four jobs and union the results.
>>>
>>> Does that sound about right?
>>>
>>> Thanks,
>>>
>>> Dave
>>>
>>> *From:* Josh Wills [mailto:[email protected]]
>>> *Sent:* Friday, November 13, 2015 12:41 PM
>>> *To:* [email protected]
>>> *Subject:* Re: Output file prefix
>>>
>>> Hey David,
>>>
>>> There isn't a way to muck w/the file output prefix on a per-collection
>>> basis. Would something like a PathPerKeyTarget work for this situation,
>>> where you would have four keys for the different output directories and
>>> could sort of union together the PTable<String, Whatever> instances that
>>> you needed to create on a particular run?
>>>
>>> J
>>>
>>> On Fri, Nov 13, 2015 at 7:36 AM, David Ortiz <[email protected]> wrote:
>>>
>>> Hey everyone,
>>>
>>> I thought I remembered seeing something in the docs about being
>>> able to set a prefix for output files from a collection, but I am having
>>> trouble finding it now. Does that exist?
>>>
>>> I am trying to break up a large job that had four parallel threads
>>> of execution on different data sets, that all fed one output set into four
>>> separate jobs to make it easier to rerun only one of the input sets in the
>>> event something goes wrong, and this would make it a lot easier to deal
>>> with getting the output all into one directory.
>>>
>>> Thanks,
>>>
>>> Dave
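
P.S. Since there is no example code for the prefix idea, here is a rough sketch of just the naming logic in plain Java. PrefixedNamingSketch and its two methods are hypothetical stand-ins I made up for illustration; this is not the Crunch FileNamingScheme interface and it has no Hadoop dependency. As far as I recall, the real interface exposes getMapOutputName and getReduceOutputName and takes a Hadoop Configuration and the output Path, but please check that against the linked test before relying on it. The "part-m"/"part-r" pattern below is likewise an assumption modeled on the usual Hadoop-style part file names.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the idea behind a per-collection naming scheme.
// NOT the Crunch API: it only shows how a prefix keeps file names distinct.
final class PrefixedNamingSketch {
    private final String prefix;

    PrefixedNamingSketch(String prefix) {
        if (prefix == null || prefix.isEmpty()) {
            throw new IllegalArgumentException("prefix must be non-empty");
        }
        this.prefix = prefix;
    }

    // Name for the n-th map-only output file, e.g. "jobA-part-m-00000".
    String mapOutputName(int sequence) {
        return String.format(Locale.ROOT, "%s-part-m-%05d", prefix, sequence);
    }

    // Name for the output file of a given reduce partition.
    String reduceOutputName(int partitionId) {
        return String.format(Locale.ROOT, "%s-part-r-%05d", prefix, partitionId);
    }

    public static void main(String[] args) {
        // Simulate four separate jobs whose outputs end up in one bucket.
        String[] jobs = {"jobA", "jobB", "jobC", "jobD"};
        Set<String> bucket = new LinkedHashSet<>();
        for (String job : jobs) {
            PrefixedNamingSketch scheme = new PrefixedNamingSketch(job);
            for (int partition = 0; partition < 2; partition++) {
                String name = scheme.reduceOutputName(partition);
                // Set.add returns false if the name is already present.
                if (!bucket.add(name)) {
                    throw new IllegalStateException("collision: " + name);
                }
            }
        }
        bucket.forEach(System.out::println);
    }
}
```

The point is only that each job gets its own prefix, so the files can be copied into one directory or bucket without overwriting each other. Wiring something like this into an actual FileTarget would still need the real Crunch types.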
