Although...hrm. I wonder if FileNamingScheme would work for this purpose? Did you look at that?
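
If it would, a custom scheme could look roughly like the sketch below. This is untested and just illustrative: the class name is made up, the map-side file name is only a placeholder for "something unique", and it assumes the two FileNamingScheme methods take a Configuration and an output Path the way SequentialFileNamingScheme's do, so double-check the interface in the version you're on.

import java.io.IOException;

import org.apache.crunch.io.FileNamingScheme;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

/**
 * Sketch: prepend a per-job prefix to every output file so the four jobs
 * can land in the same S3 bucket without name collisions.
 */
public class PrefixFileNamingScheme implements FileNamingScheme {

  private final String prefix;

  public PrefixFileNamingScheme(String prefix) {
    this.prefix = prefix;
  }

  @Override
  public String getMapOutputName(Configuration conf, Path outputDirectory) throws IOException {
    // Placeholder uniqueness scheme for map-only outputs; replace with
    // whatever makes sense for your jobs (e.g. a sequential counter).
    return String.format("%s-m-%d", prefix, System.currentTimeMillis());
  }

  @Override
  public String getReduceOutputName(Configuration conf, Path outputDirectory, int partitionId)
      throws IOException {
    return String.format("%s-r-%05d", prefix, partitionId);
  }
}

You would still need to hand the scheme to whichever Target you write to; I believe the file-based targets have constructors that accept a FileNamingScheme, but verify that against the Crunch version you're running.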
On Fri, Nov 13, 2015 at 9:58 AM, Josh Wills <[email protected]> wrote:

> I see; they all need to end up in the same bucket in S3 w/different names.
> Then yes, the options you describe sound about right.
>
> On Fri, Nov 13, 2015 at 9:49 AM, David Ortiz <[email protected]> wrote:
>
>> Hey,
>>
>> The reason I was looking for this is that whether I write them to
>> different directories or the same directory, I have to distcp them all
>> to the same S3 bucket for downstream processing to function properly,
>> so I need to make sure that the file names don't overlap. To get this
>> to work, it sounds like my options would be the following:
>>
>> - Have the client move the files to a common directory with the names I
>>   want, using FileSystem calls.
>> - Write a shell script that Oozie calls to do the same thing as the
>>   previous option, but with dfs calls.
>> - Write an additional Crunch job that loads the output from the
>>   previous four jobs and unions the results.
>>
>> Does that sound about right?
>>
>> Thanks,
>> Dave
>>
>> From: Josh Wills [mailto:[email protected]]
>> Sent: Friday, November 13, 2015 12:41 PM
>> To: [email protected]
>> Subject: Re: Output file prefix
>>
>> Hey David,
>>
>> There isn't a way to muck w/the file output prefix on a per-collection
>> basis. Would something like a PathPerKeyTarget work for this situation,
>> where you would have four keys for the different output directories and
>> could sort of union together the PTable<String, Whatever> instances
>> that you needed to create on a particular run?
>>
>> J
>>
>> On Fri, Nov 13, 2015 at 7:36 AM, David Ortiz <[email protected]> wrote:
>>
>> Hey everyone,
>>
>> I thought I remembered seeing something in the docs about being able to
>> set a prefix for the output files from a collection, but I am having
>> trouble finding it now. Does that exist?
>>
>> I am trying to break up a large job that had four parallel threads of
>> execution on different data sets, all feeding one output set, into four
>> separate jobs, to make it easier to rerun only one of the input sets in
>> the event something goes wrong. Being able to set a prefix would make
>> it a lot easier to deal with getting the output all into one directory.
>>
>> Thanks,
>> Dave
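
For completeness, the first option on Dave's list is just a handful of FileSystem calls, roughly along the lines of the sketch below. The directory names and the "jobA" prefix are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of the "client moves the files" option: rename each part file from
 * one job's output directory into a shared directory, adding a per-job
 * prefix so the four jobs' files can't collide before the distcp to S3.
 */
public class MergeOutputDirs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    Path source = new Path("/data/output/jobA");
    Path merged = new Path("/data/output/merged");
    fs.mkdirs(merged);

    for (FileStatus status : fs.listStatus(source)) {
      String name = status.getPath().getName();
      if (!status.isFile() || name.startsWith("_")) {
        continue; // skip subdirectories and _SUCCESS markers
      }
      fs.rename(status.getPath(), new Path(merged, "jobA-" + name));
    }
  }
}

The Oozie shell-script variant is the same idea done with hdfs dfs -mv, and the extra-Crunch-job variant is just reading the four outputs and writing a.union(b).union(c).union(d) to a single target.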
