So I would want something like…

PCollection<String> rawInput = pipeline.read(From.textFile(s3Path));
rawInput.cache();
pipeline.run();

//Do other processing followed by another pipeline.run()
?

Thanks,
     Dave

From: Josh Wills [mailto:[email protected]]
Sent: Friday, November 13, 2015 12:44 PM
To: [email protected]
Subject: Re: MRPipeline.cache()

To absolutely guarantee it only runs once, you should make reading/copying the 
data from S3 into HDFS its own job by inserting a Pipeline.run() after the call 
to cache() and before any subsequent processing on the data. cache() will write 
the data locally, but if you have N processes that each want to do something with 
the data, there is no guarantee that the caching happens before the rest of 
the processes start trying to read the data unless you make a blocking call to run().
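Putting the two emails together, the pattern looks roughly like the sketch below. This is written against the Crunch MRPipeline API from memory, so treat it as a sketch rather than a verified program; MyDriver, ParseFn, conf, and s3Path are placeholder names, not anything from this thread.

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.types.writable.Writables;

Pipeline pipeline = new MRPipeline(MyDriver.class, conf);  // MyDriver/conf: placeholders

// Job 1: read from S3 once and mark the result for caching.
PCollection<String> rawInput = pipeline.read(From.textFile(s3Path));
rawInput.cache();
pipeline.run();  // blocking call: the cached copy exists before anything downstream reads it

// Jobs 2..N: subsequent processing now reads the cached data rather than re-reading S3.
PCollection<String> parsed =
    rawInput.parallelDo(new ParseFn(), Writables.strings());  // ParseFn: placeholder DoFn
// ... more processing ...
pipeline.run();
```

The key point from Josh's reply is the first pipeline.run(): without that blocking call between cache() and the downstream stages, Crunch may plan the downstream jobs to read straight from S3 in parallel instead of from the cached copy.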

J

On Fri, Nov 13, 2015 at 7:34 AM, David Ortiz 
<[email protected]<mailto:[email protected]>> wrote:
Hey,

     If I have an input data set that is super expensive to read (think hundreds 
of GB of data on S3, for example), would I be able to use cache() to make sure I 
only do the read once and then hand it out to the jobs that need it, as opposed 
to what Crunch does by default, which is to read it once for each parallel 
thread that needs the data?

Thanks,
     Dave

