So I would want something like…
PCollection<String> rawInput = pipeline.read(From.textFile(s3Path));
rawInput.cache();
pipeline.run();
// Do other processing followed by another pipeline.run()
?
Thanks,
Dave
From: Josh Wills [mailto:[email protected]]
Sent: Friday, November 13, 2015 12:44 PM
To: [email protected]
Subject: Re: MRPipeline.cache()
To absolutely guarantee it only runs once, make reading/copying the data from
S3 into HDFS its own job by inserting a Pipeline.run() after the call to
cache() and before any subsequent processing on the data. cache() will write
the data locally, but if you have N processes that each want to do something
with the data, there is no guarantee that the caching finishes before the rest
of the processes start trying to read the data unless you make a blocking call
to run().
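Putting that together, the pattern would look roughly like the sketch below. This is only an illustration: the S3 path and the MyJob class are hypothetical, and the calls shown (From.textFile, PCollection.cache, Pipeline.run, Pipeline.done) follow the Crunch API as described in this thread.

    // Job 1: read from S3 once and materialize the cached copy.
    Pipeline pipeline = new MRPipeline(MyJob.class, new Configuration());
    PCollection<String> rawInput = pipeline.read(From.textFile("s3://bucket/path"));
    rawInput.cache();
    pipeline.run(); // blocking call: caching completes before anything else runs

    // Jobs 2..N: subsequent processing reads the cached data, not S3.
    // ... apply parallelDo/groupByKey/etc. to rawInput, then:
    pipeline.done();

The key point is that the blocking run() separates the expensive S3 read into its own job, so later stages consume the cached copy instead of re-reading S3.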
J
On Fri, Nov 13, 2015 at 7:34 AM, David Ortiz
<[email protected]> wrote:
Hey,
If I have an input data set that is very expensive to read (think hundreds of
GB of data on S3, for example), would I be able to use cache() to make sure I
only do the read once, then hand the result to the jobs that need it, as
opposed to what Crunch does by default, which is to read it once for each
parallel stage that needs the data?
Thanks,
Dave