To absolutely guarantee it only runs once, you should make reading/copying the data from S3 into HDFS its own job by inserting a Pipeline.run() after the call to cache() and before any subsequent processing on the data. cache() will write the data locally, but if you have N processes that want to do something with that data, there is no guarantee that the caching happens before those processes start trying to read it unless you make a blocking call to run() first.
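Something like this sketch (the bucket path, class name, and downstream steps are just placeholders, not anything from your actual job):

import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.io.To;
import org.apache.hadoop.conf.Configuration;

public class CacheOnce {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(CacheOnce.class, new Configuration());

    // The expensive read from S3.
    PCollection<String> lines = pipeline.read(From.textFile("s3://some-bucket/huge-input/"));

    // Mark the collection as cached...
    lines.cache();

    // ...and block here so the read/copy runs as its own job before anything downstream starts.
    pipeline.run();

    // Everything after this point reads the cached copy instead of going back to S3.
    PTable<String, Long> counts = lines.count();
    pipeline.write(counts, To.textFile("/tmp/line-counts"));

    pipeline.done();
  }
}

The run() between cache() and the downstream parallelDo/count/write calls is the important part; without it the planner is free to fuse the S3 read into each consuming stage.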
J

On Fri, Nov 13, 2015 at 7:34 AM, David Ortiz <[email protected]> wrote:
> Hey,
>
> If I have a super expensive to read input data set (think hundreds of
> GB of data on s3 for example), would I be able to use cache to make sure I
> only do the read once, then hand it out to the jobs that need it, as
> opposed to what crunch does by default, which is read it once for each
> parallel thread that needs the data?
>
> Thanks,
> Dave
