To absolutely guarantee it only runs once, you should make reading/copying
the data from S3 into HDFS its own job by inserting a Pipeline.run() after
the call to cache() and before any subsequent processing on the data.
cache() will write the data locally, but if you have N processes that want
to do something with the data, then without a blocking call to run() there
is no guarantee that the caching happens before the rest of the processes
start trying to read the data.
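
Roughly, something like this sketch (assuming an MRPipeline and a text
input on S3; the bucket path, the count(), and the output path are just
placeholders for whatever your actual source, types, and downstream
processing look like):

    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.io.To;
    import org.apache.hadoop.conf.Configuration;

    public class CacheOnce {
      public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(CacheOnce.class, new Configuration());

        // Read the expensive S3 input once and mark it for caching.
        PCollection<String> raw =
            pipeline.read(From.textFile("s3://my-bucket/huge-input"));
        PCollection<String> cached = raw.cache();

        // Blocking run(): makes the read/copy its own job, so the cached
        // copy is fully materialized before anything downstream reads it.
        pipeline.run();

        // Everything after this point reads the cached copy, not S3.
        // (count() is just a stand-in for the real downstream work.)
        PTable<String, Long> counts = cached.count();
        pipeline.write(counts, To.textFile("/tmp/counts"));

        pipeline.done();
      }
    }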

J

On Fri, Nov 13, 2015 at 7:34 AM, David Ortiz <[email protected]> wrote:

> Hey,
>
>      If I have an input data set that is super expensive to read (think
> hundreds of GB of data on S3, for example), would I be able to use cache
> to make sure I only do the read once, then hand it out to the jobs that
> need it, as opposed to what Crunch does by default, which is to read it
> once for each parallel thread that needs the data?
>
> Thanks,
>      Dave
>
