Re: Planning Optimization for Sort

Josh Wills Tue, 13 Jan 2015 13:38:55 -0800

I counted two reads of the first job instead of three. Are you writing out
the "data" PCollection as part of the job as well?


I'm trying to think of how I would want to tell the planner that the S3
read is slow/expensive; maybe a flag on Source that could be used to signal
an expensive source that should only ever be read once?
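
To make the idea concrete, here is a rough sketch in plain Java. None of
this is Crunch API: SketchSource, isExpensive(), CountingSource, and plan()
are hypothetical names made up for illustration, and the two passes only
mimic the presort and sort phases described in the quoted message below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

public class ExpensiveSourceSketch {

    /** Hypothetical stand-in for a source; NOT Crunch's Source interface. */
    interface SketchSource<T> {
        List<T> read();

        /** Assumed flag: true means "read me at most once". */
        default boolean isExpensive() {
            return false;
        }
    }

    /** Toy source that counts how many times it has been read. */
    static final class CountingSource implements SketchSource<String> {
        int reads = 0;
        private final boolean expensive;

        CountingSource(boolean expensive) {
            this.expensive = expensive;
        }

        @Override
        public List<String> read() {
            reads++;
            return Arrays.asList("b", "c", "a");
        }

        @Override
        public boolean isExpensive() {
            return expensive;
        }
    }

    /** Toy "planner" step: an expensive source is read once and then reused. */
    static <T> Supplier<List<T>> plan(final SketchSource<T> source) {
        if (!source.isExpensive()) {
            return source::read; // cheap source: every consumer re-reads it
        }
        return new Supplier<List<T>>() {
            private List<T> cached;

            @Override
            public List<T> get() {
                if (cached == null) {
                    cached = new ArrayList<>(source.read()); // the single read
                }
                return cached;
            }
        };
    }

    /** Runs a two-pass sort over the planned input and returns the read count. */
    public static int readsForTwoPhaseSort(boolean expensive) {
        CountingSource source = new CountingSource(expensive);
        Supplier<List<String>> input = plan(source);

        // Pass 1 (stand-in for presort): scan the input once.
        List<String> sample = new ArrayList<>(input.get());
        Collections.sort(sample);

        // Pass 2 (stand-in for the final sort): scan the input again.
        List<String> sorted = new ArrayList<>(input.get());
        Collections.sort(sorted);

        return source.reads;
    }

    public static void main(String[] args) {
        System.out.println("reads without flag: " + readsForTwoPhaseSort(false));
        System.out.println("reads with flag:    " + readsForTwoPhaseSort(true));
    }
}
```

In this toy version the flag simply swaps in a memoizing wrapper. A real
planner would presumably have to decide where to materialize the data and
how that interacts with an explicit cache() call, which the sketch ignores.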

On Tue, Jan 13, 2015 at 1:11 PM, Danny Morgan <[email protected]>
wrote:

> Hi Everyone,
>
> I have a Crunch job that reads some data from S3, applies a simple
> MapFn, and then does a total order sort.
>
> PCollection<String> rawdata = readTextFile("s3n://data");
> PCollection<String> data = rawdata.parallelDo(new myMapFn());
> Sort.sort(data);
>
> I noticed that Sort from the sort library works in two phases, the first
> of which is called the presort phase. When I execute this pipeline as is,
> the data is read and transformed three times: first to generate the
> PCollections, a second time for the presort phase, and a third time for
> the final sort.
>
> The snippet below ends up reading the data from S3 only once.
>
> PCollection<String> rawdata = readTextFile("s3n://data");
> PCollection<String> data = rawdata.parallelDo(new myMapFn());
> data.cache();
> pipeline.run();
> Sort.sort(data);
>
> Might this be a Crunch planner optimization opportunity?
>
> Thanks!
>
> Danny
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
