Hi Everyone,
I have a Crunch job that reads some data from S3, applies a simple MapFn, and 
then does a total order sort.
PCollection<String> rawdata = readTextFile("s3n://data");
PCollection<String> data = rawdata.parallelDo(new myMapFn());
Sort.sort(data);
I noticed that Sort from the sort library works in two phases, the first being 
the presort phase. When I execute this pipeline as is, the data is read 
and transformed three times: the first time to generate the PCollections, 
the second time for the presort phase, and the third time for the final sort.
The snippet below ends up reading the data from S3 only once:
PCollection<String> rawdata = readTextFile("s3n://data");
PCollection<String> data = rawdata.parallelDo(new myMapFn());
data.cache();
pipeline.run();
Sort.sort(data);
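To illustrate why materializing the mapped data first changes the number of source reads, here is a toy sketch in plain Java. It does not use the Crunch API and makes no claim about how the Crunch planner or Sort is actually implemented. The class RereadSketch, the helper readAndMap, and the pivot-based two-phase sort are all hypothetical stand-ins. The toy models only the two sort phases, so the uncached count here is two rather than the three reads observed in the real pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: plain Java, not the Crunch API or its planner.
public class RereadSketch {
    // Counts how many times the simulated source is scanned.
    static int sourceReads = 0;

    // Hypothetical stand-in for readTextFile(...) plus the MapFn:
    // every call re-scans the source and re-applies the map step.
    static List<String> readAndMap() {
        sourceReads++;
        List<String> out = new ArrayList<>();
        for (String line : Arrays.asList("pear", "apple", "fig")) {
            out.add(line.toUpperCase()); // placeholder map step
        }
        return out;
    }

    // A two-phase sort that asks for its input once per phase.
    static List<String> twoPhaseSort(Supplier<List<String>> input) {
        // Phase 1 (presort): scan the input to pick a split point.
        List<String> sample = new ArrayList<>(input.get());
        Collections.sort(sample);
        String pivot = sample.get(sample.size() / 2);

        // Phase 2: scan the input again, partition around the split
        // point, sort each partition, and concatenate.
        List<String> low = new ArrayList<>();
        List<String> high = new ArrayList<>();
        for (String s : input.get()) {
            (s.compareTo(pivot) < 0 ? low : high).add(s);
        }
        Collections.sort(low);
        Collections.sort(high);
        low.addAll(high);
        return low;
    }

    // Lazy input: each sort phase triggers its own read of the source.
    static int readsWithoutCache() {
        sourceReads = 0;
        twoPhaseSort(RereadSketch::readAndMap);
        return sourceReads;
    }

    // Materialized input: the source is read once, then reused by both phases.
    static int readsWithCache() {
        sourceReads = 0;
        List<String> cached = readAndMap();
        twoPhaseSort(() -> cached);
        return sourceReads;
    }

    public static void main(String[] args) {
        System.out.println("reads without cache: " + readsWithoutCache());
        System.out.println("reads with cache:    " + readsWithCache());
    }
}
```

In this sketch the cached variant corresponds to calling data.cache() and pipeline.run() before Sort.sort(data): the expensive read and map happen once, and both sort phases consume the materialized result.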
Might this be a Crunch planner optimization opportunity?
Thanks!
Danny
