I counted two reads of the first job instead of three. Are you writing out the "data" PCollection as part of the job as well?

Trying to think of how I would want to communicate the fact that the s3 read is slow/expensive to the planner; maybe a bit on Source that could be used to signal an expensive source that should only ever be read once?

On Tue, Jan 13, 2015 at 1:11 PM, Danny Morgan <[email protected]> wrote:

> Hi Everyone,
>
> I have a crunch job that reads some data from s3 and applies a simple
> MapFn and then does a total order sort.
>
> PCollection<String> rawdata = readTextFile("s3n://data");
> PCollection<String> data = rawdata.parallelDo(new myMapFn());
> Sort.sort(data);
>
> I noticed that Sort from the sort library works in two phases the former
> being called the presort phase. When I execute this pipeline as is the data
> is read and transformed three times, the first time to generate the
> PCollections, second time for the presort phase, and third for the final
> sort.
>
> The snippet below ends up only reading the data from s3 once.
>
> PCollection<String> rawdata = readTextFile("s3n://data");
> PCollection<String> data = rawdata.parallelDo(new myMapFn());
> data.cache();
> pipeline.run();
> Sort.sort(data);
>
> Might be a crunch planner optimization opportunity?
>
> Thanks!
>
> Danny
>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
