Hi Everett,

No, there aren't currently any optimizations (or at least none that I'm aware of) in Crunch that would skip a repeated operation like this. Any call to parallelDo() and friends will always result in additional operations being performed in the pipeline.
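To make that concrete, here is a minimal plain-Java sketch of the behavior and of a caller-side way around it. It is not Crunch code: Node, derive(), and DerivationCache are hypothetical stand-ins invented for illustration, and the cache only recognizes a repeat when the very same input object and the very same function instance are passed again.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.function.Function;

public class ReuseSketch {
    /** Toy stand-in for a pipeline node. Hypothetical; NOT Crunch's PCollection. */
    static final class Node {
        final Node parent;
        final Function<?, ?> fn;

        Node(Node parent, Function<?, ?> fn) {
            this.parent = parent;
            this.fn = fn;
        }

        /** Models parallelDo()/by(): every call appends a brand-new node. */
        Node derive(Function<?, ?> fn) {
            return new Node(this, fn);
        }
    }

    /** Caller-side cache keyed on the identity of (parent node, function instance). */
    static final class DerivationCache {
        private final Map<Node, Map<Function<?, ?>, Node>> cache = new IdentityHashMap<>();

        Node deriveOnce(Node parent, Function<?, ?> fn) {
            // Derive only the first time this exact (parent, fn) pair is seen.
            return cache
                .computeIfAbsent(parent, p -> new IdentityHashMap<>())
                .computeIfAbsent(fn, f -> parent.derive(f));
        }
    }

    public static void main(String[] args) {
        Node x = new Node(null, null);
        Function<String, Integer> keyFn = String::length;

        // Two identical calls still produce two distinct nodes.
        System.out.println(x.derive(keyFn) == x.derive(keyFn)); // prints "false"

        // Going through the cache hands back the node created the first time.
        DerivationCache cache = new DerivationCache();
        System.out.println(cache.deriveOnce(x, keyFn) == cache.deriveOnce(x, keyFn)); // prints "true"
    }
}
```

In a real pipeline the simplest equivalent is probably to compute the keyed table once and pass that object to each caller, rather than calling by() again inside the utility method.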
That being said, adding functionality like that might be as simple as implementing equals and hashCode in one or more of the underlying PCollection impls, so this might be an interesting thing to look into further if there's a need for it.

- Gabriel

On Wed, Jun 24, 2015 at 10:28 PM Everett Anderson wrote:
> Hi,
>
> I'm curious if Crunch attempts to perform any optimizations to avoid
> repeated operations, and, if so, how it figures out what's being repeated.
>
> For example, let's say I have a PCollection called xCollection and a
> utility method joinAndProcess that extracts keys for two collections by
> MapFns, joins, and does a parallelDo on the result like this:
>
> public PCollection<String> joinAndProcess(
>     PCollection<String> left,
>     PCollection<Double> right) {
>   *PTable<Integer, String> keyedLeftTable = left.by(someMapFn1);*
>   PTable<Integer, Double> keyedRightTable = right.by(...);
>   PTable<Integer, Pair<String, Double>> joinedTable = ... join ...
>   return joinedTable.parallelDo(...);
> }
>
> If I call joinAndProcess(xCollection, some other collection) multiple
> times, will Crunch be able to notice that the highlighted
> left.by(someMapFn1) is the same and reuse the result rather than
> recompute it?
>
> Would it be able to do so if the .by step were given the same name or
> same MapFn instance each time?
>
> Thanks,
> Everett
