Hi,
I'm curious if Crunch attempts to perform any optimizations to avoid
repeated operations, and, if so, how it figures out what's being repeated.
For example, let's say I have a PCollection called xCollection and a utility
method joinAndProcess that keys two collections using MapFns, joins them,
and does a parallelDo on the result, like this:
public PCollection<String> joinAndProcess(
    PCollection<String> left,
    PCollection<Double> right) {
  *PTable<Integer, String> keyedLeftTable = left.by(someMapFn1);*
  PTable<Integer, Double> keyedRightTable = right.by(...);
  PTable<Integer, Pair<String, Double>> joinedTable = ... join ...
  return joinedTable.parallelDo(...);
}
If I call joinAndProcess(xCollection, <some other collection>) multiple
times, will Crunch notice that the highlighted left.by(someMapFn1) step is
the same each time and reuse its result rather than recompute it?
Would it be able to do so if the .by step were given the same name or the
same MapFn instance each time?
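To make clear what I'd do otherwise: the manual alternative I have in mind is to key the left side once and pass the keyed result into the helper. Below is a rough plain-Java sketch of that shape. It is not Crunch code and says nothing about what the Crunch planner actually does; the class name KeyOnceSketch, the keyBy helper, the keyingRuns counter, and the sample data are all made up for illustration, with ordinary lists and maps standing in for PCollection and PTable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class KeyOnceSketch {
    // Counts how many times the keying step (the stand-in for left.by) runs.
    static int keyingRuns = 0;

    // Hypothetical stand-in for left.by(someMapFn1): index each element by its key.
    // Simplification: a later element overwrites an earlier one with the same key.
    static Map<Integer, String> keyBy(List<String> left, Function<String, Integer> keyFn) {
        keyingRuns++;
        Map<Integer, String> keyed = new HashMap<>();
        for (String value : left) {
            keyed.put(keyFn.apply(value), value);
        }
        return keyed;
    }

    // Variant of joinAndProcess that takes the already-keyed left side,
    // so the caller decides how often the keying happens.
    static List<String> joinAndProcess(Map<Integer, String> keyedLeft,
                                       List<Double> right,
                                       Function<Double, Integer> rightKeyFn) {
        List<String> result = new ArrayList<>();
        for (Double r : right) {
            String l = keyedLeft.get(rightKeyFn.apply(r));
            if (l != null) {
                result.add(l + "/" + r); // inner join: keep matching keys only
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> xCollection = Arrays.asList("a", "bb", "ccc");
        // Key the left side once, then reuse it for every join.
        Map<Integer, String> keyedLeft = keyBy(xCollection, String::length);
        Function<Double, Integer> byIntPart = d -> d.intValue();
        List<String> first = joinAndProcess(keyedLeft, Arrays.asList(1.5, 3.25), byIntPart);
        List<String> second = joinAndProcess(keyedLeft, Arrays.asList(2.0), byIntPart);
        System.out.println(first + " " + second + " keyingRuns=" + keyingRuns);
    }
}
```

This works, but it pushes the keying detail out to every caller, which is why I'm hoping Crunch can recognize the repeated step on its own.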
Thanks,
Everett