Thanks for the quick answer! So ParDo fusion is only possible if I run different DoFns on the same input concurrently, right? Or, conversely, if I don’t have concurrent DoFns running on the same input, would it be safe to modify the input element?
Best, Rico.

From: Frances Perry [mailto:[email protected]]
Sent: Thursday, December 15, 2016 07:28
To: [email protected]
Subject: Re: Immutability requirement for UDF input

On Wed, Dec 14, 2016 at 10:14 PM, Bergmann, Rico (GfK External) <[email protected]> wrote:

> Hi!
> In the Beam documentation I read that a DoFn should never modify any value of an incoming element (retrieved via ProcessContext.element()). I’m wondering when this would be a problem if I don’t have any object reuse behavior in my execution environment. Can you give a hint where this might cause problems?

The Beam model is designed specifically to give runners some flexibility to support efficient execution. For example, many runners do an optimization called ParDo fusion, where a given element is run through a tree of adjacent ParDos and only materialized at the leaves. In this case, the output of one ParDo is handed straight to the consuming ParDos, which means a single output element may be handed to multiple sibling consumers. Determining whether a DoFn mutates an element is tricky, and requiring the runner to always copy the element just in case the first DoFn mutated it would introduce a significant performance cost.

> And a second question: How would I implement a DISTINCT transformation on a PCollection?

We already have a Distinct transformation you can use, but you can also dive into the source code to see how it works!
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Distinct.java

> Thanks in advance, Rico.

________________________________
GfK SE, Nuremberg, Germany, commercial register at the local court Amtsgericht Nuremberg HRB 25014; Management Board: Dr. 
Gerhard Hausruckinger (Speaker of the Management Board), Christian Diedrich (CFO), Matthias Hartmann, David Krajicek, Alessandra Cama; Chairman of the Supervisory Board: Ralf Klein-Bölting
This email and any attachments may contain confidential or privileged information. Please note that unauthorized copying, disclosure or distribution of the material in this email is not permitted.
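
To make the fusion point from the quoted reply concrete, here is a minimal plain-Java sketch. It does not use the Beam SDK. The `Event` type, `runFused`, and `demo` are hypothetical names invented for this illustration, and the loop is only a stand-in for what a fusing runner might do. It shows one element object being handed to two sibling consumers in turn, so that an in-place change made by the first is seen by the second. Note that everything runs sequentially on one thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class FusionSketch {
    // Hypothetical mutable element type, used only for this illustration.
    static final class Event {
        String name;
        Event(String name) { this.name = name; }
    }

    // Stand-in for a fused stage: the same object reference is passed to each
    // sibling consumer in turn, with no copy made in between.
    static void runFused(Event element, List<Consumer<Event>> siblings) {
        for (Consumer<Event> consumer : siblings) {
            consumer.accept(element);
        }
    }

    // Returns what the second sibling observed after the first one ran.
    static List<String> demo(boolean copyBeforeMutating) {
        List<String> seenBySecond = new ArrayList<>();
        Consumer<Event> first = e -> {
            // Either mutate the shared input in place, or work on a private copy.
            Event target = copyBeforeMutating ? new Event(e.name) : e;
            target.name = target.name + "-modified";
        };
        Consumer<Event> second = e -> seenBySecond.add(e.name);
        runFused(new Event("click"), Arrays.asList(first, second));
        return seenBySecond;
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // prints [click-modified]
        System.out.println(demo(true));  // prints [click]
    }
}
```

In this sketch the second consumer sees the modified value even though nothing runs concurrently; copying inside the first consumer before changing anything leaves the shared input untouched.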

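
For the DISTINCT question, the general idea behind such a transform can be sketched in plain Java as follows. This is not the code from the linked Distinct.java, and `DistinctSketch.distinct` is a hypothetical helper. It only illustrates the common pattern of keying each element by itself, letting duplicates collapse onto one key, and then emitting each key once. It assumes the elements have consistent `equals` and `hashCode` implementations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DistinctSketch {
    static <T> List<T> distinct(List<T> input) {
        // Key each element by itself; duplicates land on the same key.
        Map<T, Boolean> grouped = new LinkedHashMap<>();
        for (T element : input) {
            grouped.put(element, Boolean.TRUE);
        }
        // Emit each key exactly once and drop the placeholder values.
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        System.out.println(distinct(Arrays.asList("a", "b", "a", "c", "b"))); // prints [a, b, c]
    }
}
```

The sketch keeps first-seen order only because it uses a `LinkedHashMap`; a PCollection makes no such ordering guarantee. In an actual pipeline the existing transform would presumably be applied directly, for example `input.apply(Distinct.<String>create())`, instead of being reimplemented.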