On Wed, Dec 14, 2016 at 10:14 PM, Bergmann, Rico (GfK External) < [email protected]> wrote:
> Hi!
>
> In the Beam documentation I read, that a DoFn should never modify any
> value of an incoming element (retrieved via ProcessContext.element(.)). I’m
> wondering, when this would be a problem, if I don’t have any object reuse
> behavior in my execution environment. Can you give a hint, where this might
> cause problems?

The Beam model is designed specifically to give runners some flexibility to support efficient execution. For example, many runners perform an optimization called ParDo fusion, where a given element is run through a tree of adjacent ParDos and only materialized at the leaves. In this case, the output of one ParDo is handed straight to the consuming ParDos, which means a single output element may be handed to multiple sibling consumers. If one of those consumers modified the element, its siblings could observe the modified value instead of the original output. Determining whether a DoFn mutates an element is tricky, and requiring the runner to always copy the element just in case the first DoFn mutated it would introduce a significant performance cost.

> And a second question: How would I implement a DISTINCT transformation on
> a PCollection?

We already have a Distinct transformation you can use, but you can also dive into the source code to see how it works!

https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Distinct.java

> Thanks in advance,
>
> Rico.
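To make the fusion hazard described above a bit more concrete, here is a minimal sketch in plain Java. It does not use the Beam SDK and is not how any runner is actually implemented; `FusionSketch`, `Event`, `runFused` and the consumer methods are all hypothetical names made up for illustration. It only assumes what the explanation states: in a fused stage, one output object may be handed by reference to several sibling consumers without a copy in between.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class FusionSketch {
    /** Hypothetical mutable element type, standing in for a user-defined value. */
    public static final class Event {
        String name;
        Event(String name) { this.name = name; }
        Event copy() { return new Event(name); }
    }

    /**
     * Simulates a fused stage (an assumption, not real runner code): the same
     * upstream output object is passed to every sibling consumer in turn.
     */
    public static List<String> runFused(Event upstreamOutput,
                                        List<Function<Event, String>> siblings) {
        List<String> results = new ArrayList<>();
        for (Function<Event, String> consumer : siblings) {
            results.add(consumer.apply(upstreamOutput));
        }
        return results;
    }

    /** Problematic consumer: modifies its input element in place. */
    public static String mutatingUpperCase(Event e) {
        e.name = e.name.toUpperCase();
        return e.name;
    }

    /** Safe consumer: works on a private copy and leaves the input untouched. */
    public static String copyingUpperCase(Event e) {
        Event mine = e.copy();
        mine.name = mine.name.toUpperCase();
        return mine.name;
    }

    /** Sibling consumer that only reads the element. */
    public static String readName(Event e) {
        return e.name;
    }

    public static List<String> runMutating() {
        List<Function<Event, String>> siblings =
            Arrays.asList(FusionSketch::mutatingUpperCase, FusionSketch::readName);
        return runFused(new Event("click"), siblings);
    }

    public static List<String> runCopying() {
        List<Function<Event, String>> siblings =
            Arrays.asList(FusionSketch::copyingUpperCase, FusionSketch::readName);
        return runFused(new Event("click"), siblings);
    }

    public static void main(String[] args) {
        // The reader sibling sees the first consumer's mutation.
        System.out.println(runMutating()); // prints [CLICK, CLICK]
        // With a private copy, the reader sibling sees the original value.
        System.out.println(runCopying());  // prints [CLICK, click]
    }
}
```

In the first run the second consumer never asked for an upper-cased name, yet it receives one because it shares the object with a sibling that mutated it. Copying inside the consumer that needs to change the value keeps the cost with the one DoFn that needs it, rather than forcing a defensive copy for every element.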
