On Wed, Dec 14, 2016 at 10:14 PM, Bergmann, Rico (GfK External) <
[email protected]> wrote:

> Hi!
>
>
>
> In the Beam documentation I read that a DoFn should never modify any
> value of an incoming element (retrieved via ProcessContext.element()). I'm
> wondering when this would be a problem if I don't have any object reuse
> behavior in my execution environment. Can you give a hint where this might
> cause problems?
>

The Beam model is designed specifically to give runners some flexibility
to support efficient execution. For example, many runners perform an
optimization called ParDo fusion, where a given element is run through a
tree of adjacent ParDos and only materialized at the leaves. In this case,
the output of one ParDo is handed straight to the consuming ParDos, which
means a single output element may be handed to multiple sibling consumers.
If one of those consumers mutated the element, its siblings could observe
the modified value instead of the original. Determining whether a DoFn
mutates an element is tricky, and requiring the runner to always copy the
element, just in case a DoFn mutated it, would introduce a significant
performance cost.
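
To make the hazard concrete, here is a minimal plain-Java sketch. It is not
Beam code: the Event class, runFused, and the consumer methods are all
hypothetical names, and runFused only imitates what a fusing runner might do
by handing the same element instance to each sibling consumer without copying.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class FusionMutationSketch {
    // Hypothetical mutable element type, standing in for a user record.
    static final class Event {
        String name;
        Event(String name) { this.name = name; }
    }

    // Imitates a fused stage (an assumption, not a real runner): every
    // sibling consumer receives the very same element instance.
    static List<String> runFused(Event element,
                                 List<Function<Event, String>> consumers) {
        List<String> outputs = new ArrayList<>();
        for (Function<Event, String> consumer : consumers) {
            outputs.add(consumer.apply(element));
        }
        return outputs;
    }

    // Misbehaving consumer: modifies its input in place.
    static String mutatingUpperCase(Event e) {
        e.name = e.name.toUpperCase();
        return e.name;
    }

    // Well-behaved consumer: derives a new value, input left untouched.
    static String pureUpperCase(Event e) {
        return e.name.toUpperCase();
    }

    // Sibling consumer that only reads the element.
    static String readName(Event e) {
        return e.name;
    }

    // The reading sibling runs after the mutating one and sees its change.
    static List<String> demoMutating() {
        List<Function<Event, String>> consumers = List.of(
            FusionMutationSketch::mutatingUpperCase,
            FusionMutationSketch::readName);
        return runFused(new Event("click"), consumers);
    }

    // With a non-mutating sibling, the reader sees the original value.
    static List<String> demoPure() {
        List<Function<Event, String>> consumers = List.of(
            FusionMutationSketch::pureUpperCase,
            FusionMutationSketch::readName);
        return runFused(new Event("click"), consumers);
    }

    public static void main(String[] args) {
        System.out.println(demoMutating()); // prints [CLICK, CLICK]
        System.out.println(demoPure());     // prints [CLICK, click]
    }
}
```

In the first case the reading consumer never sees "click", even though it did
nothing wrong. In a real pipeline, which sibling runs first is up to the
runner, so the result of a mutating DoFn may not even be deterministic.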

> And a second question: How would I implement a DISTINCT transformation on
> a PCollection?
>
>
We already have a Distinct transformation you can use, but you can also dive
into the source code to see how it works!
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Distinct.java
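
In a pipeline you would normally just apply the transform, e.g.
input.apply(Distinct.<String>create()). If you want a feel for the general
idea before reading the source, here is a rough plain-Java sketch of one way
to deduplicate by grouping: key each element by itself, collapse everything
under the same key, and emit only the keys. The class and method names below
are hypothetical, this runs on an in-memory list rather than a PCollection,
and it is not a copy of the linked implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DistinctSketch {
    // Deduplicate by grouping (illustrative only, single-machine):
    //   1. use each element as its own key, with a placeholder value
    //   2. collapse all entries that share a key into one
    //   3. emit just the keys
    static <T> List<T> distinct(List<T> input) {
        // LinkedHashMap keeps first-seen order, which makes the result
        // easy to read here; a PCollection has no such ordering guarantee.
        Map<T, Boolean> perKey = new LinkedHashMap<>();
        for (T element : input) {
            perKey.put(element, Boolean.TRUE); // value carries no information
        }
        return new ArrayList<>(perKey.keySet());
    }

    public static void main(String[] args) {
        System.out.println(distinct(List.of("a", "b", "a", "c", "b")));
        // prints [a, b, c]
    }
}
```

Note that this sketch compares elements with equals/hashCode, whereas a
distributed grouping has to compare keys across workers, so the details in
the real transform differ.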


>
>
> Thanks in advance,
>
> Rico.
>
