Thanks for the quick answer!

So ParDo fusion will only be possible if I run different DoFns on the same 
input concurrently, right? Or, conversely, if I don’t have concurrent DoFns 
running on the same input, it would be safe to modify the input element?

Best,
Rico.


From: Frances Perry [mailto:[email protected]]
Sent: Thursday, December 15, 2016 07:28
To: [email protected]
Subject: Re: Immutability requirement for UDF input



On Wed, Dec 14, 2016 at 10:14 PM, Bergmann, Rico (GfK External) 
<[email protected]<mailto:[email protected]>> wrote:
Hi!

In the Beam documentation I read that a DoFn should never modify any value of 
an incoming element (retrieved via ProcessContext.element(.)). I’m wondering 
when this would be a problem if I don’t have any object reuse behavior in my 
execution environment. Can you give a hint where this might cause problems?

The Beam model is designed specifically to give runners some flexibility to 
support efficient execution. For example, many runners do an optimization 
called ParDo fusion, where a given element is run through a tree of adjacent 
ParDos and only materialized at the leaves. In this case, the output of one 
ParDo is handed straight to consuming ParDos, which means a single output 
element may be handed to multiple sibling consumers. Determining if a DoFn 
mutates an element is tricky, and requiring the runner to always copy the 
element just in case the first DoFn mutated it would introduce a significant 
performance cost.
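
To make the hazard concrete, here is a minimal plain-Java sketch. It is not Beam 
code: Event, runFused and the three consumer functions are hypothetical names 
made up for illustration, and the loop only imitates the idea of a fused stage 
handing one output object to sibling consumers without copying it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class FusionSketch {
    /** Hypothetical mutable element type, standing in for a pipeline element. */
    static final class Event {
        String name;
        Event(String name) { this.name = name; }
    }

    /**
     * Hands the SAME Event instance to every consumer in turn, imitating a fused
     * stage that passes one upstream output to its sibling consumers uncopied.
     */
    static List<String> runFused(Event element, List<Function<Event, String>> consumers) {
        List<String> outputs = new ArrayList<>();
        for (Function<Event, String> consumer : consumers) {
            outputs.add(consumer.apply(element));
        }
        return outputs;
    }

    /** Breaks the immutability rule: changes the input element in place. */
    static String mutatingUpperCase(Event e) {
        e.name = e.name.toUpperCase();
        return e.name;
    }

    /** Follows the rule: derives a new value and leaves the input untouched. */
    static String wellBehavedUpperCase(Event e) {
        return e.name.toUpperCase();
    }

    /** A sibling consumer that only reads the element. */
    static String readName(Event e) {
        return e.name;
    }

    public static void main(String[] args) {
        Function<Event, String> mutate = FusionSketch::mutatingUpperCase;
        Function<Event, String> pure = FusionSketch::wellBehavedUpperCase;
        Function<Event, String> read = FusionSketch::readName;

        // The sibling reader observes the first consumer's in-place change.
        System.out.println(runFused(new Event("click"), Arrays.asList(mutate, read)));
        // prints [CLICK, CLICK]

        // With a non-mutating consumer, the sibling still sees the original value.
        System.out.println(runFused(new Event("click"), Arrays.asList(pure, read)));
        // prints [CLICK, click]
    }
}
```

Note that nothing here runs concurrently: the second consumer sees the changed 
value simply because both were given the same object.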

And a second question: How would I implement a DISTINCT transformation on a 
PCollection?

We already have a Distinct transform you can use, but you can also dive into 
the source code to see how it works!
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Distinct.java
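
For intuition only, here is a plain-Java sketch of one common way to express 
"distinct" in a keyed model: use each element as its own key so that duplicates 
collapse onto a single entry, then keep just the keys. This is an assumption 
about the general technique, not a copy of the linked Beam source, and 
DistinctSketch and distinct are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DistinctSketch {
    /**
     * Returns each distinct element of the input once, relying on the
     * element's equals() and hashCode() to detect duplicates.
     */
    static <T> List<T> distinct(Iterable<T> input) {
        // Key every element by itself with a placeholder value; repeated
        // elements land on the same key, so only one entry per value survives.
        Map<T, Boolean> keyed = new LinkedHashMap<>();
        for (T element : input) {
            keyed.put(element, Boolean.TRUE);
        }
        // Drop the placeholder values and keep only the keys.
        return new ArrayList<>(keyed.keySet());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b");
        System.out.println(distinct(words)); // prints [a, b, c]
    }
}
```

The LinkedHashMap keeps first-seen order only to make the example output 
predictable; a distributed collection would not normally promise any ordering.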


Thanks in advance,
Rico.

________________________________


GfK SE, Nuremberg, Germany, commercial register at the local court Amtsgericht 
Nuremberg HRB 25014; Management Board: Dr. Gerhard Hausruckinger (Speaker of 
the Management Board), Christian Diedrich (CFO), Matthias Hartmann, David 
Krajicek, Alessandra Cama; Chairman of the Supervisory Board: Ralf 
Klein-Bölting This email and any attachments may contain confidential or 
privileged information. Please note that unauthorized copying, disclosure or 
distribution of the material in this email is not permitted.


