Frances Perry created BEAM-1164:
-----------------------------------

             Summary: Allow a DoFn to opt in to mutating its input
                 Key: BEAM-1164
                 URL: https://issues.apache.org/jira/browse/BEAM-1164
             Project: Beam
          Issue Type: Bug
          Components: beam-model
            Reporter: Frances Perry
            Priority: Minor


Runners generally can't tell whether a DoFn mutates its inputs, and assuming 
by default that it does would impose a significant performance cost from 
unnecessary copying (around sibling fusion, etc.). So instead the model 
forbids mutating inputs, and the Direct Runner validates this behavior. (See: 
http://beam.incubator.apache.org/contribute/design-principles/#make-efficient-things-easy-rather-than-make-easy-things-efficient)
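
To make the sibling fusion hazard concrete, here is a minimal sketch in plain 
Java. It is a toy model, not Beam code: the class SiblingFusionSketch and all 
of its methods are hypothetical names invented for illustration. A "fused" 
stage passes the same element object to two sibling functions; if the first 
mutates it, the second sees the change unless the stage pays for a copy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Toy model only: hypothetical names, no Beam APIs involved.
class SiblingFusionSketch {

    // Well-behaved sibling: only reads its input.
    static int sum(List<Integer> in) {
        int total = 0;
        for (int v : in) {
            total += v;
        }
        return total;
    }

    // Misbehaving sibling: mutates its input in place, then returns the size.
    static int appendAndCount(List<Integer> in) {
        in.add(100);
        return in.size();
    }

    // The two siblings that consume the same upstream element.
    static List<Function<List<Integer>, Integer>> siblings() {
        List<Function<List<Integer>, Integer>> fns = new ArrayList<>();
        fns.add(SiblingFusionSketch::appendAndCount);
        fns.add(SiblingFusionSketch::sum);
        return fns;
    }

    // Fused execution: every sibling receives the same object, with no copy.
    static List<Integer> runFused(List<Integer> element) {
        List<Integer> outputs = new ArrayList<>();
        for (Function<List<Integer>, Integer> fn : siblings()) {
            outputs.add(fn.apply(element));
        }
        return outputs;
    }

    // Defensive execution: every sibling receives its own copy of the element.
    static List<Integer> runWithCopies(List<Integer> element) {
        List<Integer> outputs = new ArrayList<>();
        for (Function<List<Integer>, Integer> fn : siblings()) {
            outputs.add(fn.apply(new ArrayList<Integer>(element)));
        }
        return outputs;
    }

    public static void main(String[] args) {
        // The mutation leaks into the second sibling: prints [4, 106]
        System.out.println(runFused(new ArrayList<Integer>(Arrays.asList(1, 2, 3))));
        // Copies isolate the siblings: prints [4, 6]
        System.out.println(runWithCopies(new ArrayList<Integer>(Arrays.asList(1, 2, 3))));
    }
}
```

Copying on every hand-off gives the expected answer but costs one copy per 
sibling per element, which is the overhead the model avoids by forbidding 
mutation.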
 

However, if users are processing a small number of large records by making 
incremental changes (for example, genomics use cases), the cost of the 
immutability requirement can be very large. As a workaround, users sometimes 
do suboptimal things (fusing ParDos by hand) or, when they believe the 
immutability requirement is unnecessarily strict, undefined things (adding 
no-op coders in places they hope the runner won't be materializing things, 
mutating things anyway when they don't expect sibling fusion to happen, etc.).

We should consider adding a signal (MutatingDoFn?) that lets users explicitly 
declare that their code may mutate inputs. The runner can then use this signal 
to either prevent optimizations that would break in the face of mutation or 
insert additional copies as needed so that optimizations preserve semantics.
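
One possible shape for such a signal, sketched in plain Java under stated 
assumptions: the @MutatesInput annotation, the ElementFn interface, and the 
toy runFused method below are all hypothetical and do not exist in Beam. The 
sketch shows only the "insert additional copies" option: the toy runner hands 
a private copy to functions that opted in and shares the original object with 
everyone else.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical names, no Beam APIs involved.
class MutationOptInSketch {

    // Hypothetical opt-in marker; retained at runtime so the runner can see it.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface MutatesInput {}

    // Stand-in for a DoFn that consumes one element and emits one value.
    interface ElementFn {
        int process(List<Integer> element);
    }

    // Read-only function: no annotation, so it may share the input object.
    static class SumFn implements ElementFn {
        public int process(List<Integer> element) {
            int total = 0;
            for (int v : element) {
                total += v;
            }
            return total;
        }
    }

    // Mutating function: declares its intent through the marker.
    @MutatesInput
    static class AppendAndCountFn implements ElementFn {
        public int process(List<Integer> element) {
            element.add(100);
            return element.size();
        }
    }

    // Toy fused stage: copy the element only for functions that opted in.
    static List<Integer> runFused(List<Integer> element, List<ElementFn> siblings) {
        List<Integer> outputs = new ArrayList<>();
        for (ElementFn fn : siblings) {
            boolean mayMutate = fn.getClass().isAnnotationPresent(MutatesInput.class);
            List<Integer> input = mayMutate ? new ArrayList<Integer>(element) : element;
            outputs.add(fn.process(input));
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<Integer> element = new ArrayList<Integer>(Arrays.asList(1, 2, 3));
        List<ElementFn> siblings = new ArrayList<>();
        siblings.add(new AppendAndCountFn());
        siblings.add(new SumFn());
        System.out.println(runFused(element, siblings)); // prints [4, 6]
        System.out.println(element);                     // prints [1, 2, 3]
    }
}
```

With the marker in place, only the opted-in function pays for a copy, so 
read-only siblings keep the benefit of fusion without copying.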

See this related user@ discussion:
https://lists.apache.org/thread.html/f39689f54147117f3fc54c498eff1a20fa73f1be5b5cad5b6f816fd3@%3Cuser.beam.apache.org%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
