[ 
https://issues.apache.org/jira/browse/FLINK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139151#comment-15139151
 ] 

Greg Hogan edited comment on FLINK-3333 at 2/9/16 8:07 PM:
-----------------------------------------------------------

Apache Flink programs can be written and configured to reduce the number of 
object allocations for better performance. User defined functions (like map() 
or groupReduce()) process many millions or billions of input and output values. 
Enabling object reuse and processing mutable objects improves performance by 
lowering demand on the CPU cache and Java garbage collector.

Object reuse is disabled by default, with user defined functions generally 
getting new objects on each call (or through an iterator). In this case it is 
safe to store references to the objects inside the function (for example, in a 
List).

<'storing values in a list' example>

Apache Flink will chain functions to improve performance when sorting is 
preserved and the parallelism unchanged. The chainable operators are Map, 
FlatMap, Reduce on full DataSet, GroupCombine on a Grouped DataSet, or a 
GroupReduce where the user supplied a RichGroupReduceFunction with a combine 
method. Objects are passed without copying _even when object reuse is disabled_.

In the chaining case, the functions in the chain are receiving the same object 
instances. So the the second map() function is receiving the objects the first 
map() is returning. This behavior can lead to errors when the first map() 
function keeps a list of all objects and the second mapper is modifying 
objects. In that case, the user has to manually create copies of the objects 
before putting them into the list.

<chainable example>

<discussion of copyable values>

<copyablevalue example>

There is a switch at the ExecutionConfig which allows users to enable the 
object reuse mode (enableObjectReuse()). For mutable types, Flink will reuse 
object instances. In practice that means that a user function will always 
receive the same object instance (with its fields set to new values). The 
object reuse mode will lead to better performance because fewer objects are 
created, but the user has to manually take care of what they are doing with the 
object references.

<object reuse example>


was (Author: greghogan):
Apache Flink programs can be written and configured to reduce the number of 
object allocations for better performance. User defined functions (like map() 
or groupReduce()) process many millions or billions of input and output values. 
Enabling object reuse and processing mutable objects improves performance by 
lowering demand on the CPU cache and Java garbage collector.

Object reuse is disabled by default, with user defined functions generally 
getting new objects on each call (or through an iterator). In this case it is 
safe to store references to the objects inside the function (for example, in a 
List).

<'storing values in a list' example>

Apache Flink will chain functions to improve performance when sorting is 
preserved and the parallelism unchanged. The chainable operators are Map, 
FlatMap, Reduce on full DataSet, GroupCombine on a Grouped DataSet, or a 
GroupReduce where the user supplied a RichGroupReduceFunction with a combine 
method). Objects are passed without copying _even when object reuse is 
disabled_.

In the chaining case, the functions in the chain are receiving the same object 
instances. So the the second map() function is receiving the objects the first 
map() is returning. This behavior can lead to errors when the first map() 
function keeps a list of all objects and the second mapper is modifying 
objects. In that case, the user has to manually create copies of the objects 
before putting them into the list.

<chainable example>

<discussion of copyable values>

<copyablevalue example>

There is a switch at the ExecutionConfig which allows users to enable the 
object reuse mode (enableObjectReuse()). For mutable types, Flink will reuse 
object instances. In practice that means that a user function will always 
receive the same object instance (with its fields set to new values). The 
object reuse mode will lead to better performance because fewer objects are 
created, but the user has to manually take care of what they are doing with the 
object references.

<object reuse example>

> Documentation about object reuse should be improved
> ---------------------------------------------------
>
>                 Key: FLINK-3333
>                 URL: https://issues.apache.org/jira/browse/FLINK-3333
>             Project: Flink
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 1.0.0
>            Reporter: Gabor Gevay
>            Assignee: Gabor Gevay
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> The documentation about object reuse \[1\] has several problems, see \[2\].
> \[1\] 
> https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/index.html#object-reuse-behavior
> \[2\] 
> https://docs.google.com/document/d/1cgkuttvmj4jUonG7E2RdFVjKlfQDm_hE6gvFcgAfzXg/edit



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to