[jira] [Commented] (FLINK-6763) Inefficient PojoSerializerConfigSnapshot serialization format

2020-03-17 Thread Tzu-Li (Gordon) Tai (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-6763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060702#comment-17060702
 ] 

Tzu-Li (Gordon) Tai commented on FLINK-6763:


[~NicoK] thanks for pinging on this as well.

The issue is no longer relevant, as we no longer serialize serializers in 
snapshots.
Closing this ticket.

> Inefficient PojoSerializerConfigSnapshot serialization format
> -
>
> Key: FLINK-6763
> URL: https://issues.apache.org/jira/browse/FLINK-6763
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Type Serialization System, Runtime / State Backends
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Assignee: Tzu-Li (Gordon) Tai
>Priority: Major
>
> The {{PojoSerializerConfigSnapshot}} stores for each serializer the beginning 
> offset and ending offset in the serialization stream. This information is 
> also written if the serializer serialization is supposed to be ignored. The 
> beginning and ending offsets are stored as a sequence of integers at the 
> beginning of the serialization stream. We store this information to skip 
> broken serializers.
> I think we don't need both offsets. Instead, I would suggest writing the 
> length of the serialized serializer into the serialization stream first, 
> followed by the serialized serializer itself. This can be done in 
> {{TypeSerializerSerializationUtil.writeSerializer}}. When reading the 
> serializer via {{TypeSerializerSerializationUtil.tryReadSerializer}}, we can 
> try to deserialize the serializer. If this operation fails, we can skip over 
> the bytes of the serialized serializer, because we know how long it was.
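
The length-prefix idea described above can be sketched roughly as follows. This is only an illustration under assumptions, not Flink's actual code: the class `LengthPrefixedSketch` and the helpers `writeWithLength` and `tryReadWithLength` are hypothetical names, and plain Java serialization stands in for whatever the real utility does internally.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Hypothetical sketch; these names are not part of Flink.
final class LengthPrefixedSketch {

    // Writes one entry as [int length][serialized bytes].
    static void writeWithLength(DataOutputStream out, Object serializer) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(buffer)) {
            oos.writeObject(serializer);
        }
        byte[] bytes = buffer.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Reads one entry. The entry's bytes are always consumed in full first,
    // so the stream is positioned at the next entry even when
    // deserialization fails; in that case null is returned.
    static Object tryReadWithLength(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            // Broken or unknown serializer: skip it, the caller can continue.
            return null;
        }
    }
}
```

Because only the length is needed to step over a broken entry, no separate index of start and end offsets has to be written at the beginning of the stream.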



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-6763) Inefficient PojoSerializerConfigSnapshot serialization format

2020-03-16 Thread Nico Kruber (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-6763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060374#comment-17060374
 ] 

Nico Kruber commented on FLINK-6763:


[~tzulitai] given that you had a PR for this a while back that still didn't 
make it into the code base, and that [~sewen]'s suggestion would make this 
whole optimisation obsolete (if you only do this once, you don't care about 
this cost too much), what are the plans regarding this ticket?

> Inefficient PojoSerializerConfigSnapshot serialization format
> -
>
> Key: FLINK-6763
> URL: https://issues.apache.org/jira/browse/FLINK-6763
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Type Serialization System, Runtime / State Backends
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Assignee: Tzu-Li (Gordon) Tai
>Priority: Major


[jira] [Commented] (FLINK-6763) Inefficient PojoSerializerConfigSnapshot serialization format

2018-03-01 Thread Tzu-Li (Gordon) Tai (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383264#comment-16383264
 ] 

Tzu-Li (Gordon) Tai commented on FLINK-6763:


[~aljoscha] yes, moving to 1.6.0.

> Inefficient PojoSerializerConfigSnapshot serialization format
> -
>
> Key: FLINK-6763
> URL: https://issues.apache.org/jira/browse/FLINK-6763
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing, Type Serialization System
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Assignee: Tzu-Li (Gordon) Tai
>Priority: Blocker
> Fix For: 1.6.0


[jira] [Commented] (FLINK-6763) Inefficient PojoSerializerConfigSnapshot serialization format

2018-03-01 Thread Aljoscha Krettek (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16382142#comment-16382142
 ] 

Aljoscha Krettek commented on FLINK-6763:
-

[~tzulitai] Did we decide to move this to 1.6.0? Or at least make it 
non-blocking?

> Inefficient PojoSerializerConfigSnapshot serialization format
> -
>
> Key: FLINK-6763
> URL: https://issues.apache.org/jira/browse/FLINK-6763
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing, Type Serialization System
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Assignee: Tzu-Li (Gordon) Tai
>Priority: Blocker
> Fix For: 1.5.0


[jira] [Commented] (FLINK-6763) Inefficient PojoSerializerConfigSnapshot serialization format

2018-02-01 Thread Stephan Ewen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349909#comment-16349909
 ] 

Stephan Ewen commented on FLINK-6763:
-

As a related comment: I think the whole snapshot procedure can be optimized a 
bit. We can create the serializer snapshot once, then just keep the bytes and 
add those to every checkpoint. In smaller state programs, the majority of 
checkpoint time can be spent on serializer snapshots (still only milliseconds, 
but optimization potential nonetheless).
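
A minimal sketch of this "serialize once, reuse the bytes" idea, assuming the snapshot does not change between checkpoints and that access is single-threaded. The class `CachedSnapshotBytes` and its `Supplier` parameter are hypothetical and only illustrate the caching; they are not Flink APIs.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.function.Supplier;

// Hypothetical sketch; not part of Flink.
final class CachedSnapshotBytes {

    // Produces the serialized serializer snapshot (the expensive step).
    private final Supplier<byte[]> snapshotSerializer;

    // Filled on first use, then reused for every later checkpoint.
    private byte[] cachedBytes;

    CachedSnapshotBytes(Supplier<byte[]> snapshotSerializer) {
        this.snapshotSerializer = snapshotSerializer;
    }

    // Appends the snapshot bytes to a checkpoint stream, serializing
    // the snapshot only the first time this method is called.
    void writeTo(OutputStream checkpointStream) throws IOException {
        if (cachedBytes == null) {
            cachedBytes = snapshotSerializer.get();
        }
        checkpointStream.write(cachedBytes);
    }
}
```

With such a cache the per-checkpoint cost is a byte copy, so the exact layout of the snapshot format matters less for checkpoint time.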

> Inefficient PojoSerializerConfigSnapshot serialization format
> -
>
> Key: FLINK-6763
> URL: https://issues.apache.org/jira/browse/FLINK-6763
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing, Type Serialization System
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Assignee: Tzu-Li (Gordon) Tai
>Priority: Blocker
> Fix For: 1.5.0


[jira] [Commented] (FLINK-6763) Inefficient PojoSerializerConfigSnapshot serialization format

2018-01-13 Thread Tzu-Li (Gordon) Tai (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325059#comment-16325059
 ] 

Tzu-Li (Gordon) Tai commented on FLINK-6763:


It seems like we forgot completely about adding this to 1.4.0.

Since the previous conclusion was that we do not want to change serialization 
formats across minor releases, we can't include this for 1.4.1.
We should make sure we include this change in 1.5.0 (as soon as possible), as 
serialization formats will affect us a long way ahead.

> Inefficient PojoSerializerConfigSnapshot serialization format
> -
>
> Key: FLINK-6763
> URL: https://issues.apache.org/jira/browse/FLINK-6763
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing, Type Serialization System
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Assignee: Tzu-Li (Gordon) Tai


[jira] [Commented] (FLINK-6763) Inefficient PojoSerializerConfigSnapshot serialization format

2017-07-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16074490#comment-16074490
 ] 

ASF GitHub Bot commented on FLINK-6763:
---

Github user tzulitai closed the pull request at:

https://github.com/apache/flink/pull/4014


> Inefficient PojoSerializerConfigSnapshot serialization format
> -
>
> Key: FLINK-6763
> URL: https://issues.apache.org/jira/browse/FLINK-6763
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing, Type Serialization System
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Assignee: Tzu-Li (Gordon) Tai


[jira] [Commented] (FLINK-6763) Inefficient PojoSerializerConfigSnapshot serialization format

2017-06-06 Thread Tzu-Li (Gordon) Tai (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16038408#comment-16038408
 ] 

Tzu-Li (Gordon) Tai commented on FLINK-6763:


After an offline discussion with [~till.rohrmann], we decided not to include 
this fix in 1.3.1, but only in 1.4.0.
The reason is that we shouldn't break serialization formats across minor 
releases.
I'll remove the "1.3 release blocker" label from this JIRA.

> Inefficient PojoSerializerConfigSnapshot serialization format
> -
>
> Key: FLINK-6763
> URL: https://issues.apache.org/jira/browse/FLINK-6763
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing, Type Serialization System
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>Assignee: Tzu-Li (Gordon) Tai
>  Labels: flink-rel-1.3.1-blockers


[jira] [Commented] (FLINK-6763) Inefficient PojoSerializerConfigSnapshot serialization format

2017-05-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-6763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16028755#comment-16028755
 ] 

ASF GitHub Bot commented on FLINK-6763:
---

GitHub user tzulitai opened a pull request:

https://github.com/apache/flink/pull/4014

[FLINK-6763] [core] Make serialization of composite serializer configs more 
efficient

This PR affects the serialization formats of configuration snapshots of 
composite serializers, most notably the `PojoSerializer`, as well as others 
such as `MapSerializer`, `GenericArraySerializer`, `TupleSerializer`, etc. It 
also affects the serialization formats of the 
`OperatorBackendSerializationProxy` and `KeyedBackendSerializationProxy`.

Prior to this PR, whenever we write a serializer and its config snapshot 
into a checkpoint, we always write the start offset and end offset of the 
serializer bytes, effectively indexing every serializer and its config. This 
required buffering the whole list of serializer and config snapshot pairs when 
writing the checkpoint.

This PR changes this to be more efficient by just writing the length of the 
serializer bytes prior to writing the serializer. This also requires less 
buffering for the writes.

## Implementation

Now, `TypeSerializerSerializationUtil` has the following methods for 
writing / reading serializers:

- `writeSerializer`
- `tryReadSerializer`
- `writeSerializerWithResilience`
- `tryReadSerializerWithResilience`

The first two non-resilient variants remain as they were (they do not contain 
the logic for writing the serializer length), and need to remain untouched for 
backwards compatibility (previous checkpoints do not contain the serializer 
length before the serializer bytes). They are only used in code paths for 
backwards compatibility.

All composite type serializers now use the latter two `*WithResilience` 
variants.

## Effect on backwards compatibility

Backwards compatibility is still maintained for prior versions.
However, depending on whether or not this PR makes it into the 1.3.0 
release, it may need further changes. The current state of the PR assumes 
that it will be merged for 1.3.0.
If it misses that release, the PR needs additional changes to have separate 
code paths for the config snapshot reads of composite serializers: one for 
VERSION 1, which still uses offsets, and one for a bumped VERSION 2, which 
uses the new `*WithResilience` variants.
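
To make the difference between the two layouts and the versioned read path concrete, here is a rough sketch. The byte layouts, the class `VersionedSnapshotFormatSketch`, and all method names are assumptions chosen for illustration; the real Flink formats contain more than this.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the layouts below are assumed, not Flink's.
final class VersionedSnapshotFormatSketch {

    // VERSION 1: [1][count][start_0][end_0]...[all entry bytes].
    // Every entry must be known before the offset index can be written.
    static byte[] writeV1(List<byte[]> entries) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(sink);
        out.writeInt(1);
        out.writeInt(entries.size());
        int offset = 0;
        for (byte[] entry : entries) {
            out.writeInt(offset);          // start offset
            offset += entry.length;
            out.writeInt(offset);          // end offset
        }
        for (byte[] entry : entries) {
            out.write(entry);
        }
        out.flush();
        return sink.toByteArray();
    }

    // VERSION 2: [2][count]([length_i][entry bytes_i])*.
    // Each entry can be written as soon as it is available.
    static byte[] writeV2(List<byte[]> entries) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(sink);
        out.writeInt(2);
        out.writeInt(entries.size());
        for (byte[] entry : entries) {
            out.writeInt(entry.length);
            out.write(entry);
        }
        out.flush();
        return sink.toByteArray();
    }

    // Dispatches on the version tag so that old data stays readable.
    static List<byte[]> read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int version = in.readInt();
        int count = in.readInt();
        List<byte[]> entries = new ArrayList<>(count);
        switch (version) {
            case 1: {
                int[] lengths = new int[count];
                for (int i = 0; i < count; i++) {
                    int start = in.readInt();
                    int end = in.readInt();
                    lengths[i] = end - start;
                }
                for (int length : lengths) {
                    entries.add(readBytes(in, length));
                }
                break;
            }
            case 2:
                for (int i = 0; i < count; i++) {
                    entries.add(readBytes(in, in.readInt()));
                }
                break;
            default:
                throw new IOException("Unknown format version: " + version);
        }
        return entries;
    }

    private static byte[] readBytes(DataInputStream in, int length) throws IOException {
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return bytes;
    }
}
```

In this sketch the VERSION 2 layout saves one integer per entry and removes the up-front index, while the version tag keeps VERSION 1 data readable.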

## Tests

Since the tests in `TypeSerializerSerializationUtilTest` and 
`PojoSerializerTest` already cover resilience to serializer read failures, 
and this PR only changes the way we store information to achieve the same 
functionality, no new tests are added.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/tzulitai/flink FLINK-6763

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/4014.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4014


commit 61c7ba2bb2d6dbe750227815593c1c4d952556d4
Author: Tzu-Li (Gordon) Tai 
Date:   2017-05-29T18:07:04Z

[FLINK-6763] [core] Make serialization of composite serializer configs more 
efficient

This commit affects the serialization formats of configuration snapshots
of composite serializers, most notably the PojoSerializer, as well as
others such as MapSerializer, GenericArraySerializer, TupleSerializer,
etc. It also affects the serialization formats of the
OperatorBackendSerializationProxy and KeyedBackendSerializationProxy.

Prior to this commit, whenever we write a serializer and its config
snapshot into a checkpoint, we always write the start offset and end
offset of the serializer bytes, effectively indexing every serializer
and its config. This required buffering the whole list of serializer and
config snapshot pairs when writing the checkpoint.

This commit changes this to be more efficient by just writing the length
of the serializer bytes prior to writing the serializer. This allows
lesser buffering for the writes.




> Inefficient PojoSerializerConfigSnapshot serialization format
> -
>
> Key: FLINK-6763
> URL: https://issues.apache.org/jira/browse/FLINK-6763
> Project: Flink
>  Issue Type: Improvement
>  Components: State Backends, Checkpointing, Type Serialization System
>Affects Versions: 1.3.0, 1.4.0
>Reporter: Till Rohrmann
>
> The {{PojoSerializerConfigSnapshot}} stores for each serializer the beginning 
> offset and ending offset in the serialization stream. This