Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail

2021-08-11 Thread Weston Woods
I am able to reproduce this failure by loading the production savepoint into a 
locally running 1.11 Flink job using the State Processor API. The same 
sequence of events occurs: the Kryo snapshot deserializer stores a null for the 
refactored Savepoint interface, which causes subsequent failures to restore 
operator state. The state backend is RocksDB.

Bodily copying the 1.9.0 source code for 
org.apache.flink.runtime.checkpoint.savepoint.Savepoint into my test job allows 
it to load the savepoint and restore the operator states. But that is a 
terrible workaround and I am looking for a good solution.





Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail

2021-08-04 Thread Robert Metzger
Hi Weston,

Oh indeed, you are right! I quickly tried restoring a 1.9 savepoint on a
1.11 runtime and it worked. So in principle this seems to be supported.

I'm including Timo in this thread; he has a lot of experience with the
serializers.



Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail

2021-08-03 Thread Weston Woods
Robert,

Thanks for your reply. How should I interpret the savepoint compatibility 
table here 
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/#compatibility-table
if a 1.9 savepoint cannot be restored into a 1.11 runtime?





Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail

2021-08-03 Thread Robert Metzger
Hi Weston,
I have never looked into the savepoint migration code paths myself, but I
know that savepoint migration across multiple versions is not supported
(1.9 can only migrate to 1.10, not 1.11). We have test coverage for these
migrations, and I would be surprised if this "Savepoint" class migration is
not covered in these tests.

Have you tried upgrading from 1.9 to 1.10, and then from 1.10 to 1.11?



Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail

2021-07-30 Thread Weston Woods
I am unable to restore a 1.9 savepoint into a 1.11 runtime for the very 
interesting reason that the Savepoint class was renamed and repackaged between 
those two releases. Apparently a Kryo serializer has that class registered in 
the 1.9 runtime. I can’t think of a good reason for that class to be 
registered with Kryo; none of the job operators reference any such thing. Yet 
there it is, causing the following exception and preventing upgrade to a new 
runtime.

Caused by: java.lang.IllegalStateException: Missing value for the key 'org.apache.flink.runtime.checkpoint.savepoint.Savepoint'
    at org.apache.flink.util.LinkedOptionalMap.unwrapOptionals(LinkedOptionalMap.java:190) ~[flink-dist_2.11-1.11.3.jar:1.11.3]
    at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshot.restoreSerializer(KryoSerializerSnapshot.java:86) ~[flink-dist_2.11-1.11.3.jar:1.11.3]

There doesn’t seem to be any way to unregister a class from Kryo. And the 
mechanism for dealing with missing classes looks to me like it has never worked 
as advertised. Instead of a dummy class being registered for a missing class 
name, a null gets registered, leading to the exception which prevents 
restoring the savepoint. The code that returns a null instead of a dummy is 
here: 
https://github.com/apache/flink/blob/e8cfe6701b9768d1f1fe4488640cba5f9b42d73f/flink-core/src/main/java/org/apache/flink/api/java/typeutils/runtime/kryo/KryoSerializerSnapshotData.java#L263

Resulting in this log.

2021-07-27 18:38:11,703 WARN org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotData [] - Cannot find registered class org.apache.flink.runtime.checkpoint.savepoint.Savepoint for Kryo serialization in classpath; using a dummy class as a placeholder.
java.lang.ClassNotFoundException: org.apache.flink.runtime.checkpoint.savepoint.Savepoint
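
The failure mode described above can be modeled in isolation. What follows is a minimal, self-contained Java sketch and not Flink's actual code: the class MissingKryoRegistrationSketch, its methods and the Placeholder class are all hypothetical names chosen for illustration. It assumes only the behavior reported in this message: a registered class name that cannot be resolved is recorded as a null, and a strict unwrap step then rejects the whole map with "Missing value for the key". The lenient variant shows what substituting a placeholder, as the log message claims, might look like.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the reported behavior. This is NOT Flink's code: every
// name in this file is hypothetical and chosen only for illustration.
public class MissingKryoRegistrationSketch {

    // Hypothetical stand-in for a registered class that no longer exists.
    public static final class Placeholder {}

    // Resolve each registered class name. A name that is not on the classpath
    // is recorded with a null value, as described in the message above.
    public static Map<String, Class<?>> resolve(List<String> registeredNames) {
        Map<String, Class<?>> resolved = new LinkedHashMap<>();
        for (String name : registeredNames) {
            try {
                resolved.put(name, Class.forName(name));
            } catch (ClassNotFoundException e) {
                resolved.put(name, null);
            }
        }
        return resolved;
    }

    // Strict unwrap: a single null value aborts the whole restore.
    public static Map<String, Class<?>> unwrapStrict(Map<String, Class<?>> resolved) {
        Map<String, Class<?>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Class<?>> entry : resolved.entrySet()) {
            if (entry.getValue() == null) {
                throw new IllegalStateException(
                        "Missing value for the key '" + entry.getKey() + "'");
            }
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }

    // Lenient unwrap: substitute a placeholder so the other entries survive.
    public static Map<String, Class<?>> unwrapWithPlaceholder(Map<String, Class<?>> resolved) {
        Map<String, Class<?>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Class<?>> entry : resolved.entrySet()) {
            Class<?> value = entry.getValue() != null ? entry.getValue() : Placeholder.class;
            result.put(entry.getKey(), value);
        }
        return result;
    }

    public static void main(String[] args) {
        // The second name is the 1.9 class; it resolves only if a 1.9 runtime
        // happens to be on the classpath.
        List<String> names = Arrays.asList(
                "java.lang.String",
                "org.apache.flink.runtime.checkpoint.savepoint.Savepoint");
        Map<String, Class<?>> resolved = resolve(names);
        try {
            unwrapStrict(resolved);
        } catch (IllegalStateException e) {
            System.out.println("strict: " + e.getMessage());
        }
        System.out.println("lenient: " + unwrapWithPlaceholder(resolved));
    }
}
```

In this model the strict path fails on the first unresolved name, while the lenient path keeps every other registration usable. Whether a placeholder is safe in a real restore depends on whether any state was actually serialized with the missing class, which this sketch does not address.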

One way or another I need to be able to restore a 1.9 savepoint into 1.11. 
Perhaps the Kryo registration needs to be cleansed from wherever it is lurking 
in the 1.9 savepoint, or an effective dummy needs to be substituted when 
reading it into 1.11.

Has anyone else encountered this problem, or have any advice to offer?