Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail
From: Weston Woods

I am able to reproduce this failure by loading the production savepoint into a locally running 1.11 Flink job using the State Processor API. The same sequence of events occurs: the Kryo snapshot deserializer stores a null for the refactored Savepoint interface, which causes subsequent failures to restore operator state. The state backend is RocksDB.

Copying the 1.9.0 source code for org.apache.flink.runtime.checkpoint.savepoint.Savepoint verbatim into my test job allows it to load the savepoint and restore the operator states. But that is a terrible workaround, and I am looking for a good solution.
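Since the workaround above depends on the old class name being loadable, a quick check of the job's classpath can confirm whether that is the case before a restore is attempted. The following is only a minimal sketch using plain JDK reflection; the class OldClassNameCheck and its method resolves are hypothetical names invented for illustration and are not part of Flink.

```java
public class OldClassNameCheck {

    /**
     * Returns true if the given fully qualified class name can be loaded
     * by the class loader that loaded this class, without initializing it.
     */
    public static boolean resolves(String className) {
        try {
            Class.forName(className, false, OldClassNameCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Name is not on the classpath, or the class cannot be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        // The class name reported in the exception from the original message.
        String oldName = "org.apache.flink.runtime.checkpoint.savepoint.Savepoint";
        System.out.println(oldName + " resolves: " + resolves(oldName));
    }
}
```

Run with the same classpath as the job: if the 1.9.0 source has been copied into the job as described, the check should report that the name resolves; on a stock 1.11 classpath it should not.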
Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail
From: Robert Metzger
Date: Wednesday, August 4, 2021 at 10:21 AM
To: Weston Woods
Cc: "user@flink.apache.org", Timo Walther

Hi Weston,

Oh indeed, you are right! I quickly tried restoring a 1.9 savepoint on a 1.11 runtime and it worked. So in principle this seems to be supported.

I'm including Timo in this thread; he has a lot of experience with the serializers.
Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail
From: Weston Woods
Date: Tue, Aug 3, 2021 at 6:59 PM

Robert,

Thanks for your reply. How should I interpret the savepoint compatibility table here
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/ops/upgrading/#compatibility-table
if a 1.9 savepoint cannot be restored into a 1.11 runtime?
Re: Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail
From: Robert Metzger
Date: Tuesday, August 3, 2021 at 11:52 AM
To: Weston Woods
Cc: "user@flink.apache.org"

Hi Weston,

I have never looked into the savepoint migration code paths myself, but I know that savepoint migration across multiple versions is not supported (1.9 can only migrate to 1.10, not 1.11). We have test coverage for these migrations, and I would be surprised if this "Savepoint" class migration is not covered in these tests.

Have you tried upgrading from 1.9 to 1.10, and then from 1.10 to 1.11?
Savepoint class refactor in 1.11 causing restore from 1.9 savepoint to fail
From: Weston Woods
Date: Fri, Jul 30, 2021 at 11:53 PM

I am unable to restore a 1.9 savepoint into a 1.11 runtime for the very interesting reason that the Savepoint class was renamed and repackaged between those two releases. Apparently a Kryo serializer has that class registered in the 1.9 runtime. I can’t think of a good reason for that class to be registered with Kryo; none of the job operators reference any such thing. Yet there it is, causing the following exception and preventing an upgrade to a new runtime.

Caused by: java.lang.IllegalStateException: Missing value for the key 'org.apache.flink.runtime.checkpoint.savepoint.Savepoint'
    at org.apache.flink.util.LinkedOptionalMap.unwrapOptionals(LinkedOptionalMap.java:190) ~[flink-dist_2.11-1.11.3.jar:1.11.3]
    at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshot.restoreSerializer(KryoSerializerSnapshot.java:86) ~[flink-dist_2.11-1.11.3.jar:1.11.3]

There doesn’t seem to be any way to unregister a class from Kryo. And the mechanism for dealing with missing classes looks to me like it has never worked as advertised. Instead of registering a dummy class for a missing class name, a null gets registered, leading to the exception which prevents restoring the savepoint. The code that returns a null instead of a dummy is here:
https://github.com/apache/flink/blob/e8cfe6701b9768d1f1fe4488640cba5f9b42d73f/flink-core/src/main/java/org/apache/flink/api/java/typeutils/runtime/kryo/KryoSerializerSnapshotData.java#L263

This results in the following log:

2021-07-27 18:38:11,703 WARN org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializerSnapshotData [] - Cannot find registered class org.apache.flink.runtime.checkpoint.savepoint.Savepoint for Kryo serialization in classpath; using a dummy class as a placeholder.
java.lang.ClassNotFoundException: org.apache.flink.runtime.checkpoint.savepoint.Savepoint

One way or another, I need to be able to restore a 1.9 savepoint into 1.11. Perhaps the Kryo registration needs to be cleansed from wherever it is lurking in the 1.9 savepoint, or an effective dummy needs to be substituted when reading it into 1.11.

Has anyone else encountered this problem, or have any advice to offer?
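To make the failure mode described above easier to follow, here is a small, self-contained Java model of it. This is not Flink's code, and it makes no claim about how LinkedOptionalMap or KryoSerializerSnapshotData are actually implemented: the class MissingRegistrationModel and its methods resolveRegistrations and unwrap are hypothetical names, and the logic only assumes the behaviour reported in this message (an unresolvable class name ends up with no value, and the later unwrap step rejects it).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

public class MissingRegistrationModel {

    /** Maps each registered class name to its class, or to an empty Optional if it cannot be loaded. */
    public static Map<String, Optional<Class<?>>> resolveRegistrations(Iterable<String> names) {
        Map<String, Optional<Class<?>>> resolved = new LinkedHashMap<>();
        ClassLoader loader = MissingRegistrationModel.class.getClassLoader();
        for (String name : names) {
            try {
                resolved.put(name, Optional.of(Class.forName(name, false, loader)));
            } catch (ClassNotFoundException e) {
                // No placeholder is substituted here; the entry is simply left empty.
                resolved.put(name, Optional.empty());
            }
        }
        return resolved;
    }

    /** Unwraps every entry and fails on the first one that has no value. */
    public static Map<String, Class<?>> unwrap(Map<String, Optional<Class<?>>> resolved) {
        Map<String, Class<?>> out = new LinkedHashMap<>();
        for (Map.Entry<String, Optional<Class<?>>> e : resolved.entrySet()) {
            Class<?> c = e.getValue().orElseThrow(
                () -> new IllegalStateException("Missing value for the key '" + e.getKey() + "'"));
            out.put(e.getKey(), c);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Optional<Class<?>>> resolved = resolveRegistrations(Arrays.asList(
            "java.lang.String",
            "org.apache.flink.runtime.checkpoint.savepoint.Savepoint"));
        try {
            unwrap(resolved);
            System.out.println("All registered classes resolved");
        } catch (IllegalStateException e) {
            System.out.println("Restore failed: " + e.getMessage());
        }
    }
}
```

In this model, either of the two remedies suggested above would avoid the exception: dropping the stale name from the list of registrations, or substituting a real placeholder class for it so that the unwrap step finds a value for every key.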