I think your analysis is correct; I'll file a ticket.

On 03/06/2022 15:28, Nick Birnberg wrote:
Hello everyone!

Our current setup has us running Flink on Kubernetes in HA mode (Zookeeper) with multiple JobManagers. This appears to be a regression from 1.14.

We can reproduce this with the flink CLI, which talks to the REST API. We target a standby JobManager directly (using `kubectl port-forward $STANDBY_JM 8081`) and then run `flink savepoint -m localhost:8081 $JOB_ID`. This command triggers the savepoint via the REST API and polls for its status using the triggerId.
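
For anyone who wants to reproduce this without the CLI, here is a rough sketch of the two REST requests involved, as far as I understand the savepoint endpoints (`POST /jobs/:jobid/savepoints` to trigger, `GET /jobs/:jobid/savepoints/:triggerid` to poll). The class name, the helper methods and the target directory are made up for illustration; the sketch only builds the requests and does not send them.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class SavepointRequests {

    // Builds the trigger request: POST /jobs/:jobid/savepoints.
    // targetDirectory is a hypothetical savepoint location.
    static HttpRequest trigger(String baseUrl, String jobId, String targetDirectory) {
        String body = "{\"target-directory\": \"" + targetDirectory + "\", \"cancel-job\": false}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/jobs/" + jobId + "/savepoints"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    // Builds the status request: GET /jobs/:jobid/savepoints/:triggerid.
    // This is the call that fails when it is served by a standby JobManager.
    static HttpRequest status(String baseUrl, String jobId, String triggerId) {
        return HttpRequest.newBuilder(
                        URI.create(baseUrl + "/jobs/" + jobId + "/savepoints/" + triggerId))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        String base = "http://localhost:8081"; // the port-forwarded standby JobManager
        String jobId = "488f4846310e2763dd1c338d7d7f55bb";
        String triggerId = "10e6bb05749f572cf4ee5eee9b4959c7";

        HttpRequest t = trigger(base, jobId, "file:///tmp/savepoints");
        HttpRequest s = status(base, jobId, triggerId);
        System.out.println(t.method() + " " + t.uri());
        System.out.println(s.method() + " " + s.uri());
    }
}
```

Sending these with `java.net.http.HttpClient` against the standby's forwarded port should hit the same code path as the CLI, assuming the trigger response's request id is used as the triggerId.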

Relevant stack trace:

org.apache.flink.runtime.rest.util.RestClientException: [org.apache.flink.runtime.rest.handler.RestHandlerException: Internal server error while retrieving status of savepoint operation with triggerId=10e6bb05749f572cf4ee5eee9b4959c7 for job 488f4846310e2763dd1c338d7d7f55bb.
    at org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers.createInternalServerError(SavepointHandlers.java:352)
    at org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers.access$000(SavepointHandlers.java:115)
    at org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler.lambda$null$0(SavepointHandlers.java:311)
...
Caused by: org.apache.flink.runtime.rpc.akka.exceptions.AkkaRpcException: Failed to serialize the result for RPC call : getTriggeredSavepointStatus.
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.serializeRemoteResultAndVerifySize(AkkaRpcActor.java:405)
...
Caused by: java.io.NotSerializableException: org.apache.flink.runtime.rest.handler.async.OperationResult
    at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1185)
    at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349)
    at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:632)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcSerializedValue.valueOf(AkkaRpcSerializedValue.java:66)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.serializeRemoteResultAndVerifySize(AkkaRpcActor.java:388)
    ... 30 more
]
    at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:532)

The savepoint itself is successful and this is not a problem if we target the leader JobManager. This seems similar to https://issues.apache.org/jira/browse/FLINK-26779 and I would think that the solution would be to have org.apache.flink.runtime.rest.handler.async.OperationResult implement Serializable, but I wanted a quick sanity check to make sure this is reproducible outside of our environment before moving forward.
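
To illustrate the failure mode in isolation, here is a minimal sketch using plain Java serialization, which is what the trace suggests is used for remote RPC results (`ObjectOutputStream.writeObject`). The two result classes are hypothetical stand-ins, not Flink's actual `OperationResult`; the only difference between them is the `Serializable` marker.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class SerializableSketch {

    // Hypothetical stand-in for a result type that does NOT implement Serializable.
    static class PlainResult {
        final String value;
        PlainResult(String value) { this.value = value; }
    }

    // Same shape, but marked Serializable.
    static class SerializableResult implements Serializable {
        private static final long serialVersionUID = 1L;
        final String value;
        SerializableResult(String value) { this.value = value; }
    }

    // Returns true if Java serialization accepts the object,
    // false if it throws NotSerializableException.
    static boolean canSerialize(Object o) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("plain: " + canSerialize(new PlainResult("done")));
        System.out.println("serializable: " + canSerialize(new SerializableResult("done")));
    }
}
```

If the leader answers the status request locally and only a standby has to fetch the result over RPC, that would be consistent with the error appearing only on standby JobManagers, though I have not verified this against the Flink source.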

Thank you!

