Re: Unable to retrieve savepoint status from non-leader/standby in HA with Flink 1.15

Chesnay Schepler Tue, 07 Jun 2022 01:04:11 -0700

I think your analysis is correct; I'll file a ticket.


On 03/06/2022 15:28, Nick Birnberg wrote:

Hello everyone!
Our current setup has us running Flink on Kubernetes in HA mode (Zookeeper) with multiple JobManagers. This appears to be a regression from 1.14.
We can use the flink CLI to communicate with the REST API to reproduce this. We directly target a standby JobManager (by using `kubectl port-forward $STANDBY_JM 8081`) and then run `flink savepoint -m localhost:8081 $JOB_ID`. This command triggers the savepoint via the REST API and polls for its status using the triggerId.
Relevant stack trace:
org.apache.flink.runtime.rest.util.RestClientException: [org.apache.flink.runtime.rest.handler.RestHandlerException: Internal server error while retrieving status of savepoint operation with triggerId=10e6bb05749f572cf4ee5eee9b4959c7 for job 488f4846310e2763dd1c338d7d7f55bb.
    at org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers.createInternalServerError(SavepointHandlers.java:352)
    at org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers.access$000(SavepointHandlers.java:115)
    at org.apache.flink.runtime.rest.handler.job.savepoints.SavepointHandlers$SavepointStatusHandler.lambda$null$0(SavepointHandlers.java:311)
    ...
Caused by: org.apache.flink.runtime.rpc.akka.exceptions.AkkaRpcException: Failed to serialize the result for RPC call : getTriggeredSavepointStatus.
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.serializeRemoteResultAndVerifySize(AkkaRpcActor.java:405)
    ...
Caused by: java.io.NotSerializableException: org.apache.flink.runtime.rest.handler.async.OperationResult
    at java.base/java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1185)
    at java.base/java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:349)
    at org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:632)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcSerializedValue.valueOf(AkkaRpcSerializedValue.java:66)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.serializeRemoteResultAndVerifySize(AkkaRpcActor.java:388)
    ... 30 more
]
    at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:532)
The savepoint itself is successful and this is not a problem if we target the leader JobManager. This seems similar to https://issues.apache.org/jira/browse/FLINK-26779 and I would think that the solution would be to have org.apache.flink.runtime.rest.handler.async.OperationResult implement Serializable, but I wanted a quick sanity check to make sure this is reproducible outside of our environment before moving forward.
Thank you!
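
For readers unfamiliar with the failure mode in the trace above, here is a minimal standalone sketch of why Java serialization rejects a class that does not implement Serializable, and how adding the marker interface changes that. It does not use Flink: `PlainResult`, `SerializableResult`, and `canSerialize` are hypothetical stand-ins invented for illustration, not Flink's actual OperationResult or its RPC code path.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationSketch {

    // Hypothetical stand-in for a result type that lacks the marker interface.
    static class PlainResult {
        final String status;

        PlainResult(String status) {
            this.status = status;
        }
    }

    // Same shape, but declared Serializable (the kind of change proposed in the thread).
    static class SerializableResult implements Serializable {
        private static final long serialVersionUID = 1L;
        final String status;

        SerializableResult(String status) {
            this.status = status;
        }
    }

    // Returns true if standard Java serialization accepts the object,
    // false if it throws NotSerializableException.
    static boolean canSerialize(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("PlainResult serializable: "
                + canSerialize(new PlainResult("COMPLETED")));
        System.out.println("SerializableResult serializable: "
                + canSerialize(new SerializableResult("COMPLETED")));
    }
}
```

Note that the marker interface alone is only sufficient when every non-transient field of the object is itself serializable, so a real fix would also need to check what the result type holds.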
