Re: State Processor API: StateMigrationException for keyed state

2019-12-13 Thread Peter Westermann
) at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:270) ... 13 more From: vino yang Date: Thursday, December 12, 2019 at 8:46 PM To: Peter Westermann Cc: user Subject: Re: State Processor API: StateMigrationException for keyed state Hi pwestermann, Can you share the relevant

Zookeeper connection loss causing checkpoint corruption

2020-09-21 Thread Peter Westermann
I recently ran into an issue with our Flink cluster: a ZooKeeper service deployment caused a temporary connection loss and triggered a new JobManager leader election. The leader election was successful and our Flink job restarted from the last checkpoint. This checkpoint appears to have been taken

Feature request: Removing state from operators

2020-10-26 Thread Peter Westermann
We remove stateful operators via the allowNonRestoredState option relatively often, and it works great. However, there doesn’t seem to be anything comparable for removing state from an existing operator that we want to keep. Say my operator defines a MapState and a ValueState.

NotSerializableException: org.apache.flink.runtime.rest.messages.ResourceProfileInfo

2020-07-22 Thread Peter Westermann
I just started testing Flink 1.11.1 and noticed that the Task Managers section in the UI doesn’t load. The exception in the log is: j.i.NotSerializableException: org.apache.flink.runtime.rest.messages.ResourceProfileInfo at j.i.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184) at

Re: Feature request: Removing state from operators

2020-10-29 Thread Peter Westermann
> listStates(); // Completely remove a state void dropState(StateDescriptor stateDescriptor); Thanks, Peter From: Congxian Qiu Date: Thursday, October 29, 2020 at 10:38 AM To: Robert Metzger Cc: Peter Westermann, "user@flink.apache.org" Subject: Re: Feature request: Removing state f

Re: Feature request: Removing state from operators

2020-11-02 Thread Peter Westermann
Renaming operators and manually migrating the state we still need is what we have done in the past. I was just hoping for a more convenient solution. Peter From: David Anderson Date: Friday, October 30, 2020 at 5:55 PM To: Peter Westermann, "user@flink.apache.org" Subject: R

Re: Job recovery issues with state restoration

2021-05-26 Thread Peter Westermann
/mnt/data is a local disk, so there shouldn’t be any additional latency. I’ll provide more information when/if this happens again. Peter From: Roman Khachatryan Date: Tuesday, May 25, 2021 at 6:54 PM To: Peter Westermann Cc: user@flink.apache.org Subject: Re: Job recovery issues with state

Re: Job recovery issues with state restoration

2021-05-25 Thread Peter Westermann
for task local recovery, those would be in a different directory (we have configured io.tmp.dirs as /mnt/data/tmp). Thanks, Peter From: Roman Khachatryan Date: Thursday, May 20, 2021 at 4:54 PM To: Peter Westermann Cc: user@flink.apache.org Subject: Re: Job recovery issues with state restoration

Job recovery issues with state restoration

2021-05-20 Thread Peter Westermann
Hello, I’ve previously reported issues with checkpoint recovery after a job failure caused by ZooKeeper connection loss, and I am still seeing them occasionally. This is for Flink 1.12.3 with ZooKeeper for HA, S3 as the state backend, incremental checkpoints, and task-local recovery enabled.

Re: Duplicate copies of job in Flink UI/API

2021-09-09 Thread Peter Westermann
election is expected. Thanks, Peter From: Piotr Nowojski Date: Thursday, September 9, 2021 at 12:39 AM To: Peter Westermann Cc: user@flink.apache.org Subject: Re: Duplicate copies of job in Flink UI/API Hi Peter, Can you provide relevant JobManager logs? And can you write down what steps have you

Re: Duplicate copies of job in Flink UI/API

2021-09-09 Thread Peter Westermann
: Chesnay Schepler Date: Thursday, September 9, 2021 at 9:11 AM To: Peter Westermann, Piotr Nowojski, user@flink.apache.org Subject: Re: Duplicate copies of job in Flink UI/API Just to double-check that I'm understanding things correctly: You have a job with HA, then Zookeeper breaks down, the job

Duplicate copies of job in Flink UI/API

2021-09-08 Thread Peter Westermann
We recently upgraded from Flink 1.12.4 to 1.12.5 and are seeing some odd behavior after a change in JobManager leadership: there are two copies of the same job, and one of them is in SUSPENDED state with a start time of zero. Here’s the output from the /jobs/overview endpoint: { "jobs":

Issue with Flink UI for Flink 1.14.0

2021-10-13 Thread Peter Westermann
to the REST interface. If I request job data from /v1/jobs/{jobId}, I get the expected response on the leader, but on the other job manager I only get an exception stack trace: {"errors":["Internal server error.",""]} Peter Westermann Team Lead – Realtime Analytics

Missing dependency in flink-shaded-zookeeper-35

2021-10-04 Thread Peter Westermann
-parent/flink-shaded-zookeeper-35/pom.xml#L47). Looks like this is not correct if you want to use SSL. Adding jars for netty-handler and netty-transport-native-epoll to the lib folder addressed this issue. Perhaps this could be addressed in the next release for flink-shaded? Thanks, Peter

Re: Missing dependency in flink-shaded-zookeeper-35

2021-10-04 Thread Peter Westermann
Thanks! From: Chesnay Schepler Date: Monday, October 4, 2021 at 9:27 AM To: Peter Westermann , user Subject: Re: Missing dependency in flink-shaded-zookeeper-35 Indeed, it looks like the client-server SSL support added in 3.5 is implemented with netty. I will create a ticket. On 04/10/2021 15

Re: Issue with Flink UI for Flink 1.14.0

2022-03-18 Thread Peter Westermann
get the following error: {"errors":["Internal server error.",""]} Peter Westermann Analytics Software Architect peter.westerm...@genesys.com

Re: Issue with Flink UI for Flink 1.14.0

2022-01-20 Thread Peter Westermann
Just tried this again with Flink 1.14.3 since https://issues.apache.org/jira/browse/FLINK-24550 is listed as fixed. I am running into similar errors when calling the /v1/jobs/overview endpoint (without any running jobs): {"errors":["Internal server error.",""]}

Sporadic issues with savepoint status lookup in Flink 1.15

2022-06-16 Thread Peter Westermann
Peter Westermann Analytics Software Architect peter.westerm...@genesys.com <http://www.genesys.com/>

Re: Sporadic issues with savepoint status lookup in Flink 1.15

2022-06-16 Thread Peter Westermann
If it happens, it happens immediately. Once we receive the triggerId from /jobs/:jobid/stop or /jobs/:jobid/savepoints, we poll /jobs/:jobid/savepoints/:triggerid every second until the status is no longer IN_PROGRESS. Peter Westermann Analytics Software Architect
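
The polling loop described in this message could look roughly like the sketch below. It is only an illustration, not the poster's actual client: the function names, the `max_polls` cap, and the assumed response shape (a JSON body with a `status.id` field) are assumptions, and the base URL would have to point at a real JobManager REST endpoint.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_savepoint_status(base_url: str, job_id: str, trigger_id: str) -> dict:
    """GET /jobs/:jobid/savepoints/:triggerid and decode the JSON body.

    base_url is a placeholder such as "http://jobmanager:8081" (hypothetical).
    """
    url = f"{base_url}/jobs/{job_id}/savepoints/{trigger_id}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def wait_for_savepoint(
    fetch: Callable[[], dict],
    interval_s: float = 1.0,
    max_polls: int = 600,
) -> dict:
    """Call fetch() repeatedly until the status is no longer IN_PROGRESS.

    Assumes the response looks like {"status": {"id": "IN_PROGRESS"}, ...}.
    Returns the first response with a different status, or raises
    TimeoutError after max_polls attempts (the cap is an added safeguard,
    not something mentioned in the thread).
    """
    for _ in range(max_polls):
        body = fetch()
        if body.get("status", {}).get("id") != "IN_PROGRESS":
            return body
        time.sleep(interval_s)
    raise TimeoutError(f"savepoint still IN_PROGRESS after {max_polls} polls")
```

A caller would wrap the HTTP request in a closure, for example `wait_for_savepoint(lambda: fetch_savepoint_status(base_url, job_id, trigger_id))`. Passing the fetch step in as a callable keeps the loop testable without a running cluster; any retry on HTTP errors would go inside that closure.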

Re: Sporadic issues with savepoint status lookup in Flink 1.15

2022-06-16 Thread Peter Westermann
retry such operations without triggering multiple savepoints. Could this have anything to do with the error I am seeing? Peter Westermann Analytics Software Architect peter.westerm...@genesys.com