Hi Seth,

Thanks for sharing how you resolved the problem!

The problem might have been related to Flink's key groups, which are used to
assign key ranges to tasks.
I'm not sure why this would be related to ZooKeeper being in a bad state.
Maybe Stefan (in CC) has an idea about the cause.
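For context, here is a minimal Java sketch of the key-group idea. It is meant to mirror, as far as we know, the arithmetic in Flink's KeyGroupRangeAssignment. KeyGroupSketch and its methods are hypothetical names. The sketch also deliberately skips the murmur hash that Flink applies to key.hashCode() and uses Math.floorMod on the raw hash code instead.

```java
// Hypothetical sketch of how keys map to key groups, and key groups to subtasks.
// Assumption: Flink murmur-hashes key.hashCode() before the modulo; this sketch
// uses Math.floorMod on the raw hashCode for simplicity.
public final class KeyGroupSketch {

    private KeyGroupSketch() {}

    /** Key group of a key (simplified: no murmur hash). */
    public static int keyGroupForKey(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    /** Index of the parallel subtask that owns the given key group. */
    public static int operatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroup) {
        return keyGroup * parallelism / maxParallelism;
    }

    /** First key group (inclusive) owned by a subtask. */
    public static int rangeStart(int maxParallelism, int parallelism, int operatorIndex) {
        return (operatorIndex * maxParallelism + parallelism - 1) / parallelism;
    }

    /** Last key group (inclusive) owned by a subtask. */
    public static int rangeEnd(int maxParallelism, int parallelism, int operatorIndex) {
        return ((operatorIndex + 1) * maxParallelism - 1) / parallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128; // number of key groups (the job's max parallelism)
        int parallelism = 3;      // number of parallel subtasks
        for (int i = 0; i < parallelism; i++) {
            System.out.printf("subtask %d owns key groups [%d, %d]%n",
                i, rangeStart(maxParallelism, parallelism, i), rangeEnd(maxParallelism, parallelism, i));
        }
        int kg = keyGroupForKey("user-42", maxParallelism);
        System.out.printf("key 'user-42' -> key group %d -> subtask %d%n",
            kg, operatorIndexForKeyGroup(maxParallelism, parallelism, kg));
    }
}
```

With 128 key groups and a parallelism of 3, main prints that subtask 0 owns key groups [0, 42], subtask 1 owns [43, 85] and subtask 2 owns [86, 127]. The ranges are contiguous, so each subtask restores a contiguous slice of keyed state.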

Also, it would be helpful if you could share the stacktrace of the
exception (in case you still have it).

Best, Fabian

2018-03-13 14:35 GMT+01:00 Seth Wiesman <swies...@mediamath.com>:

> It turns out the issue was due to our ZooKeeper installation being in a
> bad state. I am not clear enough on Flink's networking internals to explain
> how this manifested as a PartitionNotFoundException, but hopefully this
> can serve as a starting point for others who run into the same issue.
>
> *Seth Wiesman* | Software Engineer, 4 World Trade Center, 46th Floor, New
> York, NY 10007, swies...@mediamath.com
>
> *From: *Seth Wiesman <swies...@mediamath.com>
> *Date: *Friday, March 9, 2018 at 11:53 AM
> *To: *"user@flink.apache.org" <user@flink.apache.org>
> *Subject: *PartitionNotFoundException when restarting from checkpoint
>
> Hi,
>
> We are running Flink 1.4.0 in a YARN deployment on EC2 instances, with
> RocksDB and incremental checkpointing. Last night a job failed and became
> stuck in a restart cycle with a PartitionNotFoundException. We tried
> restoring from the checkpoint on a fresh Flink session, with no luck.
> Looking through the logs, we can see that the specified partition is never
> registered with the ResultPartitionManager.
>
> My questions are:
>
> 1) Are partitions part of the state, or are they ephemeral to the job?
>
> 2) If they are not part of the state, where would the task managers be
> getting that partition ID to begin with?
>
> 3) Right now we are logging everything under
> org.apache.flink.runtime.io.network. Is there anywhere else to look?
>
> Thank you,