Ufuk: I don’t know why. +1 for your other suggestions.
Piotrek

> On 4 May 2018, at 14:52, Ufuk Celebi <u...@data-artisans.com> wrote:
>
> Hey Gyula!
>
> I'm including Piotr and Nico (cc'd) who have worked on the network
> stack in the last releases.
>
> Registering the network structures including the intermediate results
> actually happens **before** any state is restored. I'm not sure why
> this reproducibly happens when you restore state. @Nico, Piotr: any
> ideas here?
>
> In general I think what happens here is the following:
> - a task requests the result of a local upstream producer, but that
> one has not registered its intermediate result yet
> - this should result in a retry of the request with some backoff
> (controlled via the config params you mention
> taskmanager.network.request-backoff.max,
> taskmanager.network.request-backoff.initial)
>
> As a first step I would set logging to DEBUG and check the TM logs for
> messages like "Retriggering partition request {}:{}."
>
> You can also check the SingleInputGate code which has the logic for
> retriggering requests.
>
> – Ufuk
>
>
> On Fri, May 4, 2018 at 10:27 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>> Hi Ufuk,
>>
>> Do you have any quick idea what could cause this problem in Flink 1.4.2?
>> It seems like one operator takes too long to deploy and downstream tasks error
>> out on partition not found. This only seems to happen when the job is
>> restored from state, and in fact that operator has some keyed and operator
>> state as well.
>>
>> Deploying the same job from empty state works well. We tried increasing
>> taskmanager.network.request-backoff.max, but that didn't help.
>>
>> It would be great if you have some pointers on where to look further. I haven't
>> seen this happening before.
>>
>> Thank you!
>> Gyula
>>
>> The error:
>> org.apache.flink.runtime.io.network.partition.: Partition
>> 4c5e9cd5dd410331103f51127996068a@b35ef4ffe25e3d17c5d6051ebe2860cd not found.
>>     at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:77)
>>     at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:115)
>>     at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel$1.run(LocalInputChannel.java:159)
>>     at java.util.TimerThread.mainLoop(Timer.java:555)
>>     at java.util.TimerThread.run(Timer.java:505)
>
>
>
> --
> Data Artisans GmbH | Stresemannstr. 121a | 10963 Berlin
>
> i...@data-artisans.com
> +49-30-43208879
>
> Registered at Amtsgericht Charlottenburg - HRB 158244 B
> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
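
P.S. To make the backoff discussion concrete, here is a rough sketch of how taskmanager.network.request-backoff.initial and taskmanager.network.request-backoff.max might interact. It is not the Flink implementation: the class and method names (BackoffSketch, schedule) are made up for illustration, and the doubling rule and the 100 ms / 10000 ms values are assumptions from memory that should be checked against the 1.4.2 source and your own configuration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the Flink code base.
public class BackoffSketch {

    // Returns the sequence of retry delays (in ms), ASSUMING the delay starts
    // at initialMs, doubles after every failed request, is capped at maxMs,
    // and no further retry happens once the cap has been used.
    static List<Long> schedule(long initialMs, long maxMs) {
        List<Long> delays = new ArrayList<>();
        long current = initialMs;
        while (current > 0) {
            delays.add(current);
            if (current >= maxMs) {
                break; // cap reached: the next failure would be reported
            }
            current = Math.min(current * 2, maxMs);
        }
        return delays;
    }

    public static void main(String[] args) {
        // Assumed values; substitute whatever your flink-conf.yaml contains.
        List<Long> delays = schedule(100, 10000);
        long total = 0;
        for (long d : delays) {
            total += d;
        }
        System.out.println("retry delays (ms): " + delays);
        System.out.println("total wait before giving up (ms): " + total);
    }
}
```

Under these assumptions, the sum of the delays is the window in which the upstream task has to register its partition. Comparing that number with how long the slow operator takes to deploy should show whether the configured backoff is even in the right range.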