Re: Odd job failure

2018-05-29 Thread Piotr Nowojski
Hi, could you post the full output of the mvn dependency:tree command for your project? Can you reproduce this issue with a minimal project stripped of any custom code and external dependencies other than Flink itself? Thanks, Piotrek > On 28 May 2018, at 20:13, Elias Levy wrote: > > On Mon,

Re: Odd job failure

2018-05-28 Thread Elias Levy
On Mon, May 28, 2018 at 1:48 AM, Piotr Nowojski wrote: > The most likely suspect is the standard Java problem of a dependency > convergence issue. Please check that you are not pulling multiple Kafka > versions into your class path. In particular, your job shouldn’t pull in any Kafka > library except
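One way to check at runtime for the kind of duplicate described above is to ask a class loader for every location from which it can see a given class file. The following is only a minimal sketch: the class name ClasspathDuplicates and its two helper methods are hypothetical, and org.apache.kafka.clients.consumer.KafkaConsumer is used merely as an illustrative default. Inside a Flink job the user-code class loader differs from the system class loader, so the check would have to run in the job's own context to reflect what the job actually sees.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Flink or Kafka.
final class ClasspathDuplicates {

    // Converts a fully qualified class name to its class-file resource path.
    static String toResourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns every location from which the given loader can see the class file.
    // More than one entry suggests the class is on the class path more than once.
    static List<URL> locate(ClassLoader loader, String className) {
        try {
            return Collections.list(loader.getResources(toResourcePath(className)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed default; pass a different class name as the first argument.
        String name = args.length > 0
                ? args[0]
                : "org.apache.kafka.clients.consumer.KafkaConsumer";

        // Prefer the context class loader, falling back to the system loader.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }

        List<URL> hits = locate(loader, name);
        System.out.println(name + " visible from " + hits.size() + " location(s)");
        for (URL url : hits) {
            System.out.println("  " + url);
        }
    }
}
```

If more than one location is printed, the listed jars are the first candidates to compare against the mvn dependency:tree output.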

Re: Odd job failure

2018-05-28 Thread Piotr Nowojski
Hi, I think that’s unlikely to happen. As far as I know, the only way to actually unload classes in the JVM is for their class loader to be garbage collected, which means all references to it in the code must vanish. In other words, it should never happen that a class is not found while anyone

Re: Odd job failure

2018-05-26 Thread Elias Levy
Piotr & Stephan, Thanks for the replies. Apologies for the late response. I've been traveling for the past month. We've not observed this issue (spilling) again, but it is good to know that 1.5 will use back-pressure based alignment. I think for now we'll leave

Re: Odd job failure

2018-05-03 Thread Stephan Ewen
Concerning the connectivity issue - it is hard to say anything more without any logs or details. Does the JM log that it is trying to send tasks to the 3rd TM, but the TM does not show signs of executing them? On Thu, May 3, 2018 at 10:22 AM, Stephan Ewen wrote: > Hi Elias!

Re: Odd job failure

2018-05-03 Thread Stephan Ewen
Hi Elias! Concerning the spilling of alignment data to disk: - In 1.4.x, you can set an upper limit via "task.checkpoint.alignment.max-size". See [1]. - In 1.5.x, the default is a back-pressure based alignment, which does not spill any more. Best, Stephan [1]

Re: Odd job failure

2018-05-02 Thread Piotr Nowojski
Hi, It might be some Kafka issue. From what you described, your reasoning seems sound. For some reason TM3 fails and is unable to restart and process any data, thus forcing spilling on checkpoint barriers on TM1 and TM2. I don’t know the reason behind java.lang.NoClassDefFoundError:

Odd job failure

2018-04-27 Thread Elias Levy
We had a job on a Flink 1.4.2 cluster with three TMs experience an odd failure the other day. It seems that it started as some sort of network event. It began with the 3rd TM starting to warn every 30 seconds about socket timeouts while sending metrics to DataDog. This lasted for the whole