Re: Flink 1.5 Yarn Connection unexpectedly closed

2018-06-24 Thread Till Rohrmann
Great to hear that you could solve your problem, Garrett. What happens when you call `collect` is that Flink sends the job that has been defined up to this point to the cluster for execution and waits until it has retrieved the result. Once the result has been obtained, the Flink

Re: Flink 1.5 Yarn Connection unexpectedly closed

2018-06-22 Thread Garrett Barton
I don't know why yet, but I did figure it out. After my sample long-running map-reduce test ran fine all night, I tried a ton of things. It turns out there is a difference between env.execute() and env.collect(). My flow had reading from HDFS, decrypting, processing, and finally writing to HDFS, at

Re: Flink 1.5 Yarn Connection unexpectedly closed

2018-06-22 Thread Till Rohrmann
Hi Garrett, have you set a restart strategy for your job [1]? In order to recover from failures you need to specify one. Otherwise Flink will terminally fail the job in case of a failure. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/restart_strategies.html Cheers, Till
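
For reference, a minimal sketch of a cluster-wide restart strategy in flink-conf.yaml, using the fixed-delay keys described in the Flink 1.5 restart strategies documentation linked at [1]. The values shown (3 attempts, 10 s delay) are illustrative assumptions, not settings taken from this thread:

```yaml
# flink-conf.yaml (sketch; the attempt count and delay are assumed example values)
# Restart a failed job up to 3 times, waiting 10 seconds between attempts.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

The same kind of strategy can also be set per job on the execution environment, as described in [1]; which of the two fits better depends on whether all jobs on the cluster should share the setting.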

Re: Flink 1.5 Yarn Connection unexpectedly closed

2018-06-21 Thread Garrett Barton
Actually, random thought: could YARN preemption be causing this? What is the failure scenario if a task manager that is doing real work goes down in YARN? The docs make it sound like it should fire up another TM and get back to work out of the box, but I'm not seeing that. On Thu,

Re: Flink 1.5 Yarn Connection unexpectedly closed

2018-06-21 Thread Garrett Barton
Thank you all for the replies! I am running batch jobs: I read in a handful of files from HDFS and output to HBase, HDFS, and Kafka. I run into this when I have partial usage of the cluster as the job runs. Right now I spin up 20 nodes with 3 slots each; at peak my job uses all 60 slots, but by the

Re: Flink 1.5 Yarn Connection unexpectedly closed

2018-06-21 Thread Till Rohrmann
Hi Garrett, killing idle TaskManagers should not affect the execution of the job. By definition, a TaskManager is only idle if it does not execute any tasks. Could you maybe share the complete logs (of the cluster entrypoint and all TaskManagers) with us? Cheers, Till On Thu, Jun 21, 2018 at

Re: Flink 1.5 Yarn Connection unexpectedly closed

2018-06-21 Thread Fabian Hueske
Hi Garrett, I agree, there seems to be an issue, and increasing the timeout should not be the right approach to solve it. Are you running streaming or batch jobs, i.e., do some of the tasks finish much earlier than others? I'm adding Till to this thread, who's very familiar with scheduling and

Flink 1.5 Yarn Connection unexpectedly closed

2018-06-18 Thread Garrett Barton
Hey all, my jobs that I am trying to write in Flink 1.5 are failing after a few minutes. I think it's because the idle task managers are shutting down, which seems to kill the client and the running job. The running job itself was still going on one of the other task managers. I get: