Re: Batch job stuck in Canceled state in Flink 1.5

2018-05-29 Thread Amit Jain
Thanks, Till. The `taskmanager.network.request-backoff.max` option helped in my case. We tried this on 1.5.0 and the jobs are running fine.

--
Thanks,
Amit

On Thu 24 May, 2018, 4:58 PM Amit Jain, wrote:
> Thanks, Till! I'll give your suggestions a try and update the thread.
>
> On Wed, May 23, 2018
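For readers who land on this thread with the same symptom, the option named above goes in `flink-conf.yaml`. The fragment below is only a sketch: the thread does not say which value was used, so the number here is an assumed, illustrative one. The value is in milliseconds, and to my knowledge the default in Flink 1.5 is 10000.

```yaml
# flink-conf.yaml (illustrative value, not taken from the thread)
# Maximum backoff, in milliseconds, for partition requests of input channels.
taskmanager.network.request-backoff.max: 20000
```

TaskManagers read this file at startup, so they would need a restart for the change to take effect.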

Re: Batch job stuck in Canceled state in Flink 1.5

2018-05-24 Thread Amit Jain
Thanks, Till! I'll give your suggestions a try and update the thread.

On Wed, May 23, 2018 at 4:43 AM, Till Rohrmann wrote:
> Hi Amit,
>
> it looks as if the current cancellation cause is not the same as the
> initially reported cancellation cause. In the current case,

Re: Batch job stuck in Canceled state in Flink 1.5

2018-05-22 Thread Nico Kruber
Hi Amit, thanks for providing the logs, I'll look into it. We currently suspect that this is caused by https://issues.apache.org/jira/browse/FLINK-9406, which we found by looking over the surrounding code. RC4 has been cancelled since we see this as a release blocker. To rule out

Re: Batch job stuck in Canceled state in Flink 1.5

2018-05-03 Thread Nico Kruber
Also, please have a look at the other TaskManagers' logs, in particular the one that is running the operator that was mentioned in the exception. You should look out for the ID 98f5976716234236dc69fb0e82a0cc34.

Nico

PS: Flink log files should compress quite nicely if they grow too big :)

On
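A minimal sketch of the two suggestions above: finding which log files mention a given ID, and compressing a large log before sharing it. The directory layout and the helper names `logs_mentioning` and `compress_log` are assumptions made for illustration. Only the ID comes from the thread.

```python
import gzip
import shutil
from pathlib import Path


def logs_mentioning(log_dir, needle):
    """Return the names of files directly under log_dir that contain needle."""
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # Read .gz files through gzip and everything else as plain text.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            if any(needle in line for line in handle):
                hits.append(path.name)
    return hits


def compress_log(path):
    """Write a gzip copy of path next to it and return the new file's path."""
    source = Path(path)
    target = Path(str(source) + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For example, `logs_mentioning("/path/to/flink/log", "98f5976716234236dc69fb0e82a0cc34")` would list the TaskManager logs worth opening first. The path is a placeholder for wherever your logs are kept.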

Re: Batch job stuck in Canceled state in Flink 1.5

2018-05-03 Thread Stephan Ewen
Google Drive would be great. Thanks!

On Thu, May 3, 2018 at 1:33 PM, Amit Jain wrote:
> Hi Stephan,
>
> The JM log file is 122 MB in size. Could you suggest another medium for
> sharing it? We can use Google Drive if that's fine with you.
>
> --
> Thanks,
> Amit
>
> On Thu,

Re: Batch job stuck in Canceled state in Flink 1.5

2018-05-03 Thread Amit Jain
Hi Stephan,

The JM log file is 122 MB in size. Could you suggest another medium for sharing it? We can use Google Drive if that's fine with you.

--
Thanks,
Amit

On Thu, May 3, 2018 at 12:58 PM, Stephan Ewen wrote:
> Hi Amit!
>
> Thanks for sharing this, this looks like a

Re: Batch job stuck in Canceled state in Flink 1.5

2018-05-03 Thread Stephan Ewen
Hi Amit! Thanks for sharing this, this looks like a regression with the network stack changes. The log you shared from the TaskManager gives some hint, but that exception alone should not be a problem. That exception can occur under a race between deployment of some tasks while the whole job is

Re: Batch job stuck in Canceled state in Flink 1.5

2018-05-02 Thread Amit Jain
Thanks, Fabian! I will try using the current release-1.5 branch and update this thread.

--
Thanks,
Amit

On Wed, May 2, 2018 at 3:42 PM, Fabian Hueske wrote:
> Hi Amit,
>
> We recently fixed a bug in the network stack that affected batch jobs
> (FLINK-9144).
> The fix was

Re: Batch job stuck in Canceled state in Flink 1.5

2018-05-02 Thread Fabian Hueske
Hi Amit, We recently fixed a bug in the network stack that affected batch jobs (FLINK-9144). The fix was added after your commit. Do you have a chance to build the current release-1.5 branch and check if the fix also resolves your problem? Otherwise it would be great if you could open a blocker

Re: Batch job stuck in Canceled state in Flink 1.5

2018-04-29 Thread Amit Jain
The cluster is running on commit 2af481a.

On Sun, Apr 29, 2018 at 9:59 PM, Amit Jain wrote:
> Hi,
>
> We are running a number of batch jobs on a Flink 1.5 cluster, and a few of
> them get stuck at random. These jobs hit the following failure, after
> which the operator status

Batch job stuck in Canceled state in Flink 1.5

2018-04-29 Thread Amit Jain
Hi,

We are running a number of batch jobs on a Flink 1.5 cluster, and a few of them get stuck at random. These jobs hit the following failure, after which the operator status changes to CANCELED and stays there. Please find the complete TM log at