Re: Flink job didn't restart when a task failed

2020-04-17 Thread Till Rohrmann
>> -Bruce >> -- >> *From:* Zhu Zhu >> *Date:* Monday, April 13, 2020 at 9:29 PM >> *To:* Till Rohrmann >> *Cc:* Aljoscha Krettek, user, Gary Yao >> *Subject:* Re: Fli

Re: Flink job didn't restart when a task failed

2020-04-14 Thread Zhu Zhu
> -Bruce > -- > *From:* Zhu Zhu > *Date:* Monday, April 13, 2020 at 9:29 PM > *To:* Till Rohrmann > *Cc:* Aljoscha Krettek, user, Gary Yao > *Subject:* Re: Flink job didn't restart when a task failed

Re: Flink job didn't restart when a task failed

2020-04-14 Thread Hanson, Bruce
ser, Gary Yao Subject: Re: Flink job didn't restart when a task failed Sorry for not following this ML earlier. I think the cause might be that the final state ('FAILED') update message to JM is lost. TaskExecutor will simply fail the task (which does not take effect in th

Re: Flink job didn't restart when a task failed

2020-04-13 Thread Zhu Zhu
Sorry for not following this ML earlier. I think the cause might be that the final state ('FAILED') update message to JM is lost. TaskExecutor will simply fail the task (which does not take effect in this case since the task is already FAILED) and will not update the task state again in this case.
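The failure mode suggested above can be illustrated with a toy model. This is only a sketch of the reasoning, not Flink code: the names JobManagerView, TaskExecutorModel, drop_updates, restart_requested and reconcile are all hypothetical, and the reconcile step is just one possible shape for the kind of reconciliation tracked in FLINK-17075.

```python
from enum import Enum


class State(Enum):
    RUNNING = "RUNNING"
    FAILED = "FAILED"


class JobManagerView:
    """Toy stand-in for the JM's view of one task (hypothetical, not a Flink class)."""

    def __init__(self):
        self.task_state = State.RUNNING
        self.restart_requested = False

    def on_state_update(self, state):
        self.task_state = state
        if state is State.FAILED:
            # The JM can only react to a failure it has actually heard about.
            self.restart_requested = True


class TaskExecutorModel:
    """Toy stand-in for the TaskExecutor side (hypothetical, not a Flink class)."""

    def __init__(self, jm, drop_updates=False):
        self.jm = jm
        self.drop_updates = drop_updates  # simulates the lost update message
        self.task_state = State.RUNNING

    def fail_task(self):
        if self.task_state is State.FAILED:
            # Already in a final state: nothing changes and no update is re-sent.
            return False
        self.task_state = State.FAILED
        if not self.drop_updates:
            self.jm.on_state_update(State.FAILED)
        return True


def reconcile(jm, te):
    """Hypothetical repair step: adopt the executor's state when the two views differ."""
    if jm.task_state is not te.task_state:
        jm.on_state_update(te.task_state)


jm = JobManagerView()
te = TaskExecutorModel(jm, drop_updates=True)
te.fail_task()  # the task fails locally, but the update never reaches the JM
te.fail_task()  # a later fail request has no effect and sends nothing
print(jm.task_state, te.task_state, jm.restart_requested)
```

In this model the two sides stay out of sync indefinitely once the single update is dropped, which would match a job that never restarts. Calling reconcile(jm, te) afterwards brings the JM's view back in line and lets it request the restart.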

Re: Flink job didn't restart when a task failed

2020-04-09 Thread Till Rohrmann
For future reference, here is the issue to track the reconciliation logic [1]. [1] https://issues.apache.org/jira/browse/FLINK-17075 Cheers, Till On Thu, Apr 9, 2020 at 6:47 PM Till Rohrmann wrote: > Hi Bruce, > > what you are describing sounds indeed quite bad. Quite hard to say whether > we

Re: Flink job didn't restart when a task failed

2020-04-09 Thread Till Rohrmann
Hi Bruce, what you are describing sounds indeed quite bad. Quite hard to say whether we fixed such an issue in 1.10. It is definitely worth a try to upgrade, though. In order to further debug the problem, it would be really great if you could provide us with the log files of the JobMaster and the

Re: Flink job didn't restart when a task failed

2020-04-09 Thread Aljoscha Krettek
Hi, this indeed seems very strange! @Gary Could you maybe have a look at this since you work/worked quite a bit on the scheduler? Best, Aljoscha On 09.04.20 05:46, Hanson, Bruce wrote: Hello Flink folks: We had a problem with a Flink job the other day that I haven’t seen before. One task

Flink job didn't restart when a task failed

2020-04-08 Thread Hanson, Bruce
Hello Flink folks: We had a problem with a Flink job the other day that I haven’t seen before. One task encountered a failure and switched to FAILED (see the full exception below). After the failure, the task said it was notifying the Job Manager: 2020-04-06 08:21:04.329 [flink-akka.actor.defau