Thanks Till, I will start separate threads for the two issues we are experiencing.
Cheers,

Bruno

On Mon, 8 Apr 2019 at 15:27, Till Rohrmann <[email protected]> wrote:

> Hi Bruno,
>
> first of all, good to hear that you could resolve some of the problems.
>
> Slots get removed if a TaskManager gets unregistered from the SlotPool.
> This usually happens if a TaskManager closes its connection or its
> heartbeat with the ResourceManager times out. So you could look for
> messages like "The heartbeat of TaskManager with id ... timed out".
>
> If a JobManager loses its leadership, it could also have something to do
> with the ZooKeeper cluster and the configuration of Flink's ZooKeeper
> client [1], or with your network in general.
>
> For the warnings you see, it's hard to say without the full picture. Could
> `10.10.56.193:43041` be a TaskManager which just died and, hence, cannot
> be connected to anymore?
>
> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#zookeeper-based-ha-mode
>
> Cheers,
> Till
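For reference, the heartbeat and ZooKeeper client settings Till points to are
configured in flink-conf.yaml. A minimal sketch, with the values set to what
the 1.7 documentation lists as defaults (they are shown only as a starting
point and should be checked against [1] and tuned per cluster, not taken as a
recommendation):

    # Heartbeats between JobManager/ResourceManager and TaskManagers.
    # "The heartbeat of ... timed out" messages fire once heartbeat.timeout
    # elapses without an answer. Values below are the documented defaults.
    heartbeat.interval: 10000                                      # ms
    heartbeat.timeout: 50000                                       # ms

    # ZooKeeper client used for HA leader election/retrieval.
    high-availability.zookeeper.client.session-timeout: 60000      # ms
    high-availability.zookeeper.client.connection-timeout: 15000   # ms
    high-availability.zookeeper.client.retry-wait: 5000            # ms
    high-availability.zookeeper.client.max-retry-attempts: 3

If the cluster or network is occasionally slow, loosening these timeouts can
hide the symptom, but the underlying TaskManager or ZooKeeper hiccup is still
the thing to find in the logs.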
> On Mon, Apr 8, 2019 at 12:33 PM Bruno Aranda <[email protected]> wrote:
>
>> Hi Till,
>>
>> Many thanks for your reply, and don't worry. We understand this is tricky
>> and you are busy.
>>
>> We have been experiencing some issues, and a couple of them have been
>> addressed, so the logs probably were not relevant anymore.
>>
>> About losing jobs on restart -> it seems that YARN was killing the
>> container for the master due to it not passing the liveness probe. Since
>> around Flink 1.1 we had been using very stringent liveness probe timeouts
>> in YARN to detect very quickly when a node in the cluster was going out of
>> service. This timeout (30 seconds) was probably killing the job manager
>> before it was able to recover the ~40 streaming jobs that we run in
>> session mode. I wonder why we had not seen that in 1.6, though; probably
>> because of the legacy mode?
>>
>> Extremely high instability -> that was caused by running in DEBUG mode to
>> capture logs; the sheer number of log lines (especially coming from
>> AsyncFunctions) caused the disks to fill and YARN to decommission the
>> nodes. We do process many thousands of messages per second in some of our
>> jobs.
>>
>> We still have a few instances of JobManagers losing leadership every few
>> hours (all of them in the cluster). Another of our jobs restarts more
>> often, but the "Exceptions" tab in the UI for the job just tells us that
>> "The assigned slot XXX was removed". It would be helpful to see why it was
>> removed, though.
>>
>> I am currently looking at those, but the logs don't tell me much (and I
>> cannot run them in this environment with such a low log level anymore).
>> There is only one thing at ERROR level for when the more unstable job
>> restarts:
>>
>> java.util.concurrent.TimeoutException: Remote system has been silent for too long. (more than 48.0 hours)
>>     at akka.remote.ReliableDeliverySupervisor$$anonfun$idle$1.applyOrElse(Endpoint.scala:376)
>>     at akka.actor.Actor.aroundReceive(Actor.scala:502)
>>     at akka.actor.Actor.aroundReceive$(Actor.scala:500)
>>     at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:203)
>>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
>>     at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
>>     at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
>>     at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
>>
>> And for the other issue, all the job managers losing leadership, just some
>> warnings about the association with the remote system failing, i.e.:
>>
>> Remote connection to [null] failed with java.net.ConnectException:
>> Connection refused: ip-10-10-56-193.eu-west-1.compute.internal/10.10.56.193:43041
>> Association with remote system
>> [akka.tcp://[email protected]:43041] has failed,
>> address is now gated for [50] ms. Reason: [Association failed with
>> [akka.tcp://[email protected]:43041]]
>> Caused by: [Connection refused: ip-10-10-56-193.eu-west-1.compute.internal/10.10.56.193:43041]
>>
>> over and over again...
>>
>> Thanks for any insight.
>>
>> Bruno
>>
>> On Mon, 8 Apr 2019 at 10:45, Till Rohrmann <[email protected]> wrote:
>>
>>> Hi Bruno,
>>>
>>> sorry for getting back to you so late. I just tried to access your logs
>>> to investigate the problem, but transfer.sh tells me that they are no
>>> longer there. Could you re-upload them or send them directly to my mail
>>> address? Sorry for not taking a faster look at your problem, and for the
>>> inconvenience with the upload.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Thu, Mar 21, 2019 at 4:30 PM Bruno Aranda <[email protected]> wrote:
>>>
>>>> Ok, here it goes:
>>>>
>>>> https://transfer.sh/12qMre/jobmanager-debug.log
>>>>
>>>> In an attempt to make it smaller, I removed the noisy "http wire"
>>>> entries and masked a couple of things. Not sure this covers everything
>>>> you would like to see.
>>>>
>>>> Thanks!
>>>>
>>>> Bruno
>>>>
>>>> On Thu, 21 Mar 2019 at 15:24, Till Rohrmann <[email protected]> wrote:
>>>>
>>>>> Hi Bruno,
>>>>>
>>>>> could you upload the logs to https://transfer.sh/ or
>>>>> https://gist.github.com/ and then post a link? For further debugging
>>>>> this will be crucial. It would be really good if you could set the log
>>>>> level to DEBUG.
>>>>>
>>>>> Concerning the number of registered TMs: the new mode (not the legacy
>>>>> mode) no longer respects the `-n` setting when you start a YARN
>>>>> session. Instead, it will dynamically start as many containers as are
>>>>> needed to run the submitted jobs. That's why you don't see the spare
>>>>> TM, and this is the expected behaviour.
>>>>>
>>>>> The community intends to add support for ranges of how many TMs must
>>>>> be active at any given time [1].
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-11078
>>>>>
>>>>> Cheers,
>>>>> Till
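To make the dynamic behaviour Till describes concrete: in the non-legacy mode
the only static knob left is how many slots each TaskManager container offers,
set in the session's flink-conf.yaml; the number of containers then follows
from what the submitted jobs actually need. A minimal sketch (the value 6
simply mirrors the slots-per-TM mentioned in the reply below and is purely
illustrative):

    # Each TaskManager container started on YARN offers this many slots.
    # In the new mode, containers are requested on demand, so the old
    # "-n <count>" argument to yarn-session.sh is effectively ignored.
    taskmanager.numberOfTaskSlots: 6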
>>>>> On Thu, Mar 21, 2019 at 1:50 PM Bruno Aranda <[email protected]> wrote:
>>>>>
>>>>>> Hi Andrey,
>>>>>>
>>>>>> Thanks for your response. I was trying to get the logs somewhere, but
>>>>>> they are biggish (~4 MB). Do you suggest somewhere I could put them?
>>>>>>
>>>>>> In any case, I can see exceptions like this:
>>>>>>
>>>>>> 2019/03/18 10:11:50,763 DEBUG
>>>>>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Releasing
>>>>>> slot [SlotRequestId{ab89ff271ebf317a13a9e773aca4e9fb}] because: null
>>>>>> 2019/03/18 10:11:50,807 INFO
>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph - Job
>>>>>> alert-event-beeTrap-notifier (2ff941926e6ad80ba441d9cfcd7d689d) switched
>>>>>> from state RUNNING to FAILING.
>>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>>>> Could not allocate all requires slots within timeout of 300000 ms. Slots
>>>>>> required: 2, slots allocated: 0
>>>>>>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$3(ExecutionGraph.java:991)
>>>>>>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>>>>>     ...
>>>>>>
>>>>>> It looks like a TM may crash, and then the JM. And then the JM is not
>>>>>> able to find slots for the tasks in a reasonable time frame? Weirdly,
>>>>>> we are running 13 TMs with 6 slots each (we used legacy mode in 1.6),
>>>>>> and we always try to keep an extra TM's worth of free slots just in
>>>>>> case. Looking at the dashboard, I see 12 TMs and 2 free slots, but we
>>>>>> tell Flink 13 are available when we start the session in YARN.
>>>>>>
>>>>>> Any ideas? It is way less stable for us these days, without having
>>>>>> changed settings much since we started using Flink around 1.2 some
>>>>>> time back.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Bruno
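As far as I can tell, the 300000 ms in that NoResourceAvailableException is
the slot request timeout, i.e. how long the job waits for the ResourceManager
to provide all required slots before failing. A flink-conf.yaml sketch of how
that window could be widened while the slot loss itself is investigated (the
600000 is an example value, not a recommendation; the second key is shown at
its documented default):

    # How long a job waits for all of its required slots before failing with
    # NoResourceAvailableException (documented default: 300000 ms).
    slot.request.timeout: 600000
    # How long the SlotPool keeps an idle slot before giving it back to the
    # ResourceManager (documented default: 50000 ms).
    slot.idle.timeout: 50000

Raising the request timeout only buys recovery time; it does not explain why
the slots disappeared in the first place.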
>>>>>>
>>>>>> On Tue, 19 Mar 2019 at 17:09, Andrey Zagrebin <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Bruno,
>>>>>>>
>>>>>>> could you also share the job master logs?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Andrey
>>>>>>>
>>>>>>> On Tue, Mar 19, 2019 at 12:03 PM Bruno Aranda <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This is causing serious instability and data loss in our production
>>>>>>>> environment. Any help figuring out what's going on here would be
>>>>>>>> really appreciated.
>>>>>>>>
>>>>>>>> We recently updated our two EMR clusters from Flink 1.6.1 to Flink
>>>>>>>> 1.7.2 (running on AWS EMR). The road to the upgrade was fairly rocky,
>>>>>>>> but we felt it was working sufficiently well in our pre-production
>>>>>>>> environments that we rolled it out to prod.
>>>>>>>>
>>>>>>>> However, we're now seeing the jobmanager crash spontaneously several
>>>>>>>> times a day. There doesn't seem to be any pattern to when this
>>>>>>>> happens - it doesn't coincide with an increase in the data flowing
>>>>>>>> through the system, nor is it at the same time of day.
>>>>>>>>
>>>>>>>> The big problem is that when it recovers, sometimes a lot of the
>>>>>>>> jobs fail to resume with the following exception:
>>>>>>>>
>>>>>>>> org.apache.flink.util.FlinkException: JobManager responsible for
>>>>>>>> 2401cd85e70698b25ae4fb2955f96fd0 lost the leadership.
>>>>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1185)
>>>>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:138)
>>>>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
>>>>>>>>     //...
>>>>>>>> Caused by: java.util.concurrent.TimeoutException: The heartbeat of
>>>>>>>> JobManager with id abb0e96af8966f93d839e4d9395c7697 timed out.
>>>>>>>>     at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1626)
>>>>>>>>     ... 16 more
>>>>>>>>
>>>>>>>> Starting them manually afterwards doesn't resume from the checkpoint,
>>>>>>>> which for most jobs means starting from the end of the source Kafka
>>>>>>>> topic. This means that whenever this surprise jobmanager restart
>>>>>>>> happens, we have a ticking clock during which we're losing data.
>>>>>>>>
>>>>>>>> We speculate that those jobs die first, and while they wait to be
>>>>>>>> restarted (they have a 30-second delay strategy), the job manager
>>>>>>>> restarts and does not recover them? In any case, we have never seen
>>>>>>>> so many job failures and JM restarts with exactly the same EMR
>>>>>>>> config.
>>>>>>>>
>>>>>>>> We've got some functionality we're building that uses the
>>>>>>>> StreamingFileSink over S3 bugfixes in 1.7.2, so rolling back isn't an
>>>>>>>> ideal option.
>>>>>>>>
>>>>>>>> Looking through the mailing list, we found
>>>>>>>> https://issues.apache.org/jira/browse/FLINK-11843 - does it seem
>>>>>>>> possible this might be related?
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Bruno
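For context on the recovery behaviour described in the original message: a
restarted JobManager can only pick jobs up from their last checkpoint if
ZooKeeper HA is configured for the session, and a job resubmitted by hand only
resumes state if a retained (externalized) checkpoint exists to pass to
"flink run -s <path-to-checkpoint>". A sketch of the flink-conf.yaml settings
usually involved - the quorum, paths and attempt count are placeholders, most
of this will already be set on a cluster that uses ZooKeeper HA, and the jobs
themselves still need to enable externalized checkpoints in their
CheckpointConfig:

    # ZooKeeper-based HA so a restarted JobManager can recover running jobs.
    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181  # placeholder
    high-availability.storageDir: s3://my-bucket/flink/ha/             # placeholder
    # Where retained checkpoint metadata is written, so a job can be
    # restarted manually with: flink run -s <path-to-checkpoint>
    state.checkpoints.dir: s3://my-bucket/flink/checkpoints/           # placeholder
    # Allow YARN to restart the application master more than once.
    yarn.application-attempts: 10                                      # illustrative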
