Great to hear that you could figure things out, Steven.

You are right. The death watch is no longer linked to the Akka ask timeout
because of FLINK-6495. Thanks for the feedback. I will correct the
documentation.
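
For reference, the relevant settings now look roughly like this (a sketch;
the exact keys and defaults are in the configuration docs for your Flink
version):

  # flink-conf.yaml
  akka.ask.timeout: 60 s                 # timeout for Akka ask/RPC calls
  akka.watch.heartbeat.interval: 10 s    # death watch heartbeat interval
  akka.watch.heartbeat.pause: 60 s       # tolerated pause before a peer is marked dead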

Cheers,
Till

On Sat, Sep 23, 2017 at 10:24 AM, Steven Wu <stevenz...@gmail.com> wrote:

> Just to close the thread: the Akka death watch was triggered by high GC
> pauses, which were caused by a memory leak in our code during Flink job
> restarts.
>
> Also noted that akka.ask.timeout wasn't actually related to the Akka death
> watch, even though the Flink documentation had linked the two.
>
> On Sat, Aug 26, 2017 at 10:58 AM, Steven Wu <stevenz...@gmail.com> wrote:
>
>> This is a stateless job, so we don't use RocksDB.
>>
>> Yeah, the network could also be a possibility; I will keep it on the radar.
>> Unfortunately, our metrics system doesn't have the TCP metrics when running
>> inside containers.
>>
>> On Fri, Aug 25, 2017 at 2:09 PM, Robert Metzger <rmetz...@apache.org>
>> wrote:
>>
>>> Hi,
>>> are you using the RocksDB state backend already?
>>> Maybe writing the state to disk would actually reduce the pressure on
>>> the GC (but of course it'll also reduce throughput a bit).
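>>>
>>> Switching the backend is mostly a config change; roughly something like
>>> this (a sketch with a placeholder checkpoint path; please check the
>>> configuration docs of your Flink version for the exact keys):
>>>
>>>   state.backend: rocksdb
>>>   state.backend.fs.checkpointdir: hdfs:///flink/checkpoints   # placeholder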
>>>
>>> Are there any known issues with the network? Maybe the network bursts on
>>> restart cause the timeouts?
>>>
>>>
>>> On Fri, Aug 25, 2017 at 6:17 PM, Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> Bowen,
>>>>
>>>> Heap size is ~50 GB. CPU was actually pretty low (like <20%) when the high
>>>> GC pauses and Akka timeouts were happening, so maybe memory allocation and
>>>> GC weren't really the issue. I also recently learned that the JVM can pause
>>>> while writing the GC log if disk I/O is slow; that is another lead I am
>>>> pursuing.
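>>>>
>>>> In case it's useful, a sketch of the HotSpot flags (Java 8; the log path
>>>> is a placeholder) that can surface and mitigate this kind of pause:
>>>>
>>>>   -Xloggc:/dev/shm/gc.log              # write the GC log to tmpfs to avoid disk I/O stalls
>>>>   -XX:+PrintGCDetails -XX:+PrintGCDateStamps
>>>>   -XX:+PrintGCApplicationStoppedTime   # total stopped time, not just GC time
>>>>   -XX:+PrintSafepointStatistics        # shows where safepoint time goes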
>>>>
>>>> Thanks,
>>>> Steven
>>>>
>>>> On Wed, Aug 23, 2017 at 10:58 AM, Bowen Li <bowen...@offerupnow.com>
>>>> wrote:
>>>>
>>>>> Hi Steven,
>>>>>     Yes, GC is a big overhead; it may cause your CPU utilization to
>>>>> reach 100% and every process to stop working. We ran into this a while
>>>>> ago too.
>>>>>
>>>>>     How much memory did you assign to the TaskManager? What was your
>>>>> CPU utilization when your TaskManager was considered 'killed'?
>>>>>
>>>>> Bowen
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 23, 2017 at 10:01 AM, Steven Wu <stevenz...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Till,
>>>>>>
>>>>>> Once our job is restarted for some reason (e.g. a taskmanager
>>>>>> container got killed), it can get stuck in a continuous restart loop for
>>>>>> hours. Right now, I suspect it is caused by GC pauses during restart; our
>>>>>> job has very high memory allocation in steady state. High GC pauses then
>>>>>> cause Akka timeouts, which then cause the jobmanager to think the
>>>>>> taskmanager containers are unhealthy/dead and kill them. And the cycle
>>>>>> repeats...
>>>>>>
>>>>>> But I haven't been able to prove or disprove it yet. When I was asking
>>>>>> the question, I was still sifting through metrics and error logs.
>>>>>>
>>>>>> Thanks,
>>>>>> Steven
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 22, 2017 at 1:21 AM, Till Rohrmann <
>>>>>> till.rohrm...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Steven,
>>>>>>>
>>>>>>> quick correction for Flink 1.2: indeed, the MetricFetcher does not
>>>>>>> pick up the right timeout value from the configuration. Instead it uses
>>>>>>> a hardcoded 10 s timeout. This has only been changed recently and is
>>>>>>> already committed to master, so with the next release, 1.4, it will
>>>>>>> properly pick up the configured timeout.
>>>>>>>
>>>>>>> Just out of curiosity, what's the instability issue you're observing?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Fri, Aug 18, 2017 at 7:07 PM, Steven Wu <stevenz...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Till/Chesnay, thanks for the answers. Looks like this is a
>>>>>>>> result/symptom of an underlying stability issue that I am trying to
>>>>>>>> track down.
>>>>>>>>
>>>>>>>> It is Flink 1.2.
>>>>>>>>
>>>>>>>> On Fri, Aug 18, 2017 at 12:24 AM, Chesnay Schepler <
>>>>>>>> ches...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> The MetricFetcher always uses the default Akka timeout value.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 18.08.2017 09:07, Till Rohrmann wrote:
>>>>>>>>>
>>>>>>>>> Hi Steven,
>>>>>>>>>
>>>>>>>>> I thought that the MetricFetcher picks up the right timeout from
>>>>>>>>> the configuration. Which version of Flink are you using?
>>>>>>>>>
>>>>>>>>> The timeout is not a critical problem for the job health.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
>>>>>>>>>
>>>>>>>>> On Fri, Aug 18, 2017 at 7:22 AM, Steven Wu <stevenz...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We have set akka.ask.timeout to 60 s in the yaml file. I also
>>>>>>>>>> confirmed the setting in the Flink UI, but I saw an Akka timeout of
>>>>>>>>>> 10 s for the metric query service. Two questions:
>>>>>>>>>> 1) Why doesn't the metric query use the 60 s value configured in the
>>>>>>>>>> yaml file? Does it always use the default 10 s value?
>>>>>>>>>> 2) Could this cause heartbeat failures between the task manager and
>>>>>>>>>> the job manager? Or is this just a non-critical failure that won't
>>>>>>>>>> affect job health?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Steven
>>>>>>>>>>
>>>>>>>>>> 2017-08-17 23:34:33,421 WARN  org.apache.flink.runtime.webmonitor.metrics.MetricFetcher
>>>>>>>>>>   - Fetching metrics failed.
>>>>>>>>>> akka.pattern.AskTimeoutException: Ask timed out on
>>>>>>>>>> [Actor[akka.tcp://flink@1.2.3.4:39139/user/MetricQueryService_23cd9db754bb7d123d80e6b1c0be21d6]]
>>>>>>>>>> after [10000 ms]
>>>>>>>>>>   at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
>>>>>>>>>>   at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>>>>>>>>>>   at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
>>>>>>>>>>   at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>>>>>>>>   at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
>>>>>>>>>>   at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
>>>>>>>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
>>>>>>>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
>>>>>>>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
>>>>>>>>>>   at java.lang.Thread.run(Thread.java:748)
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
