Re: Flink Task Manager GC overhead limit exceeded

Xintong Song Sun, 03 May 2020 20:22:23 -0700

https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/mem_setup.html


Thank you~

Xintong Song



On Fri, May 1, 2020 at 8:35 AM shao.hongxiao <[email protected]> wrote:

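> Hi Song,
> "Please refer to this document [1] for more details"
> Could you send the actual link? I have also noticed that the memory
> parameters shown in the Flink UI do not look right, and I would like to read
> the relevant explanation carefully.
>
> Thanks!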
>
>
>
>
> 邵红晓
>
> On 04/30/2020 12:08, Xintong Song wrote:
> Then I would suggest the following.
> - Check the task manager log to see if the '-D' properties are properly
> loaded. They should be located at the beginning of the log file.
> - You can also try to log into the pod and check the JVM launch command
> with "ps -ef | grep TaskManagerRunner". I suspect there might be some
> argument passing problem regarding the spaces and double quotation marks.
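
To illustrate the suspected quoting problem, here is a minimal Python sketch. It is not taken from Flink's scripts, and the variable names are hypothetical. It compares shell-style parsing of the argument from the original post with a naive whitespace split. Whether the container entrypoint actually splits the argument this way is an assumption that the "ps -ef" check above should confirm or rule out.

```python
import shlex

# Argument as written in the pod spec from the original post.
raw_arg = ('-Denv.java.opts.taskmanager='
           '"-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/oom.bin"')

# Shell-style parsing: the quotes group both JVM flags into one argument
# and are removed from the result.
shell_like = shlex.split(raw_arg)

# Naive whitespace split (assumed failure mode): the value is cut in two
# and the literal double quotes stay inside the tokens.
naive = raw_arg.split()

print(len(shell_like), shell_like)
print(len(naive), naive)
```

If the launch command in the pod looks like the second form, the JVM would not receive the two options as intended.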
>
>
>
>
>
> Thank you~
>
> Xintong Song
>
>
>
>
>
> On Thu, Apr 30, 2020 at 11:39 AM Eleanore Jin <[email protected]>
> wrote:
>
> Hi Xintong,
>
>
> Thanks for the detailed explanation!
>
>
> As for the 2nd question: I mount it to an emptyDir. I assume a pod restart
> will not cause the pod to be rescheduled to another node, so the dump should
> stay? I verified this by adding the following directly to flink-conf.yaml, in
> which case the heap dump is taken and stays in the directory: env.java.opts:
> -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps
>
>
> In addition, with the '-D' args I don't see a log line like: Heap dump
> file created [5220997112 bytes in 73.464 secs], which I do see when adding
> the options directly in flink-conf.yaml.
>
>
> containers:
>   - volumeMounts:
>       - mountPath: /dumps
>         name: heap-dumps
> volumes:
>   - emptyDir: {}
>     name: heap-dumps
>
>
>
>
> Thanks a lot!
>
> Eleanore
>
>
>
> On Wed, Apr 29, 2020 at 7:55 PM Xintong Song <[email protected]>
> wrote:
>
> Hi Eleanore,
>
>
> I'd like to explain about 1 & 2. For 3, I have no idea either.
>
>
>
> 1. I don't see the heap size for the task manager shown correctly in the UI
>
>
>
> Despite the 'heap' in the key, 'taskmanager.heap.size' accounts for the
> total memory of a Flink task manager, rather than only the heap memory. A
> Flink task manager process consumes not only java heap memory, but also
> direct memory (e.g., network buffers) and native memory (e.g., JVM
> overhead). That's why the JVM heap size shown on the UI is much smaller
> than the configured 'taskmanager.heap.size'. Please refer to this document
> [1] for more details. This document comes from Flink 1.9 and has not been
> back-ported to 1.8, but the contents should apply to 1.8 as well.
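
As a rough illustration of why the UI shows a smaller heap, the Python sketch below subtracts only the network buffer memory from the configured total. The fraction 0.1 and the 64 MB / 1 GB bounds are assumed to be the defaults of 'taskmanager.network.memory.fraction', 'taskmanager.network.memory.min', and 'taskmanager.network.memory.max'. The function name is hypothetical, and Flink may deduct further portions (for example off-heap managed memory), so treat the result as an estimate rather than Flink's exact calculation.

```python
def estimate_heap_mb(total_mb, fraction=0.1, min_mb=64, max_mb=1024):
    """Estimate the JVM heap (MB) left after reserving network buffer memory.

    Assumption: only network buffers are deducted from the configured total.
    """
    # Network memory is a fraction of the total, clamped to [min_mb, max_mb].
    network_mb = int(total_mb * fraction)
    network_mb = max(min_mb, min(network_mb, max_mb))
    return total_mb - network_mb


# Total configured in the original post: -Dtaskmanager.heap.size=4096m
print(estimate_heap_mb(4096))
```

Under these assumptions the heap is noticeably smaller than the configured 4096m, which matches the direction of the difference seen on the UI.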
>
>
> 2. I don't see the heap dump file /dumps/oom.bin in the restarted pod. Did
> I set the java opts wrong?
>
>
>
> The java options look good to me. Is the configured path '/dumps/oom.bin'
> a local path of the pod or a path of the host mounted onto the pod? The
> restarted pod is a completely new pod. Everything you write to the old pod
> goes away when that pod terminates, unless it is written to the host
> through mounted storage.
>
>
>
> Thank you~
>
> Xintong Song
>
>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/mem_setup.html
>
>
> On Thu, Apr 30, 2020 at 7:41 AM Eleanore Jin <[email protected]>
> wrote:
>
> Hi All,
>
>
>
> Currently I am running a flink job cluster (v1.8.2) on kubernetes with 4
> pods, each pod with a parallelism of 4.
>
>
> The flink job reads from a source topic with 96 partitions and applies a
> per-element filter. The filter value comes from a broadcast topic, and the
> job always uses the latest message as the filter criterion. It then
> publishes to a sink topic.
>
>
> There is no checkpointing or state involved.
>
>
> I then see the 'GC overhead limit exceeded' error continuously, and the
> pods keep restarting.
>
>
> So I tried to increase the heap size for the task manager with:
>
> containers:
>   - args:
>       - task-manager
>       - -Djobmanager.rpc.address=service-job-manager
>       - -Dtaskmanager.heap.size=4096m
>       - -Denv.java.opts.taskmanager="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps/oom.bin"
>
>
>
>
> Three things I noticed:
>
>
>
>
> 1. I don't see the heap size for the task manager shown correctly in the UI
>
>
>
>
>
> 2. I don't see the heap dump file /dumps/oom.bin in the restarted pod. Did
> I set the java opts wrong?
>
>
> 3. I continuously see the log below from all pods; I am not sure if it
> causes any issue:
> {"@timestamp":"2020-04-29T23:39:43.387Z","@version":"1","message":"[Consumer
> clientId=consumer-1, groupId=aba774bc] Node 6 was unable to process the
> fetch request with (sessionId=2054451921, epoch=474):
> FETCH_SESSION_ID_NOT_FOUND.","logger_name":"org.apache.kafka.clients.FetchSessionHandler","thread_name":"pool-6-thread-1","level":"INFO","level_value":20000}
>
>
> Thanks a lot for any help!
>
>
> Best,
> Eleanore
