Thanks Ken for the info. This is something I have done when running
Spark batch jobs. However, in this case I really want to understand whether
there is anything wrong with the job itself, i.e. whether the Flink Kafka
consumer or some other component needs more memory than I am allocating.

Hemant

On Fri, Jul 2, 2021 at 9:57 PM Ken Krugler <kkrugler_li...@transpac.com>
wrote:

> When we run Flink jobs in EMR (typically batch, though) we disable the
> pmem (physical memory) and vmem (virtual memory) checks.
>
> This was initially done for much older versions of Flink (1.6???), where
> the memory model wasn’t so well documented or understood by us.
>
> But I think the pmem check might still have an issue, due to Flink’s use
> of off-heap memory.
>
> So something like:
>
> [
>     {
>         "Classification": "yarn-site",
>         "Properties": {
>             "yarn.nodemanager.pmem-check-enabled": "false",
>             "yarn.nodemanager.vmem-check-enabled": "false"
>         }
>     }
> ]
>
>
> …might help.
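>
> For reference, a minimal sketch of applying such a classification at cluster
> creation time with the AWS CLI (assuming the JSON above is saved as
> configurations.json; the release label and instance settings are placeholders,
> not a recommendation):
>
> # Hypothetical example: apply the yarn-site overrides when creating the cluster
> aws emr create-cluster \
>     --name "flink-streaming" \
>     --release-label emr-5.29.0 \
>     --applications Name=Flink \
>     --instance-type m5.xlarge \
>     --instance-count 3 \
>     --use-default-roles \
>     --configurations file://configurations.json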
>
> — Ken
>
>
> On Jul 2, 2021, at 8:36 AM, bat man <tintin0...@gmail.com> wrote:
>
> Hi,
>
> I am running a streaming job (Flink 1.9) on EMR on YARN. The Flink web UI and
> the metrics reported from Prometheus show total memory usage within the
> specified task manager memory of 3 GB.
>
> Metrics show the numbers below (in MB) -
> Heap - 577
> Non Heap - 241
> DirectMemoryUsed - 852
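>
> (For context, these numbers are read from the TM JVM metrics exposed via the
> Prometheus reporter - if I recall the names correctly, something like:
>
> flink_taskmanager_Status_JVM_Memory_Heap_Used
> flink_taskmanager_Status_JVM_Memory_NonHeap_Used
> flink_taskmanager_Status_JVM_Memory_Direct_MemoryUsed
>
> which correspond to Status.JVM.Memory.Heap.Used, NonHeap.Used and
> Direct.MemoryUsed.)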
>
> Non-heap rises gradually, starting around 210 MB and reaching 241 MB when
> YARN kills the container. Heap fluctuates between roughly 0.6 and 1.x GB, and
> DirectMemoryUsed is constant at 852 MB.
>
> Based on the configuration, these are the TM JVM params from the YARN logs -
> -Xms1957m -Xmx1957m -XX:MaxDirectMemorySize=1115m
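>
> (As a sanity check on how these flags relate to the 3 GB container: the 1957 MB
> heap plus the 1115 MB MaxDirectMemorySize already add up to 3072 MB, i.e. the
> full container size, before counting metaspace, thread stacks and other native
> allocations.)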
>
> These are the other params configured in flink-conf -
> YARN cut-off - 270 MB
> Managed memory - 28 MB
> Network memory - 819 MB
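>
> For completeness, a rough sketch of how values like these are expressed in
> flink-conf.yaml under the legacy 1.9 memory model (illustrative keys and values
> based on my understanding, not a copy of my actual config):
>
> # total TM memory requested per YARN container (despite the key name)
> taskmanager.heap.size: 3072m
> # containerized cut-off reserved for non-JVM overhead
> containerized.heap-cutoff-min: 270
> containerized.heap-cutoff-ratio: 0.08
> # network (shuffle) buffers, allocated as direct memory
> taskmanager.network.memory.min: 819mb
> taskmanager.network.memory.max: 819mb
> # managed memory (used mainly by batch operators in 1.9)
> taskmanager.memory.size: 28m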
>
> The above memory values are from around the time the container was killed by
> YARN with: <container-xxx> is running beyond physical memory limits.
>
> Is there anything else that is not reported by Flink in the metrics, or have I
> been misinterpreting them? As seen above, the total memory consumed is below
> 3 GB.
>
> The same behavior shows up when I run the job with 2 GB, 2.7 GB and now 3 GB of
> task manager memory. My job does have shuffles, as data from one operator is
> sent to 4 other operators after filtering.
>
> One more thing: I am running this with 3 YARN containers (2 tasks in each
> container), with a total parallelism of 6. As soon as one container fails with
> this error, the job restarts. However, within minutes the other 2 containers
> also fail with the same error, one by one.
>
> Thanks,
> Hemant
>
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> Custom big data solutions
> Flink, Pinot, Solr, Elasticsearch
>
>
>
>
