Thanks, Ken, for the info. This is something I have done when running Spark batch jobs. However, in this case I really want to understand whether there is anything wrong with the job itself. Does the Flink Kafka consumer or some other piece need more memory than I am allocating?
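
For reference, a rough back-of-the-envelope reconstruction of how those TM JVM flags relate to the configured values quoted below, assuming the Flink 1.9 legacy (pre-1.10) memory model with managed memory allocated off-heap (an assumption on my part; the couple of MB difference from the reported flags looks like rounding):

```python
# Rough reconstruction of the Flink 1.9 (legacy, pre-1.10) TaskManager memory
# split on YARN, using the values quoted below. The off-heap managed memory
# assumption is mine; actual flag values differ by ~2 MB due to rounding.

container_mb = 3 * 1024      # task manager memory requested from YARN (3 GB)
cutoff_mb    = 270           # containerized heap cutoff (from flink-conf)
network_mb   = 819           # network memory
managed_mb   = 28            # managed memory, assumed off-heap here

# JVM heap is what remains after the off-heap pieces are carved out.
heap_mb = container_mb - cutoff_mb - network_mb - managed_mb

# The direct memory limit covers network buffers, managed memory and the cutoff.
direct_mb = cutoff_mb + network_mb + managed_mb

print(f"-Xms{heap_mb}m -Xmx{heap_mb}m -XX:MaxDirectMemorySize={direct_mb}m")
# Reported flags: -Xms1957m -Xmx1957m -XX:MaxDirectMemorySize=1115m

# Note that heap + direct adds up to the full 3 GB container.
print(heap_mb + direct_mb, "MB ==", container_mb, "MB")
```

If that reading is right, the heap and direct limits together already account for the full 3GB container, which is exactly why I'm trying to understand where the extra physical memory YARN sees is coming from.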
Hemant

On Fri, Jul 2, 2021 at 9:57 PM Ken Krugler <kkrugler_li...@transpac.com> wrote:

> When we run Flink jobs in EMR (typically batch, though) we disable the
> pmem (physical memory) and vmem (virtual memory) checks.
>
> This was initially done for much older versions of Flink (1.6???), where
> the memory model wasn't so well documented or understood by us.
>
> But I think the pmem check might still have an issue, due to Flink's use
> of off-heap memory.
>
> So something like:
>
> [
>   {
>     "classification": "yarn-site",
>     "properties": {
>       "yarn.nodemanager.pmem-check-enabled": "false",
>       "yarn.nodemanager.vmem-check-enabled": "false"
>     }
>   }
> ]
>
> …might help.
>
> — Ken
>
>
> On Jul 2, 2021, at 8:36 AM, bat man <tintin0...@gmail.com> wrote:
>
> Hi,
>
> I am running a streaming job (Flink 1.9) on EMR on YARN. The Flink web UI and
> the metrics reported from Prometheus show total memory usage within the
> specified task manager memory - 3GB.
>
> The metrics show the numbers below (in MB) -
> Heap - 577
> Non Heap - 241
> DirectMemoryUsed - 852
>
> Non-heap does rise gradually, starting around 210MB and reaching 241 when
> YARN kills the container. Heap fluctuates between 1.x and .6GB;
> DirectMemoryUsed is constant at 852.
>
> Based on the configuration, these are the TM params from the YARN logs -
> -Xms1957m -Xmx1957m -XX:MaxDirectMemorySize=1115m
>
> These are the other params configured in flink-conf -
> yarn cutoff - 270MB
> Managed memory - 28MB
> Network memory - 819MB
>
> The memory values above are from around the same time the container is
> killed by YARN with - <container-xxx> is running beyond physical memory limits.
>
> Is there anything else which is not reported by Flink in the metrics, or have
> I been misinterpreting them? As seen above, the total memory consumed is
> below 3GB.
>
> The same behavior occurs when I run the job with 2GB, 2.7GB and now 3GB
> task memory. My job does have shuffles, as data from one operator is sent
> to 4 other operators after filtering.
>
> One more thing: I am running this with 3 YARN containers (2 tasks in each
> container), with a total parallelism of 6. As soon as one container fails with
> this error, the job restarts. However, within minutes the other 2 containers
> also fail with the same error, one by one.
>
> Thanks,
> Hemant
>
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> Custom big data solutions
> Flink, Pinot, Solr, Elasticsearch
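
As a usage note on the yarn-site classification Ken suggests above: one hedged sketch of how it could be supplied at cluster creation time through the EMR API (via boto3's run_job_flow). The cluster name, release label, roles and instance settings below are placeholders, not values from this thread:

```python
import boto3

# Sketch only: passing the yarn-site classification from Ken's reply when
# creating the EMR cluster. Everything except the classification itself is
# a placeholder.
emr = boto3.client("emr", region_name="us-east-1")

yarn_site_overrides = [
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.nodemanager.pmem-check-enabled": "false",
            "yarn.nodemanager.vmem-check-enabled": "false",
        },
    }
]

response = emr.run_job_flow(
    Name="flink-streaming-cluster",      # placeholder name
    ReleaseLabel="emr-5.30.0",           # placeholder EMR release
    Applications=[{"Name": "Flink"}],
    Configurations=yarn_site_overrides,  # the overrides discussed above
    Instances={
        "MasterInstanceType": "m5.xlarge",   # placeholder sizing
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```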