Hello,

The next issue in the string of things I'm solving is that my application
fails with the message 'Connection unexpectedly closed by remote task manager'.

The YARN log shows the following:

Container [pid=4102,containerID=container_1461341357870_0004_01_000015] is
running beyond physical memory limits. Current usage: 2.5 GB of 2.5 GB
physical memory used; 9.0 GB of 12.3 GB virtual memory used. Killing
container.
Dump of the process-tree for container_1461341357870_0004_01_000015 :
        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
        |- 4102 4100 4102 4102 (bash) 1 7 115806208 715 /bin/bash -c
/usr/lib/jvm/java-1.8.0/bin/java -Xms570m -Xmx570m
-XX:MaxDirectMemorySize=1900m
-Dlog.file=/var/log/hadoop-yarn/containers/application_1461341357870_0004/container_1461341357870_0004_01_000015/taskmanager.log
-Dlogback.configurationFile=file:logback.xml
-Dlog4j.configuration=file:log4j.properties
org.apache.flink.yarn.YarnTaskManagerRunner --configDir . 1>
/var/log/hadoop-yarn/containers/application_1461341357870_0004/container_1461341357870_0004_01_000015/taskmanager.out
2>
/var/log/hadoop-yarn/containers/application_1461341357870_0004/container_1461341357870_0004_01_000015/taskmanager.err
        |- 4306 4102 4102 4102 (java) 172258 40265 9495257088 646460
/usr/lib/jvm/java-1.8.0/bin/java -Xms570m -Xmx570m
-XX:MaxDirectMemorySize=1900m
-Dlog.file=/var/log/hadoop-yarn/containers/application_1461341357870_0004/container_1461341357870_0004_01_000015/taskmanager.log
-Dlogback.configurationFile=file:logback.xml
-Dlog4j.configuration=file:log4j.properties
org.apache.flink.yarn.YarnTaskManagerRunner --configDir .
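
To make the RSSMEM_USAGE(PAGES) column in the dump above easier to read, here
is a rough Python sketch that converts the page counts to bytes. The 4096-byte
page size is my assumption (I have not checked it on the node), and the
variable names are only for illustration.

```python
# Rough conversion of the RSSMEM_USAGE(PAGES) values from the process-tree dump.
# ASSUMPTION: the node uses 4096-byte memory pages (not verified on the cluster).
PAGE_SIZE_BYTES = 4096

bash_rss_pages = 715      # PID 4102 (bash), taken from the dump
java_rss_pages = 646460   # PID 4306 (java), taken from the dump

java_rss_bytes = java_rss_pages * PAGE_SIZE_BYTES
total_rss_bytes = (bash_rss_pages + java_rss_pages) * PAGE_SIZE_BYTES
total_rss_gib = total_rss_bytes / 2**30

print(f"java RSS:  {java_rss_bytes} bytes")
print(f"total RSS: {total_rss_bytes} bytes ({total_rss_gib:.2f} GiB)")
```

Under that assumption the resident memory of the process tree comes out just
under 2.5 GiB, which is consistent with the "2.5 GB of 2.5 GB" line in the log.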

One thing that drew my attention is `-Xmx570m`. I expected it to be
TaskManagerMemory*0.75 (due to yarn.heap-cutoff-ratio). I run the
application as follows:
HADOOP_CONF_DIR=/etc/hadoop/conf flink run -m yarn-cluster -yn 18 -yjm 4096
-ytm 2500 eval-assembly-1.0.jar

In the Flink logs I do see 'Task Manager memory: 2500'. When I look at the
YARN container logs on the cluster node, I see that the JVM starts with a
570 MB heap, which puzzles me. When I look at the memory actually used by the
YARN container with 'top', I see 2.2 GB used. Am I interpreting these
parameters correctly?
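
To spell out the mismatch, here is the arithmetic as a small Python sketch. It
only encodes my own expectation (heap = TaskManagerMemory * 0.75, i.e. a cutoff
ratio of 0.25) next to the values from the java command line in the YARN log;
it is not meant to describe how Flink actually derives these flags, and the
function name is hypothetical.

```python
# My expectation versus the JVM flags seen in the container's command line.
# ASSUMPTION: heap = task manager memory * (1 - cutoff ratio), with ratio 0.25.
def expected_heap_mb(tm_memory_mb, cutoff_ratio=0.25):
    """Heap size I expected, in MB (hypothetical helper, not a Flink API)."""
    return int(tm_memory_mb * (1 - cutoff_ratio))

tm_memory_mb = 2500          # from -ytm 2500
observed_heap_mb = 570       # from -Xms570m -Xmx570m
observed_direct_mb = 1900    # from -XX:MaxDirectMemorySize=1900m

expected_heap = expected_heap_mb(tm_memory_mb)
observed_total = observed_heap_mb + observed_direct_mb

print(f"expected heap:          {expected_heap} MB")
print(f"observed heap:          {observed_heap_mb} MB")
print(f"observed heap + direct: {observed_total} MB of {tm_memory_mb} MB")
```

So the heap is far below what I expected, while heap plus direct memory
together come close to the full 2500 MB requested for the container.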

I have also set the following (it failed in the same way without it as well):
taskmanager.memory.off-heap: true

Also, I don't understand why this happens at all. I assumed that Flink
would not overcommit its allocated resources and would spill to disk when
running out of heap memory. I would appreciate it if someone could shed light
on this too.

Thanks,
Timur
