Hello,

The next issue in the string of problems I'm working through is that my application fails with the message 'Connection unexpectedly closed by remote task manager'.
The YARN log shows the following:

    Container [pid=4102,containerID=container_1461341357870_0004_01_000015] is running beyond physical memory limits.
    Current usage: 2.5 GB of 2.5 GB physical memory used; 9.0 GB of 12.3 GB virtual memory used. Killing container.
    Dump of the process-tree for container_1461341357870_0004_01_000015:
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 4102 4100 4102 4102 (bash) 1 7 115806208 715 /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Xms570m -Xmx570m -XX:MaxDirectMemorySize=1900m -Dlog.file=/var/log/hadoop-yarn/containers/application_1461341357870_0004/container_1461341357870_0004_01_000015/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir . 1> /var/log/hadoop-yarn/containers/application_1461341357870_0004/container_1461341357870_0004_01_000015/taskmanager.out 2> /var/log/hadoop-yarn/containers/application_1461341357870_0004/container_1461341357870_0004_01_000015/taskmanager.err
    |- 4306 4102 4102 4102 (java) 172258 40265 9495257088 646460 /usr/lib/jvm/java-1.8.0/bin/java -Xms570m -Xmx570m -XX:MaxDirectMemorySize=1900m -Dlog.file=/var/log/hadoop-yarn/containers/application_1461341357870_0004/container_1461341357870_0004_01_000015/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir .

One thing that drew my attention is -Xmx570m. I expected the heap to be TaskManagerMemory * 0.75 (due to yarn.heap-cutoff-ratio). I run the application as follows:

    HADOOP_CONF_DIR=/etc/hadoop/conf flink run -m yarn-cluster -yn 18 -yjm 4096 -ytm 2500 eval-assembly-1.0.jar

In the Flink logs I do see 'Task Manager memory: 2500', yet when I look at the YARN container logs on the cluster node, the JVM starts with a 570 MB heap, which puzzles me.
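The only way I can make the numbers roughly add up is under two defaults I am assuming but have not verified for my version: yarn.heap-cutoff-ratio = 0.25 and taskmanager.memory.fraction = 0.7. If off-heap mode moves the managed memory out of the heap, only the remainder would end up as -Xmx. A quick sketch of that guess:

```python
# Back-of-the-envelope guess at how the -Xmx570m might have been derived.
# cutoff_ratio and managed_fraction are ASSUMED defaults, not taken from my logs.

container_mb = 2500        # -ytm 2500
cutoff_ratio = 0.25        # assumed yarn.heap-cutoff-ratio (safety margin for YARN)
managed_fraction = 0.7     # assumed taskmanager.memory.fraction (managed memory share)

after_cutoff = container_mb * (1 - cutoff_ratio)    # 2500 * 0.75 = 1875 MB
heap_mb = after_cutoff * (1 - managed_fraction)     # ~562 MB, close to the observed -Xmx570m
offheap_mb = after_cutoff * managed_fraction        # ~1312 MB of managed memory off-heap

print(int(after_cutoff), int(heap_mb), int(offheap_mb))
```

If that guess is right, the 0.75 I expected applies before the managed-memory split, not directly to -Xmx — but please correct me if the actual formula is different.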
When I look at the memory actually used by a YARN container with 'top', I see 2.2 GB. Am I interpreting these parameters correctly? I have also set the following (it failed the same way without it):

    taskmanager.memory.off-heap: true

I also don't understand why this happens at all. I assumed Flink would not overcommit allocated resources and would spill to disk when running out of heap memory. I'd appreciate it if someone could shed light on this as well.

Thanks,
Timur