Folks,

We are running a simple PageRank algorithm in Gelly with about 1M edges, and we are seeing that one of the TaskManagers just crashes. We suspect it is a configuration issue, because each TaskManager has a total of 136GB of memory and we have 8 of these, so the total memory should be more than enough.
Here is an excerpt from the TaskManager log:

2018-02-21 17:52:24,610 INFO org.apache.flink.runtime.taskmanager.TaskManager - --------------------------------------------------------------------------------
2018-02-21 17:52:24,626 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
2018-02-21 17:52:24,626 INFO org.apache.flink.runtime.taskmanager.TaskManager - OS current user: flink-user
2018-02-21 17:52:24,626 INFO org.apache.flink.runtime.taskmanager.TaskManager - Current Hadoop/Kerberos user: <no hadoop dependency found>
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.161-b14
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum heap size: 25400 MiBytes
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - JAVA_HOME: /usr/lib/jvm/jre-1.8.0-openjdk.x86_64
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - No Hadoop Dependency available
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - JVM Options:
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xms25395M
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Xmx25395M
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:MaxDirectMemorySize=8388607T
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:+UseG1GC
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:+PrintSafepointStatistics
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -XX:+HeapDumpOnOutOfMemoryError
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog.file=/home/flink-user/flink-1.4.0/log/flink-flink-user-taskmanager-0-ip-10-10-1-59.log
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlog4j.configuration=file:/home/flink-user/flink-1.4.0/conf/log4j.properties
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - -Dlogback.configurationFile=file:/home/flink-user/flink-1.4.0/conf/logback.xml
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - Program Arguments:
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - --configDir
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - /home/flink-user/flink-1.4.0/conf
2018-02-21 17:52:24,627 INFO org.apache.flink.runtime.taskmanager.TaskManager - Classpath: /home/flink-user/flink-1.4.0/lib/flink-gelly_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-gelly-scala_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-python_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-s3-fs-hadoop-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-s3-fs-presto-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/log4j-1.2.17.jar:/home/flink-user/flink-1.4.0/lib/slf4j-log4j12-1.7.7.jar:/home/flink-user/flink-1.4.0/lib/flink-dist_2.11-1.4.0.jar:::
2018-02-21 17:52:24,628 INFO org.apache.flink.runtime.taskmanager.TaskManager - --------------------------------------------------------------------------------
2018-02-21 17:52:24,629 INFO org.apache.flink.runtime.taskmanager.TaskManager - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-02-21 17:52:24,667 INFO org.apache.flink.runtime.taskmanager.TaskManager - Maximum number of open file descriptors is 768000
2018-02-21 17:52:24,728 INFO org.apache.flink.runtime.taskmanager.TaskManager - Loading configuration from /home/flink-user/flink-1.4.0/conf
2018-02-21 17:52:24,746 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, 10.10.1.242
2018-02-21 17:52:24,746 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-02-21 17:52:24,746 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 131072
2018-02-21 17:52:24,746 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 139264
2018-02-21 17:52:24,746 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 64
2018-02-21 17:52:24,747 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2018-02-21 17:52:24,747 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.off-heap, true
2018-02-21 17:52:24,747 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.fraction, 0.8
2018-02-21 17:52:24,747 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.network.memory.min, 4294967296
2018-02-21 17:52:24,747 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.network.memory.max, 12884901888
2018-02-21 17:52:24,747 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 512
2018-02-21 17:52:24,748 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: web.port, 8081
2018-02-21 17:52:24,748 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.tmp.dirs, /home/flink-user/flink-tmp-dir
2018-02-21 17:52:24,748 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.java.home, /usr/lib/jvm/jre-1.8.0-openjdk.x86_64
2018-02-21 17:52:24,749 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.java.opts, -XX:+UseG1GC -XX:+PrintSafepointStatistics -XX:+HeapDumpOnOutOfMemoryError
2018-02-21 17:52:24,749 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: akka.framesize, 201326591b
2018-02-21 17:52:24,749 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: akka.log.lifecycle.events, true
2018-02-21 17:52:24,749 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: akka.client.timeout, 300 s
2018-02-21 17:52:24,849 INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2018-02-21 17:52:24,965 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2018-02-21 17:52:25,188 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2018-02-21 17:52:25,347 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2018-02-21 17:52:25,348 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2018-02-21 17:52:25,350 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address /10.10.1.242:6123.
2018-02-21 17:52:25,367 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address 'ip-10-10-1-59' (10.10.1.59) for communication.
2018-02-21 17:52:25,405 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager
2018-02-21 17:52:25,406 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor system at ip-10-10-1-59:40949.
2018-02-21 17:52:25,408 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to start actor system at ip-10-10-1-59:40949
2018-02-21 17:52:26,493 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2018-02-21 17:52:26,553 INFO akka.remote.Remoting - Starting remoting
2018-02-21 17:52:27,021 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@ip-10-10-1-59:40949]
2018-02-21 17:52:27,022 INFO akka.remote.Remoting - Remoting now listens on addresses: [akka.tcp://flink@ip-10-10-1-59:40949]
2018-02-21 17:52:27,029 INFO org.apache.flink.runtime.taskmanager.TaskManager - Actor system started at akka.tcp://flink@ip-10-10-1-59:40949
2018-02-21 17:52:27,067 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2018-02-21 17:52:27,084 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor

---------------------

Here is the dump from the hs_err_pid file:

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2651), pid=2439, tid=0x00007fc4b7efe700
#
# JRE version: OpenJDK Runtime Environment (8.0_161-b14) (build 1.8.0_161-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.161-b14 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#

--------------- T H R E A D ---------------

Current thread (0x00007fb5afff8260):

--------------

In the JobManager we see the following:

2018-02-21 17:55:52,380 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Try to restart or fail the job Flink Java Job at Wed Feb 21 17:53:30 UTC 2018 (d55f327901087350c24e2a8c34937db1) if no longer possible.
2018-02-21 17:55:52,380 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Flink Java Job at Wed Feb 21 17:53:30 UTC 2018 (d55f327901087350c24e2a8c34937db1) switched from state FAILING to FAILED.
java.lang.Exception: The data preparation for task 'Reduce (Sum)' , caused an error: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:466)
    at org.apache.flink.runtime.iterative.task.AbstractIterativeTask.run(AbstractIterativeTask.java:145)
    at org.apache.flink.runtime.iterative.task.IterationIntermediateTask.run(IterationIntermediateTask.java:93)
    at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
    at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1095)
    at org.apache.flink.runtime.operators.ReduceDriver.prepare(ReduceDriver.java:95)
    at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:460)
    ... 5 more
Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.
    at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.

-------------

Here are the TaskManager settings:

# The heap size for the TaskManager JVM
taskmanager.heap.mb: 139264

# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.
taskmanager.numberOfTaskSlots: 64

# Specify whether TaskManager memory should be allocated when starting up (true) or when
# memory is required in the memory manager (false)
# Important Note: For pure streaming setups, we highly recommend to set this value to `false`
# as the default state backends currently do not use the managed memory.
taskmanager.memory.preallocate: false
taskmanager.memory.off-heap: true
taskmanager.memory.fraction: 0.8
#taskmanager.network.memory.fraction: 0.1
taskmanager.network.memory.min: 4294967296
taskmanager.network.memory.max: 12884901888
#taskmanager.network.numberOfBuffers: 8192
#taskmanager.debug.memory.startLogThread: true
#taskmanager.debug.memory.logIntervalMs: 500

# The parallelism used for programs that did not specify and other parallelism.
parallelism.default: 512

-----------

So, what are we doing wrong here?

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
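
One observation on the numbers: the -Xms/-Xmx value of 25395M in the TaskManager log looks consistent with the posted settings if network buffer memory is first subtracted from taskmanager.heap.mb and, with off-heap enabled, only (1 - taskmanager.memory.fraction) of the remainder is used as JVM heap. The sketch below is only a reading of that arithmetic, not Flink's actual code. The function names are hypothetical, and the 0.1 network fraction is an assumed default, since that setting is commented out in the configuration.

```python
# Sketch only: reproduces the logged -Xmx from the posted settings.
# Assumptions (not confirmed by the post): network memory is a fraction of
# taskmanager.heap.mb clamped to [min, max], and with off-heap enabled the
# JVM heap is (1 - taskmanager.memory.fraction) of what remains.

MB = 1024 * 1024


def network_memory_mb(total_mb, fraction, min_bytes, max_bytes):
    """Network buffer memory in MB: fraction of the total, clamped to [min, max]."""
    wanted_bytes = int(total_mb * MB * fraction)
    return min(max(wanted_bytes, min_bytes), max_bytes) // MB


def memory_breakdown_mb(total_mb, managed_fraction, net_fraction, net_min, net_max):
    """Split the configured total into network, JVM heap, and off-heap managed memory."""
    network = network_memory_mb(total_mb, net_fraction, net_min, net_max)
    remaining = total_mb - network
    heap = round(remaining * (1 - managed_fraction))
    managed_off_heap = remaining - heap
    return {"network": network, "heap": heap, "managed_off_heap": managed_off_heap}


breakdown = memory_breakdown_mb(
    total_mb=139264,       # taskmanager.heap.mb
    managed_fraction=0.8,  # taskmanager.memory.fraction
    net_fraction=0.1,      # assumed default; commented out in the posted config
    net_min=4294967296,    # taskmanager.network.memory.min
    net_max=12884901888,   # taskmanager.network.memory.max
)
print(breakdown)  # {'network': 12288, 'heap': 25395, 'managed_off_heap': 101581}
```

If this reading is right, the three parts add up to the full 139264 MB, which is 136 GB and therefore the entire memory of the machine. That would leave no headroom for the JVM's own native allocations or the operating system, which may be related to the failed mmap reported in the hs_err_pid file.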