Folks,

We are running a simple PageRank algorithm in Gelly on a graph with about 1M
edges, and we are seeing one of the TaskManagers crash. We suspect a
configuration issue, because each TaskManager has a total of 136 GB of memory
and we have 8 of these, so the total memory should be more than enough.

Here is an excerpt from the TaskManager log:

2018-02-21 17:52:24,610 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
2018-02-21 17:52:24,626 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Starting TaskManager (Version: 1.4.0, Rev:3a9d9f2, Date:06.12.2017 @ 11:08:40 UTC)
2018-02-21 17:52:24,626 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  OS current user: flink-user
2018-02-21 17:52:24,626 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Current Hadoop/Kerberos user: <no hadoop dependency found>
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.161-b14
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Maximum heap size: 25400 MiBytes
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JAVA_HOME: /usr/lib/jvm/jre-1.8.0-openjdk.x86_64
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  No Hadoop Dependency available
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  JVM Options:
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xms25395M
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Xmx25395M
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:MaxDirectMemorySize=8388607T
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:+UseG1GC
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:+PrintSafepointStatistics
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -XX:+HeapDumpOnOutOfMemoryError
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog.file=/home/flink-user/flink-1.4.0/log/flink-flink-user-taskmanager-0-ip-10-10-1-59.log
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlog4j.configuration=file:/home/flink-user/flink-1.4.0/conf/log4j.properties
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     -Dlogback.configurationFile=file:/home/flink-user/flink-1.4.0/conf/logback.xml
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Program Arguments:
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     --configDir
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -     /home/flink-user/flink-1.4.0/conf
2018-02-21 17:52:24,627 INFO  org.apache.flink.runtime.taskmanager.TaskManager              -  Classpath: /home/flink-user/flink-1.4.0/lib/flink-gelly_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-gelly-scala_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-python_2.11-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-s3-fs-hadoop-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/flink-s3-fs-presto-1.4.0.jar:/home/flink-user/flink-1.4.0/lib/log4j-1.2.17.jar:/home/flink-user/flink-1.4.0/lib/slf4j-log4j12-1.7.7.jar:/home/flink-user/flink-1.4.0/lib/flink-dist_2.11-1.4.0.jar:::
2018-02-21 17:52:24,628 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - --------------------------------------------------------------------------------
2018-02-21 17:52:24,629 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-02-21 17:52:24,667 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Maximum number of open file descriptors is 768000
2018-02-21 17:52:24,728 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Loading configuration from /home/flink-user/flink-1.4.0/conf
2018-02-21 17:52:24,746 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, 10.10.1.242
2018-02-21 17:52:24,746 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2018-02-21 17:52:24,746 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.mb, 131072
2018-02-21 17:52:24,746 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.heap.mb, 139264
2018-02-21 17:52:24,746 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.numberOfTaskSlots, 64
2018-02-21 17:52:24,747 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.preallocate, false
2018-02-21 17:52:24,747 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.off-heap, true
2018-02-21 17:52:24,747 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.memory.fraction, 0.8
2018-02-21 17:52:24,747 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.network.memory.min, 4294967296
2018-02-21 17:52:24,747 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.network.memory.max, 12884901888
2018-02-21 17:52:24,747 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: parallelism.default, 512
2018-02-21 17:52:24,748 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: web.port, 8081
2018-02-21 17:52:24,748 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: taskmanager.tmp.dirs, /home/flink-user/flink-tmp-dir
2018-02-21 17:52:24,748 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: env.java.home, /usr/lib/jvm/jre-1.8.0-openjdk.x86_64
2018-02-21 17:52:24,749 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: env.java.opts, -XX:+UseG1GC -XX:+PrintSafepointStatistics -XX:+HeapDumpOnOutOfMemoryError
2018-02-21 17:52:24,749 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.framesize, 201326591b
2018-02-21 17:52:24,749 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.log.lifecycle.events, true
2018-02-21 17:52:24,749 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: akka.client.timeout, 300 s
2018-02-21 17:52:24,849 INFO  org.apache.flink.core.fs.FileSystem                           - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2018-02-21 17:52:24,965 INFO  org.apache.flink.runtime.security.modules.HadoopModuleFactory  - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2018-02-21 17:52:25,188 INFO  org.apache.flink.runtime.security.SecurityUtils               - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2018-02-21 17:52:25,347 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - Trying to select the network interface and address to use by connecting to the leading JobManager.
2018-02-21 17:52:25,348 INFO  org.apache.flink.runtime.util.LeaderRetrievalUtils            - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2018-02-21 17:52:25,350 INFO  org.apache.flink.runtime.net.ConnectionUtils                  - Retrieved new target address /10.10.1.242:6123.
2018-02-21 17:52:25,367 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager will use hostname/address 'ip-10-10-1-59' (10.10.1.59) for communication.
2018-02-21 17:52:25,405 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager
2018-02-21 17:52:25,406 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor system at ip-10-10-1-59:40949.
2018-02-21 17:52:25,408 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to start actor system at ip-10-10-1-59:40949
2018-02-21 17:52:26,493 INFO  akka.event.slf4j.Slf4jLogger                                  - Slf4jLogger started
2018-02-21 17:52:26,553 INFO  akka.remote.Remoting                                          - Starting remoting
2018-02-21 17:52:27,021 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@ip-10-10-1-59:40949]
2018-02-21 17:52:27,022 INFO  akka.remote.Remoting                                          - Remoting now listens on addresses: [akka.tcp://flink@ip-10-10-1-59:40949]
2018-02-21 17:52:27,029 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Actor system started at akka.tcp://flink@ip-10-10-1-59:40949
2018-02-21 17:52:27,067 INFO  org.apache.flink.runtime.metrics.MetricRegistryImpl           - No metrics reporter configured, no metrics will be exposed/reported.
2018-02-21 17:52:27,084 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Starting TaskManager actor
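
For reference, the -Xmx25395M in the log above looks consistent with the
configuration properties loaded a few lines later. The sketch below reproduces
that number under an assumption about how the Flink 1.4 startup scripts split
memory when taskmanager.memory.off-heap is true: network buffer memory is
taken out of taskmanager.heap.mb first, and only
(1 - taskmanager.memory.fraction) of the remainder becomes JVM heap. The
function name is made up for illustration, and the 0.1 network fraction is
assumed to be the default in effect (it appears only as a commented-out line
in the settings), so treat this as a plausibility check rather than the
actual Flink logic.

```python
MIB = 1024 * 1024


def derive_memory_split_mb(total_mb, net_fraction, net_min_bytes,
                           net_max_bytes, managed_fraction):
    """Hypothetical helper: split a memory budget into (network, off-heap managed, heap) in MB.

    Assumed model, not copied from the Flink sources:
      network  = clamp(total * net_fraction, net_min, net_max)
      heap     = round((total - network) * (1 - managed_fraction))
      off-heap = whatever is left of (total - network)
    """
    total_bytes = total_mb * MIB
    network_bytes = min(max(int(total_bytes * net_fraction), net_min_bytes),
                        net_max_bytes)
    network_mb = network_bytes // MIB
    remaining_mb = total_mb - network_mb
    heap_mb = round(remaining_mb * (1 - managed_fraction))
    offheap_managed_mb = remaining_mb - heap_mb
    return network_mb, offheap_managed_mb, heap_mb


# Values taken from the "Loading configuration property" lines in the log.
network_mb, offheap_managed_mb, heap_mb = derive_memory_split_mb(
    total_mb=139264,               # taskmanager.heap.mb
    net_fraction=0.1,              # assumed default network fraction
    net_min_bytes=4294967296,      # taskmanager.network.memory.min
    net_max_bytes=12884901888,     # taskmanager.network.memory.max
    managed_fraction=0.8,          # taskmanager.memory.fraction
)

print("network buffers (MB):", network_mb)
print("off-heap managed (MB):", offheap_managed_mb)
print("JVM heap (MB):", heap_mb)
```

Under these assumptions the heap comes out at 25395 MB, which matches the
-Xms/-Xmx values in the log. The remaining roughly 111 GB of the configured
139264 MB would then live outside the Java heap.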


---------------------

Here is the dump from the hs_err_pid file:

#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2651), pid=2439, tid=0x00007fc4b7efe700
#
# JRE version: OpenJDK Runtime Environment (8.0_161-b14) (build 1.8.0_161-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.161-b14 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#

---------------  T H R E A D  ---------------

Current thread (0x00007fb5afff8260):
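
Since the dump reports a failed native (mmap) allocation rather than a Java
heap OutOfMemoryError, one quick sanity check is to compare the configured
taskmanager.heap.mb with the physical memory of the machine. The sketch below
does only that arithmetic. Whether "136 GB" means binary gigabytes (GiB) or
decimal gigabytes is not known here, so both readings are shown, and the
variable names are illustrative only.

```python
MIB = 1024 * 1024

configured_mb = 139264  # taskmanager.heap.mb from the settings

# Two possible readings of "136 GB" of physical memory per machine (assumption).
physical_mb_if_binary = 136 * 1024             # 136 GiB expressed in MiB
physical_mb_if_decimal = 136 * 10**9 // MIB    # 136e9 bytes expressed in MiB

headroom_binary_mb = physical_mb_if_binary - configured_mb
headroom_decimal_mb = physical_mb_if_decimal - configured_mb

print("headroom if 136 GiB (MB):", headroom_binary_mb)
print("headroom if 136 GB decimal (MB):", headroom_decimal_mb)
```

In the binary reading the configured value equals the whole machine, and in
the decimal reading it exceeds it. Either way this sketch suggests that
nothing is set aside for the operating system or for JVM-internal native
memory such as thread stacks and the code cache, which may or may not be
related to the failure above.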


--------------

In the JobManager we see the following:

2018-02-21 17:55:52,380 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Try to restart or fail the job Flink Java Job at Wed Feb 21 17:53:30 UTC 2018 (d55f327901087350c24e2a8c34937db1) if no longer possible.
2018-02-21 17:55:52,380 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Flink Java Job at Wed Feb 21 17:53:30 UTC 2018 (d55f327901087350c24e2a8c34937db1) switched from state FAILING to FAILED.
java.lang.Exception: The data preparation for task 'Reduce (Sum)' , caused an error: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.
        at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:466)
        at org.apache.flink.runtime.iterative.task.AbstractIterativeTask.run(AbstractIterativeTask.java:145)
        at org.apache.flink.runtime.iterative.task.IterationIntermediateTask.run(IterationIntermediateTask.java:93)
        at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.
        at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
        at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1095)
        at org.apache.flink.runtime.operators.ReduceDriver.prepare(ReduceDriver.java:95)
        at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:460)
        ... 5 more
Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.
        at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-10-10-1-59/10.10.1.59:37805'. This might indicate that the remote task manager was lost.


-------------

Here are the TaskManager settings:

# The heap size for the TaskManager JVM

taskmanager.heap.mb: 139264


# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.

taskmanager.numberOfTaskSlots: 64

# Specify whether TaskManager memory should be allocated when starting up (true) or when
# memory is required in the memory manager (false)
# Important Note: For pure streaming setups, we highly recommend to set this value to `false`
# as the default state backends currently do not use the managed memory.

taskmanager.memory.preallocate: false
taskmanager.memory.off-heap: true
taskmanager.memory.fraction: 0.8

#taskmanager.network.memory.fraction: 0.1
taskmanager.network.memory.min: 4294967296
taskmanager.network.memory.max: 12884901888

#taskmanager.network.numberOfBuffers: 8192
#taskmanager.debug.memory.startLogThread: true
#taskmanager.debug.memory.logIntervalMs: 500

# The parallelism used for programs that did not specify and other parallelism.

parallelism.default: 512
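
As a small cross-check of these settings, the default parallelism lines up
with the slot count: 8 TaskManagers with 64 slots each give 512 slots. The
sketch below only restates that arithmetic; the variable names are
illustrative, and the count of 8 TaskManagers comes from the description at
the top of this mail rather than from the config file.

```python
task_managers = 8            # number of TaskManagers, as stated at the top
slots_per_task_manager = 64  # taskmanager.numberOfTaskSlots
default_parallelism = 512    # parallelism.default

total_slots = task_managers * slots_per_task_manager
subtasks_per_task_manager = default_parallelism // task_managers

print("total slots:", total_slots)
print("default parallelism fits:", default_parallelism <= total_slots)
print("parallel subtasks per TaskManager at full parallelism:",
      subtasks_per_task_manager)
```

So a job running at the default parallelism would occupy every slot on every
TaskManager, with no slots left spare.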

-----------

So, what are we doing wrong here?





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
