I think it might be that the computation is to CPU heavy, which makes the TaskManager unresponsive to any JobManager messages and so the JobManager thinks that the TaskManager is lost.
@Till, do you have another idea about what could be going on? > On 15. Sep 2017, at 13:52, AndreaKinn <kinn6...@hotmail.it> wrote: > > the job manager log probably is more interesting: > > 2017-09-15 12:47:45,420 WARN org.apache.hadoop.util.NativeCodeLoader > > - Unable to load native-hadoop library for your platform... using > builtin-java classes where applicable > 2017-09-15 12:47:45,650 INFO org.apache.flink.runtime.jobmanager.JobManager > > - > -------------------------------------------------------------------------------- > 2017-09-15 12:47:45,650 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Starting JobManager (Version: 1.3.2, Rev:0399bee, Date:03.08.2017 @ > 10:23:11 UTC) > 2017-09-15 12:47:45,650 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Current user: giordano > 2017-09-15 12:47:45,651 INFO org.apache.flink.runtime.jobmanager.JobManager > > - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - > 1.8/25.131-b11 > 2017-09-15 12:47:45,652 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Maximum heap size: 491 MiBytes > 2017-09-15 12:47:45,652 INFO org.apache.flink.runtime.jobmanager.JobManager > > - JAVA_HOME: /usr/lib/jvm/java-8-oracle > 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Hadoop version: 2.7.2 > 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager > > - JVM Options: > 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager > > - -Xms512m > 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager > > - -Xmx512m > 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager > > - > -Dlog.file=/home/giordano/flink-1.3.2/log/flink-giordano-jobmanager-0-giordano-2-2-100-1.log > 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager > > - > -Dlog4j.configuration=file:/home/giordano/flink-1.3.2/conf/log4j.properties > 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager > > - > -Dlogback.configurationFile=file:/home/giordano/flink-1.3.2/conf/logback.xml > 2017-09-15 12:47:45,658 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Program Arguments: > 2017-09-15 12:47:45,659 INFO org.apache.flink.runtime.jobmanager.JobManager > > - --configDir > 2017-09-15 12:47:45,659 INFO org.apache.flink.runtime.jobmanager.JobManager > > - /home/giordano/flink-1.3.2/conf > 2017-09-15 12:47:45,659 INFO org.apache.flink.runtime.jobmanager.JobManager > > - --executionMode > 2017-09-15 12:47:45,659 INFO org.apache.flink.runtime.jobmanager.JobManager > > - cluster > 2017-09-15 12:47:45,659 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Classpath: > /home/giordano/flink-1.3.2/lib/flink-python_2.11-1.3.2.jar:/home/giordano/flink-1.3.2/lib/flin$ > 2017-09-15 12:47:45,659 INFO org.apache.flink.runtime.jobmanager.JobManager > > - > -------------------------------------------------------------------------------- > 2017-09-15 12:47:45,661 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Registered UNIX signal handlers for [TERM, HUP, INT] > 2017-09-15 12:47:45,947 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Loading configuration from /home/giordano/flink-1.3.2/conf > 2017-09-15 12:47:45,953 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: env.java.home, /usr/lib/jvm/java-8-oracle > 2017-09-15 12:47:45,953 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.rpc.address, localhost > 2017-09-15 12:47:45,953 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.rpc.port, 6123 > 2017-09-15 12:47:45,954 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.heap.mb, 512 > 2017-09-15 12:47:45,954 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: taskmanager.numberOfTaskSlots, 2 > 2017-09-15 12:47:45,954 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: taskmanager.memory.preallocate, false > 2017-09-15 12:47:45,955 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.web.port, 8081 > 2017-09-15 12:47:45,956 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.backend, filesystem > 2017-09-15 12:47:45,956 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.backend.fs.checkpointdir, > file:///home/flink-checkpoints > 2017-09-15 12:47:45,970 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Starting JobManager without high-availability > 2017-09-15 12:47:45,973 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Starting JobManager on localhost:6123 with execution mode CLUSTER > 2017-09-15 12:47:45,993 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: env.java.home, /usr/lib/jvm/java-8-oracle > 2017-09-15 12:47:45,995 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.rpc.address, localhost > 2017-09-15 12:47:45,995 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.rpc.port, 6123 > 2017-09-15 12:47:45,995 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.heap.mb, 512 > 2017-09-15 12:47:45,995 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: taskmanager.numberOfTaskSlots, 2 > 2017-09-15 12:47:45,996 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: taskmanager.memory.preallocate, false > 2017-09-15 12:47:45,996 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: jobmanager.web.port, 8081 > 2017-09-15 12:47:45,996 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.backend, filesystem > 2017-09-15 12:47:45,996 INFO > org.apache.flink.configuration.GlobalConfiguration - Loading > configuration property: state.backend.fs.checkpointdir, > file:///home/flink-checkpoints > 2017-09-15 12:47:46,045 INFO > org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user > set to giordano (auth:SIMPLE) > 2017-09-15 12:47:46,209 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Starting JobManager actor system reachable at localhost:6123 > 2017-09-15 12:47:46,878 INFO akka.event.slf4j.Slf4jLogger > > - Slf4jLogger started > 2017-09-15 12:47:46,982 INFO Remoting > > - Starting remoting > 2017-09-15 12:47:47,392 INFO Remoting > > - Remoting started; listening on addresses > :[akka.tcp://flink@localhost:6123] > 2017-09-15 12:47:47,423 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Starting JobManager web frontend > 2017-09-15 12:47:47,433 INFO > org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined > location of JobManager log file: > /home/giordano/flink-1.3.2/log/flink-giordano-jobmanager-0-gio$ > 2017-09-15 12:47:47,434 INFO > org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined > location of JobManager stdout file: > /home/giordano/flink-1.3.2/log/flink-giordano-jobmanager-0-$ > 2017-09-15 12:47:47,434 INFO > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using > directory /tmp/flink-web-f9816186-7918-4475-8359-c3cafb63559a for the web > interface files > 2017-09-15 12:47:47,435 INFO > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using > directory /tmp/flink-web-75aba058-8587-4181-bf16-ee2e68d9b70c for web > frontend JAR file uploads > 2017-09-15 12:47:47,853 INFO > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend > listening at 0:0:0:0:0:0:0:0:8081 > 2017-09-15 12:47:47,854 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Starting JobManager actor > 2017-09-15 12:47:47,876 INFO org.apache.flink.runtime.blob.BlobServer > > - Created BLOB server storage directory > /tmp/blobStore-9ad4e807-069e-4fb7-88b3-410fbdcb5eb0 > 2017-09-15 12:47:47,879 INFO org.apache.flink.runtime.blob.BlobServer > > - Started BLOB server at 0.0.0.0:39682 - max concurrent requests: 50 - max > backlog: 1000 > 2017-09-15 12:47:47,896 INFO > org.apache.flink.runtime.metrics.MetricRegistry - No metrics > reporter configured, no metrics will be exposed/reported. > 2017-09-15 12:47:47,906 INFO > org.apache.flink.runtime.jobmanager.MemoryArchivist - Started > memory archivist akka://flink/user/archive > 2017-09-15 12:47:47,914 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Starting JobManager at akka.tcp://flink@localhost:6123/user/jobmanager. > 2017-09-15 12:47:47,929 INFO > org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting > with JobManager akka.tcp://flink@localhost:6123/user/jobmanager on port 8081 > 2017-09-15 12:47:47,930 INFO > org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader > reachable under > akka.tcp://flink@localhost:6123/user/jobmanager:00000000-0000-0000-0000-0000000$ > 2017-09-15 12:47:47,970 INFO > org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager > > - Trying to associate with JobManager leader > akka.tcp://flink@localhost:6123/user/jobmanag$ > 2017-09-15 12:47:48,010 INFO org.apache.flink.runtime.jobmanager.JobManager > > - JobManager akka.tcp://flink@localhost:6123/user/jobmanager was granted > leadership with leader session ID S$ > 2017-09-15 12:47:48,034 INFO > org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager > > - Resource Manager associating with leading JobManager > Actor[akka://flink/user/jobmanager#$ > 2017-09-15 12:47:51,071 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:51,383 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:51,718 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:52,109 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:52,414 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:53,130 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:54,365 INFO > org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager > > - TaskManager cf04d1390ff86aba4d1702ef1a0d2b67 has started. > 2017-09-15 12:47:54,368 INFO > org.apache.flink.runtime.instance.InstanceManager - Registered > TaskManager at giordano-2-2-100-1 > (akka.tcp://flink@giordano-2-2-100-1:35127/user/taskmanager) $ > 2017-09-15 12:47:54,434 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:55,150 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:58,455 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:47:59,172 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:48:01,650 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Submitting job df1f60ca168364759c69dbe078544346 (Flink Streaming Job). > 2017-09-15 12:48:01,768 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Using restart strategy > FixedDelayRestartStrategy(maxNumberRestartAttempts=1, > delayBetweenRestartAttempts=0$ > 2017-09-15 12:48:01,782 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Job recovers > via failover strategy: full graph restart > 2017-09-15 12:48:01,796 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Running initialization on master for job Flink Streaming Job > (df1f60ca168364759c69dbe078544346). > 2017-09-15 12:48:01,796 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Successfully ran initialization on master in 0 ms. > 2017-09-15 12:48:01,830 INFO org.apache.flink.runtime.jobmanager.JobManager > > - State backend is set to heap memory (checkpoints to filesystem > "file:/home/flink-checkpoints") > 2017-09-15 12:48:01,843 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Scheduling job df1f60ca168364759c69dbe078544346 (Flink Streaming Job). > 2017-09-15 12:48:01,844 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Job Flink > Streaming Job (df1f60ca168364759c69dbe078544346) switched from state CREATED > to RUNNING. > 2017-09-15 12:48:01,861 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: > Custom Source -> Timestamps/Watermarks (1/1) > (bc0e95e951deb6680cff372a954950d2) switched from CREA$ > 2017-09-15 12:48:01,882 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Map -> Sink: > Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from CREATED to > SCHEDULED. > 2017-09-15 12:48:01,883 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Learn -> > Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1) > (d24f111f9720c8e4df77f67a$ > 2017-09-15 12:48:01,895 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: > Custom Source -> Timestamps/Watermarks (1/1) > (bc0e95e951deb6680cff372a954950d2) switched from SCHE$ > 2017-09-15 12:48:01,912 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying > Source: Custom Source -> Timestamps/Watermarks (1/1) (attempt #0) to > giordano-2-2-100-1 > 2017-09-15 12:48:01,933 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Map -> Sink: > Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from SCHEDULED to > DEPLOYING. > 2017-09-15 12:48:01,933 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying > Map -> Sink: Unnamed (1/1) (attempt #0) to giordano-2-2-100-1 > 2017-09-15 12:48:01,941 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Learn -> > Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1) > (d24f111f9720c8e4df77f67a$ > 2017-09-15 12:48:01,941 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying > Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) > (1/1) (attempt #0) to$ > 2017-09-15 12:48:03,018 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Learn -> > Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1) > (d24f111f9720c8e4df77f67a$ > 2017-09-15 12:48:03,038 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: > Custom Source -> Timestamps/Watermarks (1/1) > (bc0e95e951deb6680cff372a954950d2) switched from DEPL$ > 2017-09-15 12:48:03,046 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Map -> Sink: > Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from DEPLOYING to > RUNNING. > 2017-09-15 12:48:06,474 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:48:07,189 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:48:22,495 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > $flink@giordano-2-2-100-1:6123/]] arriving at > [akka.tcp://flink@giordano-2-2-100-1:6123] inbound addresses are > [akka.tcp://flink@localhost:6123] > 2017-09-15 12:48:52,506 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:48:53,212 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:49:22,524 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:49:23,230 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:49:52,544 ERROR akka.remote.EndpointWriter > > - dropping message [class akka.actor.ActorSelectionMessage] for non-local > recipient [Actor[akka.tcp://flink@$ > 2017-09-15 12:49:59,410 WARN akka.remote.RemoteWatcher > > - Detected unreachable: [akka.tcp://flink@giordano-2-2-100-1:35127] > 2017-09-15 12:49:59,445 INFO org.apache.flink.runtime.jobmanager.JobManager > > - Task manager akka.tcp://flink@giordano-2-2-100-1:35127/user/taskmanager > terminated. > 2017-09-15 12:49:59,446 INFO > org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: > Custom Source -> Timestamps/Watermarks (1/1) > (bc0e95e951deb6680cff372a954950d2) switched from RUNN$ > java.lang.Exception: TaskManager was lost/killed: > cf04d1390ff86aba4d1702ef1a0d2b67 @ giordano-2-2-100-1 (dataPort=36806) > > it tag as unreachable the task manager (who reside on the same node...) > > > > -- > Sent from: > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/