I think it might be that the computation is to CPU heavy, which makes the 
TaskManager unresponsive to any JobManager messages and so the JobManager 
thinks that the TaskManager is lost.

@Till, do you have another idea about what could be going on?

> On 15. Sep 2017, at 13:52, AndreaKinn <kinn6...@hotmail.it> wrote:
> 
> the job manager log probably is more interesting:
> 
> 2017-09-15 12:47:45,420 WARN  org.apache.hadoop.util.NativeCodeLoader         
>              
> - Unable to load native-hadoop library for your platform... using
> builtin-java classes where applicable
> 2017-09-15 12:47:45,650 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -
> --------------------------------------------------------------------------------
> 2017-09-15 12:47:45,650 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -  Starting JobManager (Version: 1.3.2, Rev:0399bee, Date:03.08.2017 @
> 10:23:11 UTC)
> 2017-09-15 12:47:45,650 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -  Current user: giordano
> 2017-09-15 12:47:45,651 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation -
> 1.8/25.131-b11
> 2017-09-15 12:47:45,652 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -  Maximum heap size: 491 MiBytes
> 2017-09-15 12:47:45,652 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -  JAVA_HOME: /usr/lib/jvm/java-8-oracle
> 2017-09-15 12:47:45,658 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -  Hadoop version: 2.7.2
> 2017-09-15 12:47:45,658 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -  JVM Options:
> 2017-09-15 12:47:45,658 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -     -Xms512m
> 2017-09-15 12:47:45,658 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -     -Xmx512m
> 2017-09-15 12:47:45,658 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -    
> -Dlog.file=/home/giordano/flink-1.3.2/log/flink-giordano-jobmanager-0-giordano-2-2-100-1.log
> 2017-09-15 12:47:45,658 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -    
> -Dlog4j.configuration=file:/home/giordano/flink-1.3.2/conf/log4j.properties
> 2017-09-15 12:47:45,658 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -    
> -Dlogback.configurationFile=file:/home/giordano/flink-1.3.2/conf/logback.xml
> 2017-09-15 12:47:45,658 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -  Program Arguments:
> 2017-09-15 12:47:45,659 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -     --configDir
> 2017-09-15 12:47:45,659 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -     /home/giordano/flink-1.3.2/conf
> 2017-09-15 12:47:45,659 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -     --executionMode
> 2017-09-15 12:47:45,659 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -     cluster
> 2017-09-15 12:47:45,659 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -  Classpath:
> /home/giordano/flink-1.3.2/lib/flink-python_2.11-1.3.2.jar:/home/giordano/flink-1.3.2/lib/flin$
> 2017-09-15 12:47:45,659 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> -
> --------------------------------------------------------------------------------
> 2017-09-15 12:47:45,661 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Registered UNIX signal handlers for [TERM, HUP, INT]
> 2017-09-15 12:47:45,947 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Loading configuration from /home/giordano/flink-1.3.2/conf
> 2017-09-15 12:47:45,953 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: env.java.home, /usr/lib/jvm/java-8-oracle
> 2017-09-15 12:47:45,953 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: jobmanager.rpc.address, localhost
> 2017-09-15 12:47:45,953 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: jobmanager.rpc.port, 6123
> 2017-09-15 12:47:45,954 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: jobmanager.heap.mb, 512
> 2017-09-15 12:47:45,954 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: taskmanager.numberOfTaskSlots, 2
> 2017-09-15 12:47:45,954 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: taskmanager.memory.preallocate, false
> 2017-09-15 12:47:45,955 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: jobmanager.web.port, 8081
> 2017-09-15 12:47:45,956 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: state.backend, filesystem
> 2017-09-15 12:47:45,956 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: state.backend.fs.checkpointdir,
> file:///home/flink-checkpoints
> 2017-09-15 12:47:45,970 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Starting JobManager without high-availability
> 2017-09-15 12:47:45,973 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Starting JobManager on localhost:6123 with execution mode CLUSTER
> 2017-09-15 12:47:45,993 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: env.java.home, /usr/lib/jvm/java-8-oracle
> 2017-09-15 12:47:45,995 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: jobmanager.rpc.address, localhost
> 2017-09-15 12:47:45,995 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: jobmanager.rpc.port, 6123
> 2017-09-15 12:47:45,995 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: jobmanager.heap.mb, 512
> 2017-09-15 12:47:45,995 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: taskmanager.numberOfTaskSlots, 2
> 2017-09-15 12:47:45,996 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: taskmanager.memory.preallocate, false
> 2017-09-15 12:47:45,996 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: jobmanager.web.port, 8081
> 2017-09-15 12:47:45,996 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: state.backend, filesystem
> 2017-09-15 12:47:45,996 INFO 
> org.apache.flink.configuration.GlobalConfiguration            - Loading
> configuration property: state.backend.fs.checkpointdir,
> file:///home/flink-checkpoints
> 2017-09-15 12:47:46,045 INFO 
> org.apache.flink.runtime.security.modules.HadoopModule        - Hadoop user
> set to giordano (auth:SIMPLE)
> 2017-09-15 12:47:46,209 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Starting JobManager actor system reachable at localhost:6123
> 2017-09-15 12:47:46,878 INFO  akka.event.slf4j.Slf4jLogger                    
>              
> - Slf4jLogger started
> 2017-09-15 12:47:46,982 INFO  Remoting                                        
>              
> - Starting remoting
> 2017-09-15 12:47:47,392 INFO  Remoting                                        
>              
> - Remoting started; listening on addresses
> :[akka.tcp://flink@localhost:6123]
> 2017-09-15 12:47:47,423 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Starting JobManager web frontend
> 2017-09-15 12:47:47,433 INFO 
> org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined
> location of JobManager log file:
> /home/giordano/flink-1.3.2/log/flink-giordano-jobmanager-0-gio$
> 2017-09-15 12:47:47,434 INFO 
> org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined
> location of JobManager stdout file:
> /home/giordano/flink-1.3.2/log/flink-giordano-jobmanager-0-$
> 2017-09-15 12:47:47,434 INFO 
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using
> directory /tmp/flink-web-f9816186-7918-4475-8359-c3cafb63559a for the web
> interface files
> 2017-09-15 12:47:47,435 INFO 
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using
> directory /tmp/flink-web-75aba058-8587-4181-bf16-ee2e68d9b70c for web
> frontend JAR file uploads
> 2017-09-15 12:47:47,853 INFO 
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web frontend
> listening at 0:0:0:0:0:0:0:0:8081
> 2017-09-15 12:47:47,854 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Starting JobManager actor
> 2017-09-15 12:47:47,876 INFO  org.apache.flink.runtime.blob.BlobServer        
>              
> - Created BLOB server storage directory
> /tmp/blobStore-9ad4e807-069e-4fb7-88b3-410fbdcb5eb0
> 2017-09-15 12:47:47,879 INFO  org.apache.flink.runtime.blob.BlobServer        
>              
> - Started BLOB server at 0.0.0.0:39682 - max concurrent requests: 50 - max
> backlog: 1000
> 2017-09-15 12:47:47,896 INFO 
> org.apache.flink.runtime.metrics.MetricRegistry               - No metrics
> reporter configured, no metrics will be exposed/reported.
> 2017-09-15 12:47:47,906 INFO 
> org.apache.flink.runtime.jobmanager.MemoryArchivist           - Started
> memory archivist akka://flink/user/archive
> 2017-09-15 12:47:47,914 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Starting JobManager at akka.tcp://flink@localhost:6123/user/jobmanager.
> 2017-09-15 12:47:47,929 INFO 
> org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting
> with JobManager akka.tcp://flink@localhost:6123/user/jobmanager on port 8081
> 2017-09-15 12:47:47,930 INFO 
> org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader
> reachable under
> akka.tcp://flink@localhost:6123/user/jobmanager:00000000-0000-0000-0000-0000000$
> 2017-09-15 12:47:47,970 INFO 
> org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager
>  
> - Trying to associate with JobManager leader
> akka.tcp://flink@localhost:6123/user/jobmanag$
> 2017-09-15 12:47:48,010 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - JobManager akka.tcp://flink@localhost:6123/user/jobmanager was granted
> leadership with leader session ID S$
> 2017-09-15 12:47:48,034 INFO 
> org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager
>  
> - Resource Manager associating with leading JobManager
> Actor[akka://flink/user/jobmanager#$
> 2017-09-15 12:47:51,071 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:47:51,383 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:47:51,718 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:47:52,109 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:47:52,414 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:47:53,130 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:47:54,365 INFO 
> org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager
>  
> - TaskManager cf04d1390ff86aba4d1702ef1a0d2b67 has started.
> 2017-09-15 12:47:54,368 INFO 
> org.apache.flink.runtime.instance.InstanceManager             - Registered
> TaskManager at giordano-2-2-100-1
> (akka.tcp://flink@giordano-2-2-100-1:35127/user/taskmanager) $
> 2017-09-15 12:47:54,434 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:47:55,150 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:47:58,455 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:47:59,172 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:48:01,650 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Submitting job df1f60ca168364759c69dbe078544346 (Flink Streaming Job).
> 2017-09-15 12:48:01,768 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Using restart strategy
> FixedDelayRestartStrategy(maxNumberRestartAttempts=1,
> delayBetweenRestartAttempts=0$
> 2017-09-15 12:48:01,782 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job recovers
> via failover strategy: full graph restart
> 2017-09-15 12:48:01,796 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Running initialization on master for job Flink Streaming Job
> (df1f60ca168364759c69dbe078544346).
> 2017-09-15 12:48:01,796 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Successfully ran initialization on master in 0 ms.
> 2017-09-15 12:48:01,830 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - State backend is set to heap memory (checkpoints to filesystem
> "file:/home/flink-checkpoints")
> 2017-09-15 12:48:01,843 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Scheduling job df1f60ca168364759c69dbe078544346 (Flink Streaming Job).
> 2017-09-15 12:48:01,844 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Flink
> Streaming Job (df1f60ca168364759c69dbe078544346) switched from state CREATED
> to RUNNING.
> 2017-09-15 12:48:01,861 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
> Custom Source -> Timestamps/Watermarks (1/1)
> (bc0e95e951deb6680cff372a954950d2) switched from CREA$
> 2017-09-15 12:48:01,882 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Map -> Sink:
> Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from CREATED to
> SCHEDULED.
> 2017-09-15 12:48:01,883 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Learn ->
> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1)
> (d24f111f9720c8e4df77f67a$
> 2017-09-15 12:48:01,895 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
> Custom Source -> Timestamps/Watermarks (1/1)
> (bc0e95e951deb6680cff372a954950d2) switched from SCHE$
> 2017-09-15 12:48:01,912 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying
> Source: Custom Source -> Timestamps/Watermarks (1/1) (attempt #0) to
> giordano-2-2-100-1
> 2017-09-15 12:48:01,933 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Map -> Sink:
> Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from SCHEDULED to
> DEPLOYING.
> 2017-09-15 12:48:01,933 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying
> Map -> Sink: Unnamed (1/1) (attempt #0) to giordano-2-2-100-1
> 2017-09-15 12:48:01,941 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Learn ->
> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1)
> (d24f111f9720c8e4df77f67a$
> 2017-09-15 12:48:01,941 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying
> Learn -> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink)
> (1/1) (attempt #0) to$
> 2017-09-15 12:48:03,018 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Learn ->
> Select -> Process -> (Sink: Cassandra Sink, Sink: Cassandra Sink) (1/1)
> (d24f111f9720c8e4df77f67a$
> 2017-09-15 12:48:03,038 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
> Custom Source -> Timestamps/Watermarks (1/1)
> (bc0e95e951deb6680cff372a954950d2) switched from DEPL$
> 2017-09-15 12:48:03,046 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Map -> Sink:
> Unnamed (1/1) (7ecc9cea7b0132f9604bce99545b49bd) switched from DEPLOYING to
> RUNNING.
> 2017-09-15 12:48:06,474 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:48:07,189 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:48:22,495 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> $flink@giordano-2-2-100-1:6123/]] arriving at
> [akka.tcp://flink@giordano-2-2-100-1:6123] inbound addresses are
> [akka.tcp://flink@localhost:6123]
> 2017-09-15 12:48:52,506 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:48:53,212 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:49:22,524 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:49:23,230 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:49:52,544 ERROR akka.remote.EndpointWriter                      
>              
> - dropping message [class akka.actor.ActorSelectionMessage] for non-local
> recipient [Actor[akka.tcp://flink@$
> 2017-09-15 12:49:59,410 WARN  akka.remote.RemoteWatcher                       
>              
> - Detected unreachable: [akka.tcp://flink@giordano-2-2-100-1:35127]
> 2017-09-15 12:49:59,445 INFO  org.apache.flink.runtime.jobmanager.JobManager  
>              
> - Task manager akka.tcp://flink@giordano-2-2-100-1:35127/user/taskmanager
> terminated.
> 2017-09-15 12:49:59,446 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
> Custom Source -> Timestamps/Watermarks (1/1)
> (bc0e95e951deb6680cff372a954950d2) switched from RUNN$
> java.lang.Exception: TaskManager was lost/killed:
> cf04d1390ff86aba4d1702ef1a0d2b67 @ giordano-2-2-100-1 (dataPort=36806)
> 
> it tag as unreachable the task manager (who reside on the same node...)
> 
> 
> 
> --
> Sent from: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Reply via email to