Hi, John

Thanks for your update.
From the last 4 lines of the log it seems that a TM was lost. So it is likely that the stopped TM caused the log retrieval to fail.

Best,
Guowei
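To compare the timing of the blob store error with the lost TM connection, something like the following could help. It is only a minimal sketch (Python, standard library only): the name `warn_error_entries` is made up for illustration, the sample lines are copied from the log in this thread, and the regex assumes the `timestamp LEVEL logger - message` layout seen in that log.

```python
import re
from datetime import datetime

# Assumption: entries start with "YYYY-MM-DD HH:MM:SS,mmm LEVEL logger - message",
# as in the JobManager log quoted in this thread.
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(WARN|ERROR)\s+(\S+)\s+-\s+(.*)$"
)

def warn_error_entries(lines):
    """Yield (timestamp, level, logger, message) for WARN/ERROR entries only."""
    for line in lines:
        m = ENTRY.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, m.group(2), m.group(3), m.group(4)

# Sample lines taken from the log in this thread (stack frames are skipped).
sample = [
    "2021-11-02 23:20:22,703 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerLogFileHandler - Unhandled exception.",
    "    at java.lang.Thread.run(Thread.java:748)",
    "2021-11-02 23:47:57,865 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [xxxxxxjob-0001/xxxxxx.72:37007] failed with java.io.IOException: Connection reset by peer",
]

entries = list(warn_error_entries(sample))
for ts, level, logger, msg in entries:
    # Print the short logger name to keep each line readable.
    print(ts.isoformat(sep=" ", timespec="milliseconds"), level, logger.rsplit(".", 1)[-1], msg)

# Time between the first and the second WARN/ERROR entry.
gap = entries[1][0] - entries[0][0]
print("gap:", gap)
```

For the real file you would pass `open(path)` instead of `sample`. The gap between entries gives a rough idea of whether the two events are related.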
On Thu, Nov 4, 2021 at 10:10 PM John Smith <java.dev....@gmail.com> wrote:

> No, I guess it's stable.
>
> 2021-11-02 22:41:08,276 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
> 2021-11-02 22:41:08,292 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneSessionClusterEntrypoint (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @ 19:18:19 CET)
> 2021-11-02 22:41:08,292 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current user: flink
> 2021-11-02 22:41:08,304 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: <no hadoop dependency found>
> 2021-11-02 22:41:08,304 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: OpenJDK 64-Bit Server VM - Private Build - 1.8/25.292-b10
> 2021-11-02 22:41:08,306 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum heap size: 2944 MiBytes
> 2021-11-02 22:41:08,306 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: (not set)
> 2021-11-02 22:41:08,311 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - No Hadoop Dependency available
> 2021-11-02 22:41:08,311 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
> 2021-11-02 22:41:08,313 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xms3072m
> 2021-11-02 22:41:08,313 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xmx3072m
> 2021-11-02 22:41:08,313 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog.file=/opt/flink-1.10.0/log/flink-flink-standalonesession-0-xxxxxxjob-0003.log
> 2021-11-02 22:41:08,313 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:/opt/flink-1.10.0/conf/log4j.properties
> 2021-11-02 22:41:08,313 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlogback.configurationFile=file:/opt/flink-1.10.0/conf/logback.xml
> 2021-11-02 22:41:08,314 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments:
> 2021-11-02 22:41:08,317 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --configDir
> 2021-11-02 22:41:08,318 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - /opt/flink-1.10.0/conf
> 2021-11-02 22:41:08,318 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --executionMode
> 2021-11-02 22:41:08,318 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - cluster
> 2021-11-02 22:41:08,318 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --host
> 2021-11-02 22:41:08,318 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - xxxxxxjob-0003
> 2021-11-02 22:41:08,329 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --webui-port
> 2021-11-02 22:41:08,330 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 8081
> 2021-11-02 22:41:08,330 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Classpath: /opt/flink-1.10.0/lib/flink-table-blink_2.12-1.10.0.jar:/opt/flink-1.10.0/lib/flink-table_2.12-1.10.0.jar:/opt/flink-1.10.0/lib/log4j-1.2.17.jar:/opt/flink-1.10.0/lib/slf4j-log4j12-1.7.15.jar:/opt/flink-1.10.0/lib/flink-dist_2.12-1.10.0.jar:::
> 2021-11-02 22:41:08,330 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
> 2021-11-02 22:41:08,362 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Registered UNIX signal handlers for [TERM, HUP, INT]
> 2021-11-02 22:41:08,558 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.ssh.opts, -l flink -oStrictHostKeyChecking=no
> 2021-11-02 22:41:08,558 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: cluster.evenly-spread-out-slots, true
> 2021-11-02 22:41:08,559 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 3072m
> 2021-11-02 22:41:08,559 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.flink.size, 3072m
> 2021-11-02 22:41:08,559 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.jvm-metaspace.size, 256m
> 2021-11-02 22:41:08,559 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 8
> 2021-11-02 22:41:08,560 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
> 2021-11-02 22:41:08,560 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability, zookeeper
> 2021-11-02 22:41:08,560 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.storageDir, file:///mnt/flink/ha/flink_1_10/
> 2021-11-02 22:41:08,560 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.zookeeper.quorum, xxxxxx-0001.xxxxxx.xxxxxx:2181,xxxxxx-0002.xxxxxx.xxxxxx:2181,xxxxxx-0003.xxxxxx.xxxxxx:2181
> 2021-11-02 22:41:08,561 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.zookeeper.path.root, /flink_1_10
> 2021-11-02 22:41:08,561 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.cluster-id, /flink_1_10_cluster_0001
> 2021-11-02 22:41:08,561 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: web.upload.dir, /mnt/flink/uploads/flink_1_10
> 2021-11-02 22:41:08,562 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, filesystem
> 2021-11-02 22:41:08,562 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.checkpoints.dir, file:///mnt/flink/checkpoints/flink_1_10
> 2021-11-02 22:41:08,562 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.savepoints.dir, file:///mnt/flink/savepoints/flink_1_10
> 2021-11-02 22:41:09,935 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneSessionClusterEntrypoint.
> 2021-11-02 22:41:09,935 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install default filesystem.
> 2021-11-02 22:41:10,405 INFO org.apache.flink.xxxxxx.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
> 2021-11-02 22:41:10,482 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install security context.
> 2021-11-02 22:41:10,516 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
> 2021-11-02 22:41:10,615 INFO org.apache.flink.runtime.security.modules.JaasModule - Jaas file will be created as /tmp/jaas-7770543068119743820.conf.
> 2021-11-02 22:41:10,638 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
> 2021-11-02 22:41:10,639 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Initializing cluster services.
> 2021-11-02 22:41:10,744 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at xxxxxxjob-0003:0
> 2021-11-02 22:41:13,357 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
> 2021-11-02 22:41:13,459 INFO akka.remote.Remoting - Starting remoting
> 2021-11-02 22:41:14,277 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@xxxxxxjob-0003:39977]
> 2021-11-02 22:41:14,868 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Actor system started at akka.tcp://flink@xxxxxxjob-0003:39977
> 2021-11-02 22:41:14,966 INFO org.apache.flink.runtime.blob.FileSystemBlobStore - Creating highly available BLOB storage directory at file:/mnt/flink/ha/flink_1_10/flink_1_10_cluster_0001/blob
> 2021-11-02 22:41:14,990 INFO org.apache.flink.runtime.util.ZooKeeperUtils - Enforcing default ACL for ZK connections
> 2021-11-02 22:41:14,991 INFO org.apache.flink.runtime.util.ZooKeeperUtils - Using '/flink_1_10/flink_1_10_cluster_0001' as Zookeeper namespace.
> 2021-11-02 22:41:15,247 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - Starting
> 2021-11-02 22:41:15,267 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:zookeeper.version=3.4.10-39d3a4f269333c922ed3db283be479f9deacaa0f, built on 03/23/2017 10:13 GMT
> 2021-11-02 22:41:15,267 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:host.name=xxxxxxjob-0003
> 2021-11-02 22:41:15,281 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.version=1.8.0_292
> 2021-11-02 22:41:15,281 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.vendor=Private Build
> 2021-11-02 22:41:15,282 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.home=/usr/lib/jvm/java-8-openjdk-amd64/jre
> 2021-11-02 22:41:15,282 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.class.path=/opt/flink-1.10.0/lib/flink-table-blink_2.12-1.10.0.jar:/opt/flink-1.10.0/lib/flink-table_2.12-1.10.0.jar:/opt/flink-1.10.0/lib/log4j-1.2.17.jar:/opt/flink-1.10.0/lib/slf4j-log4j12-1.7.15.jar:/opt/flink-1.10.0/lib/flink-dist_2.12-1.10.0.jar:::
> 2021-11-02 22:41:15,282 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
> 2021-11-02 22:41:15,282 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp
> 2021-11-02 22:41:15,282 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:java.compiler=<NA>
> 2021-11-02 22:41:15,282 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:os.name=Linux
> 2021-11-02 22:41:15,282 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:os.arch=amd64
> 2021-11-02 22:41:15,282 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:os.version=4.15.0-161-generic
> 2021-11-02 22:41:15,282 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:user.name=flink
> 2021-11-02 22:41:15,282 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:user.home=/home/flink
> 2021-11-02 22:41:15,282 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Client environment:user.dir=/opt/flink-1.10.0
> 2021-11-02 22:41:15,283 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Initiating client connection, connectString=xxxxxx-0001.xxxxxx.xxxxxx:2181,xxxxxx-0002.xxxxxx.xxxxxx:2181,xxxxxx-0003.xxxxxx.xxxxxx:2181 sessionTimeout=60000 watcher=org.apache.flink.shaded.curator.org.apache.curator.ConnectionState@27216cd
> 2021-11-02 22:41:15,377 WARN org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-7770543068119743820.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
> 2021-11-02 22:41:15,379 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server xxxxxx.35/xxxxxx.35:2181
> 2021-11-02 22:41:15,386 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
> 2021-11-02 22:41:15,396 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Socket connection established to xxxxxx.35/xxxxxx.35:2181, initiating session
> 2021-11-02 22:41:15,421 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Session establishment complete on server xxxxxx.35/xxxxxx.35:2181, sessionid = 0x200000086a20007, negotiated timeout = 40000
> 2021-11-02 22:41:15,425 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
> 2021-11-02 22:41:15,438 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-9cb73f27-11db-4c42-a3fc-9b77f558e722
> 2021-11-02 22:41:15,451 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:34845 - max concurrent requests: 50 - max backlog: 1000
> 2021-11-02 22:41:15,496 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
> 2021-11-02 22:41:15,509 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at xxxxxxjob-0003:0
> 2021-11-02 22:41:15,624 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
> 2021-11-02 22:41:15,654 INFO akka.remote.Remoting - Starting remoting
> 2021-11-02 22:41:15,700 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink-metrics@xxxxxxjob-0003:38997]
> 2021-11-02 22:41:15,733 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Actor system started at akka.tcp://flink-metrics@xxxxxxjob-0003:38997
> 2021-11-02 22:41:15,755 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.metrics.dump.MetricQueryService at akka://flink-metrics/user/MetricQueryService .
> 2021-11-02 22:41:16,379 INFO org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore - Initializing FileArchivedExecutionGraphStore: Storage directory /tmp/executionGraphStore-40cf7548-25fc-4b2b-a6a8-d504eb611847, expiration time 3600000, maximum cache size 52428800 bytes.
> 2021-11-02 22:41:16,526 INFO org.apache.flink.configuration.Configuration - Config uses fallback configuration key 'jobmanager.rpc.address' instead of key 'rest.address'
> 2021-11-02 22:41:16,526 INFO org.apache.flink.configuration.Configuration - Config uses fallback configuration key 'rest.port' instead of key 'rest.bind-port'
> 2021-11-02 22:41:16,536 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Upload directory /mnt/flink/uploads/flink_1_10/flink-web-upload does not exist.
> 2021-11-02 22:41:16,558 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Created directory /mnt/flink/uploads/flink_1_10/flink-web-upload for file uploads.
> 2021-11-02 22:41:16,563 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Starting rest endpoint.
> 2021-11-02 22:41:17,262 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component log file: /opt/flink-1.10.0/log/flink-flink-standalonesession-0-xxxxxxjob-0003.log
> 2021-11-02 22:41:17,263 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component stdout file: /opt/flink-1.10.0/log/flink-flink-standalonesession-0-xxxxxxjob-0003.out
> 2021-11-02 22:41:18,135 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest endpoint listening at xxxxxxjob-0003:8081
> 2021-11-02 22:41:18,145 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
> 2021-11-02 22:41:18,303 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web frontend listening at http://xxxxxxjob-0003:8081.
> 2021-11-02 22:41:18,385 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
> 2021-11-02 22:41:18,430 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
> 2021-11-02 22:41:18,431 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2021-11-02 22:41:18,431 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
> 2021-11-02 22:41:18,437 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
> 2021-11-02 23:20:22,682 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerLogFileHandler - Failed to transfer file from TaskExecutor 7e1b7db5918004e4160fdecec1bbdad7.
> java.util.concurrent.CompletionException: org.apache.flink.util.FlinkException: Could not retrieve file from transient blob store.
>     at org.apache.flink.runtime.rest.handler.taskmanager.AbstractTaskManagerFileHandler.lambda$respondToRequest$0(AbstractTaskManagerFileHandler.java:135)
>     at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
>     at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve file from transient blob store.
>     ... 10 more
> Caused by: java.io.FileNotFoundException: Local file /tmp/blobStore-9cb73f27-11db-4c42-a3fc-9b77f558e722/no_job/blob_t-274d3c2d5acd78ced877d898b1877b10b62a64df-590b54325d599a6782a77413691e0a7b does not exist and failed to copy from blob store.
>     at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:516)
>     at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:444)
>     at org.apache.flink.runtime.blob.BlobServer.getFile(BlobServer.java:369)
>     at org.apache.flink.runtime.rest.handler.taskmanager.AbstractTaskManagerFileHandler.lambda$respondToRequest$0(AbstractTaskManagerFileHandler.java:133)
>     ... 9 more
> 2021-11-02 23:20:22,703 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerLogFileHandler - Unhandled exception.
> org.apache.flink.util.FlinkException: Could not retrieve file from transient blob store.
>     at org.apache.flink.runtime.rest.handler.taskmanager.AbstractTaskManagerFileHandler.lambda$respondToRequest$0(AbstractTaskManagerFileHandler.java:135)
>     at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
>     at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: Local file /tmp/blobStore-9cb73f27-11db-4c42-a3fc-9b77f558e722/no_job/blob_t-274d3c2d5acd78ced877d898b1877b10b62a64df-590b54325d599a6782a77413691e0a7b does not exist and failed to copy from blob store.
>     at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:516)
>     at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:444)
>     at org.apache.flink.runtime.blob.BlobServer.getFile(BlobServer.java:369)
>     at org.apache.flink.runtime.rest.handler.taskmanager.AbstractTaskManagerFileHandler.lambda$respondToRequest$0(AbstractTaskManagerFileHandler.java:133)
>     ... 9 more
> 2021-11-02 23:47:57,865 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [xxxxxxjob-0001/xxxxxx.72:37007] failed with java.io.IOException: Connection reset by peer
> 2021-11-02 23:47:57,912 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@xxxxxxjob-0001:37007] has failed, address is now gated for [50] ms. Reason: [Disassociated]
> 2021-11-02 23:53:41,565 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [xxxxxxjob-0001/xxxxxx.72:42961] failed with java.io.IOException: Connection reset by peer
> 2021-11-02 23:53:41,571 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink-metrics@xxxxxxjob-0001:42961] has failed, address is now gated for [50] ms. Reason: [Disassociated]
>
> On Thu, 4 Nov 2021 at 03:45, Guowei Ma <guowei....@gmail.com> wrote:
>
>> >>> Ok I missed the log below. I guess when the task manager was stopped this happened.
>> I think if the TM had stopped you also would not get the log. But in that case it would throw a different exception, "UnknownTaskExecutorException", which would include something like “No TaskExecutor registered under ”.
>>
>> >>> But I guess it's ok and not a big issue???
>> Does this happen continuously?
>>
>> Best,
>> Guowei
>>
>> On Thu, Nov 4, 2021 at 12:39 AM John Smith <java.dev....@gmail.com> wrote:
>>
>>> Ok I missed the log below. I guess when the task manager was stopped this happened.
>>>
>>> I attached the full sequence. But I guess it's ok and not a big issue???
>>>
>>>
>>> 2021-11-02 23:20:22,682 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerLogFileHandler - Failed to transfer file from TaskExecutor 7e1b7db5918004e4160fdecec1bbdad7.
>>> java.util.concurrent.CompletionException: org.apache.flink.util.FlinkException: Could not retrieve file from transient blob store.
>>>     at org.apache.flink.runtime.rest.handler.taskmanager.AbstractTaskManagerFileHandler.lambda$respondToRequest$0(AbstractTaskManagerFileHandler.java:135)
>>>     at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
>>>     at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
>>>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
>>>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.flink.util.FlinkException: Could not retrieve file from transient blob store.
>>>     ... 10 more
>>> Caused by: java.io.FileNotFoundException: Local file /tmp/blobStore-9cb73f27-11db-4c42-a3fc-9b77f558e722/no_job/blob_t-274d3c2d5acd78ced877d898b1877b10b62a64df-590b54325d599a6782a77413691e0a7b does not exist and failed to copy from blob store.
>>>     at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:516)
>>>     at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:444)
>>>     at org.apache.flink.runtime.blob.BlobServer.getFile(BlobServer.java:369)
>>>     at org.apache.flink.runtime.rest.handler.taskmanager.AbstractTaskManagerFileHandler.lambda$respondToRequest$0(AbstractTaskManagerFileHandler.java:133)
>>>     ... 9 more
>>> 2021-11-02 23:20:22,703 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerLogFileHandler - Unhandled exception.
>>> org.apache.flink.util.FlinkException: Could not retrieve file from transient blob store.
>>>     at org.apache.flink.runtime.rest.handler.taskmanager.AbstractTaskManagerFileHandler.lambda$respondToRequest$0(AbstractTaskManagerFileHandler.java:135)
>>>     at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
>>>     at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
>>>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
>>>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: java.io.FileNotFoundException: Local file /tmp/blobStore-9cb73f27-11db-4c42-a3fc-9b77f558e722/no_job/blob_t-274d3c2d5acd78ced877d898b1877b10b62a64df-590b54325d599a6782a77413691e0a7b does not exist and failed to copy from blob store.
>>>     at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:516)
>>>     at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:444)
>>>     at org.apache.flink.runtime.blob.BlobServer.getFile(BlobServer.java:369)
>>>     at org.apache.flink.runtime.rest.handler.taskmanager.AbstractTaskManagerFileHandler.lambda$respondToRequest$0(AbstractTaskManagerFileHandler.java:133)
>>>     ... 9 more
>>>
>>> On Wed, 3 Nov 2021 at 02:48, Guowei Ma <guowei....@gmail.com> wrote:
>>>
>>>> Hi, Smith
>>>>
>>>> It seems that the log file (blob_t-274d3c2d5acd78ced877d898b1877b10b62a64df-590b54325d599a6782a77413691e0a7b) was deleted for some reason. But AFAIK nobody else has reported this exception. (Maybe others know what happened.)
>>>> 1. I think if you refresh the page you will see the correct result, because this triggers another file retrieval from the TM.
>>>> 2. It might also be safer to set a dedicated blob directory (other than /tmp) via `blob.storage.directory` [1].
>>>>
>>>> [1] https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#blob-storage-directory
>>>>
>>>> Best,
>>>> Guowei
>>>>
>>>> On Wed, Nov 3, 2021 at 7:50 AM John Smith <java.dev....@gmail.com> wrote:
>>>>
>>>>> Hi, I'm running Flink 1.10.0 with 3 zookeepers, 3 job nodes and 3 task nodes, and I saw this exception in the job node logs...
>>>>> 2021-11-02 23:20:22,703 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerLogFileHandler - Unhandled exception.
>>>>> org.apache.flink.util.FlinkException: Could not retrieve file from transient blob store.
>>>>>     at org.apache.flink.runtime.rest.handler.taskmanager.AbstractTaskManagerFileHandler.lambda$respondToRequest$0(AbstractTaskManagerFileHandler.java:135)
>>>>>     at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
>>>>>     at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646)
>>>>>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
>>>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>>>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
>>>>>     at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
>>>>>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
>>>>>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>> Caused by: java.io.FileNotFoundException: Local file /tmp/blobStore-9cb73f27-11db-4c42-a3fc-9b77f558e722/no_job/blob_t-274d3c2d5acd78ced877d898b1877b10b62a64df-590b54325d599a6782a77413691e0a7b does not exist and failed to copy from blob store.
>>>>>     at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:516)
>>>>>     at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:444)
>>>>>     at org.apache.flink.runtime.blob.BlobServer.getFile(BlobServer.java:369)
>>>>>     at org.apache.flink.runtime.rest.handler.taskmanager.AbstractTaskManagerFileHandler.lambda$respondToRequest$0(AbstractTaskManagerFileHandler.java:133)
>>>>>     ... 9 more
>>>>>
>>>>
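
On the suggestion earlier in the thread to set `blob.storage.directory` to something other than /tmp: the startup log shows the BLOB server using `/tmp/blobStore-...`, and the loaded properties do not include that key. A quick way to check a config for this might look like the sketch below. It assumes the flat `key: value` format of flink-conf.yaml; the helper names `read_flat_conf` and `blob_dir_warning` and the path `/mnt/flink/blob` are hypothetical, chosen only for illustration.

```python
from pathlib import PurePosixPath

def read_flat_conf(text):
    """Parse simple 'key: value' lines; blank lines and '#' comments are ignored."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        # Split on the first ':' only, so values like file:///... stay intact.
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf

def blob_dir_warning(conf):
    """Return a warning string if the blob directory is unset or under /tmp, else None."""
    value = conf.get("blob.storage.directory")
    if value is None:
        return "blob.storage.directory is not set"
    if PurePosixPath(value).parts[:2] == ("/", "tmp"):
        return f"blob.storage.directory is under /tmp: {value}"
    return None

# The first two keys appear in the log in this thread; the blob path is hypothetical.
sample_conf = """\
jobmanager.heap.size: 3072m
high-availability.storageDir: file:///mnt/flink/ha/flink_1_10/
blob.storage.directory: /mnt/flink/blob   # hypothetical path
"""

conf = read_flat_conf(sample_conf)
print(blob_dir_warning(conf) or "blob.storage.directory looks fine")
```

This only inspects the config text; it does not tell you why the file disappeared.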