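I ran into the same problem: the job could only be submitted successfully after I started the taskmanager-query-state-service.yaml service. Also, I am testing on a locally installed k8s cluster. If this were a GC problem, whether or not the TM service is started should make no difference.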


--

Best,
Matt Wang


On 07/27/2020 15:01,Yang Wang<[email protected]> wrote:
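I suggest first setting heartbeat.timeout to a larger value and then enabling the GC log
to see whether full GCs happen frequently and how long each one lasts. From the logs you have provided so far, even the in-process JM->RM heartbeat times out,
so I still suspect this is GC-related.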

env.java.opts.jobmanager: -Xloggc:<LOG_DIR>/jobmanager-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=2 -XX:GCLogFileSize=512M
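
Once the GC log is being written, a short script can summarise how often full GCs occur and how long each pause lasts. The following is a minimal Python sketch and not something from this thread. It assumes the plain JDK 8 -Xloggc line layout ("<uptime>: [Full GC (<cause>) ..., <seconds> secs]"); the function name summarize_full_gc is made up for illustration, and the pattern would need adjusting if other GC flags (for example date stamps) change the format.

```python
import re

# Assumed JDK 8 layout, e.g.:
# "12.345: [Full GC (Ergonomics)  900K->800K(1024K), 1.2345678 secs]"
FULL_GC = re.compile(r"^(?P<uptime>\d+\.\d+): \[Full GC.*?(?P<pause>\d+\.\d+) secs\]")


def summarize_full_gc(lines):
    """Return (count, total_pause_seconds, longest_pause_seconds) for full GCs."""
    pauses = []
    for line in lines:
        match = FULL_GC.search(line)
        if match:
            # The first "<number> secs]" after "Full GC" is the pause of that collection.
            pauses.append(float(match.group("pause")))
    if not pauses:
        return 0, 0.0, 0.0
    return len(pauses), sum(pauses), max(pauses)
```

It could be called as summarize_full_gc(open(path)), where path points at the jobmanager-gc.log file configured above. A long total or a single pause close to the heartbeat timeout would support the GC suspicion.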


Best,
Yang

On Mon, Jul 27, 2020 at 1:50 PM, SmileSmile <[email protected]> wrote:

Hi Yang Wang,

The log was too long, so I removed some repeated content.
I initially suspected a JM GC problem, but increasing the JM memory to 10g made no difference.

Best



a511955993
Email: [email protected]

On 07/27/2020 11:36, Yang Wang wrote:
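Looking at your job, the root cause of the failure is not "No hostname could be resolved".
The cause of that WARNING can be discussed separately (if it does not appear in 1.10).
If you start a Standalone cluster locally, you will see the same WARNING, and it does not affect normal use.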


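The failure is caused by the slot request timing out after 5 minutes. In the log you provided, the period from 2020-07-23 13:55:45,519 to 2020-07-23
13:58:18,037 is blank. Nothing was omitted there, right?
During this period the tasks should have started to deploy. The log shows a JM->RM heartbeat timeout, which means that communication within the same process in the same Pod also timed out.
So I suspect the JM is stuck in full GC. You will need to confirm this.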
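
Silent periods like the one between 13:55:45,519 and 13:58:18,037 can also be located mechanically. The following is a minimal Python sketch and not part of the original exchange. It assumes every log entry starts with the "yyyy-MM-dd HH:mm:ss,SSS" prefix seen in the JM log; the name find_gaps and the 60-second threshold are arbitrary choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Timestamp prefix used by the JM log, e.g. "2020-07-23 13:55:45,519".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def find_gaps(lines, threshold=timedelta(seconds=60)):
    """Yield (previous, current, gap) for consecutive log timestamps
    that are further apart than `threshold`."""
    previous = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # wrapped continuation lines and stack frames have no timestamp
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if previous is not None and current - previous > threshold:
            yield previous, current, current - previous
        previous = current
```

Running it over the JM log lists every interval in which the process logged nothing, and those intervals can then be compared with the pauses recorded in the GC log.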


Best,
Yang

On Thu, Jul 23, 2020 at 2:43 PM, SmileSmile <[email protected]> wrote:

Hi Yang Wang

Let me first share the versions in my environment.


Kubernetes: 1.17.4. CNI: weave


Items 1, 2, and 3 are my open questions.

Item 4 is the JM log.


1. After removing taskmanager-query-state-service.yaml, it indeed does not work. nslookup output:

kubectl exec -it busybox2 -- /bin/sh
/ # nslookup 10.47.96.2
Server:          10.96.0.10
Address:     10.96.0.10:53

** server can't find 2.96.47.10.in-addr.arpa: NXDOMAIN
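
The name in the NXDOMAIN message is the PTR record that a reverse lookup of 10.47.96.2 queries. The following is a minimal Python sketch, added for illustration and not taken from this thread, that derives that name and attempts the same reverse lookup from inside a Pod. The helper names ptr_name and reverse_lookup are hypothetical.

```python
import ipaddress
import socket


def ptr_name(ip):
    """Return the PTR record name that a reverse lookup for `ip` queries."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_lookup(ip):
    """Return the hostname the resolver reports for `ip`, or None if the lookup fails."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return None
```

Here ptr_name("10.47.96.2") gives "2.96.47.10.in-addr.arpa", the same name the DNS server reports it cannot find. Calling reverse_lookup on a TaskManager IP with and without the service deployed is one way to check whether the service is what makes the PTR record resolvable.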



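2. Flink 1.11 vs. Flink 1.10

The row "detail subtasks taskmanagers xxx x":

In 1.11 it became 172-20-0-50. In 1.10 it is flink-taskmanager-7b5d6958b6-sfzlk:36459. What changed here? (This cluster currently runs both 1.10 and 1.11, and 1.10 works fine. If coredns had a problem, Flink 1.10 should show the same behavior, shouldn't it?)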

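3. Does coredns need any special configuration?

Resolving domain names inside the container works fine. Only reverse resolution fails, and only when there is no service. Is there anything that needs to be configured in coredns?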


4. The JM log at the time of the timeout is as follows:



2020-07-23 13:53:00,228 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
ResourceManager akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_0
was granted leadership with fencing token
00000000000000000000000000000000
2020-07-23 13:53:00,232 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] -
Starting
RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher
at akka://flink/user/rpc/dispatcher_1 .
2020-07-23 13:53:00,233 INFO
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl []
-
Starting the SlotManager.
2020-07-23 13:53:03,472 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering TaskManager with ResourceID 1f9ae0cd95a28943a73be26323588696
(akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at
ResourceManager
2020-07-23 13:53:03,777 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering TaskManager with ResourceID cac09e751264e61615329c20713a84b4
(akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at
ResourceManager
2020-07-23 13:53:03,787 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering TaskManager with ResourceID 93c72d01d09f9ae427c5fc980ed4c1e4
(akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at
ResourceManager
2020-07-23 13:53:04,044 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering TaskManager with ResourceID 8adf2f8e81b77a16d5418a9e252c61e2
(akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at
ResourceManager
2020-07-23 13:53:04,099 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering TaskManager with ResourceID 23e9d2358f6eb76b9ae718d879d4f330
(akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at
ResourceManager
2020-07-23 13:53:04,146 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering TaskManager with ResourceID 092f8dee299e32df13db3111662b61f8
(akka.tcp://[email protected]:6122/user/rpc/taskmanager_0) at
ResourceManager


2020-07-23 13:55:44,220 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] -
Received
JobGraph submission 99a030d0e3f428490a501c0132f27a56 (JobTest).
2020-07-23 13:55:44,222 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] -
Submitting job 99a030d0e3f428490a501c0132f27a56 (JobTest).
2020-07-23 13:55:44,251 INFO
org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] -
Starting
RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at
akka://flink/user/rpc/jobmanager_2 .
2020-07-23 13:55:44,260 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Initializing job JobTest
(99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:44,278 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Using restart back off time strategy
NoRestartBackoffTimeStrategy for JobTest
(99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:44,319 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Running initialization on master for job JobTest
(99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:44,319 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Successfully ran initialization on master in 0 ms.
2020-07-23 13:55:44,428 INFO
org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] -
Built 1 pipelined regions in 25 ms
2020-07-23 13:55:44,437 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Loading state backend via factory
org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory
2020-07-23 13:55:44,456 INFO
org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using
predefined options: DEFAULT.
2020-07-23 13:55:44,457 INFO
org.apache.flink.contrib.streaming.state.RocksDBStateBackend [] - Using
default options factory:
DefaultConfigurableOptionsFactory{configuredOptions={}}.
2020-07-23 13:55:44,466 WARN  org.apache.flink.runtime.util.HadoopUtils
[] - Could not find Hadoop configuration via any of the
supported methods (Flink configuration, environment variables).
2020-07-23 13:55:45,276 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Using failover strategy

org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@72bd8533
for JobTest (99a030d0e3f428490a501c0132f27a56).
2020-07-23 13:55:45,280 INFO
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      [] -
JobManager runner for job JobTest (99a030d0e3f428490a501c0132f27a56) was
granted leadership with session id 00000000-0000-0000-0000-000000000000
at
akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2.
2020-07-23 13:55:45,286 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Starting scheduling with scheduling strategy
[org.apache.flink.runtime.scheduler.strategy.EagerSchedulingStrategy]



2020-07-23 13:55:45,436 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}]
2020-07-23 13:55:45,436 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{4ad15f417716c9e07fca383990c0f52a}]
2020-07-23 13:55:45,436 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}]
2020-07-23 13:55:45,437 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{e559485ea7b0b7e17367816882538d90}]
2020-07-23 13:55:45,437 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{7be8f6c1aedb27b04e7feae68078685c}]
2020-07-23 13:55:45,437 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{582a86197884206652dff3aea2306bb3}]
2020-07-23 13:55:45,437 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{0cc24260eda3af299a0b321feefaf2cb}]
2020-07-23 13:55:45,437 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{240ca6f3d3b5ece6a98243ec8cadf616}]
2020-07-23 13:55:45,438 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{c35033d598a517acc108424bb9f809fb}]
2020-07-23 13:55:45,438 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{ad35013c3b532d4b4df1be62395ae0cf}]
2020-07-23 13:55:45,438 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Cannot
serve slot request, no ResourceManager connected. Adding as pending
request
[SlotRequestId{c929bd5e8daf432d01fad1ece3daec1a}]
2020-07-23 13:55:45,487 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Connecting to ResourceManager
akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
2020-07-23 13:55:45,492 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Resolved ResourceManager address, beginning
registration
2020-07-23 13:55:45,493 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering job manager 00000000000000000000000000000000
@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
99a030d0e3f428490a501c0132f27a56.
2020-07-23 13:55:45,499 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registered job manager 00000000000000000000000000000000
@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
99a030d0e3f428490a501c0132f27a56.
2020-07-23 13:55:45,501 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - JobManager successfully registered at
ResourceManager,
leader id: 00000000000000000000000000000000.
2020-07-23 13:55:45,501 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,502 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Request slot with profile ResourceProfile{UNKNOWN} for job
99a030d0e3f428490a501c0132f27a56 with allocation id
d420d08bf2654d9ea76955c70db18b69.
2020-07-23 13:55:45,502 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{e7e422409acebdb385014a9634af6a90}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,503 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{14ac08438e79c8db8d25d93b99d62725}] and
profile ResourceProfile{UNKNOWN} from resource manager.

2020-07-23 13:55:45,514 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Request slot with profile ResourceProfile{UNKNOWN} for job
99a030d0e3f428490a501c0132f27a56 with allocation id
fce526bbe3e1be91caa3e4b536b20e35.
2020-07-23 13:55:45,514 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{40c7abbb12514c405323b0569fb21647}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,514 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{a4985a9647b65b30a571258b45c8f2ce}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,515 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{c52a6eb2fa58050e71e7903590019fd1}] and
profile ResourceProfile{UNKNOWN} from resource manager.

2020-07-23 13:55:45,517 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Request slot with profile ResourceProfile{UNKNOWN} for job
99a030d0e3f428490a501c0132f27a56 with allocation id
18ac7ec802ebfcfed8c05ee9324a55a4.

2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Request slot with profile ResourceProfile{UNKNOWN} for job
99a030d0e3f428490a501c0132f27a56 with allocation id
7ec76cbe689eb418b63599e90ade19be.
2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{46d65692a8b5aad11b51f9a74a666a74}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{3670bb4f345eedf941cc18e477ba1e9d}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{4a12467d76b9e3df8bc3412c0be08e14}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{e092b12b96b0a98bbf057e71b9705c23}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{4ad15f417716c9e07fca383990c0f52a}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,518 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{345fdb427a893b7fc3f4f040f93445d2}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,519 INFO
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] -
Requesting new slot [SlotRequestId{e559485ea7b0b7e17367816882538d90}] and
profile ResourceProfile{UNKNOWN} from resource manager.
2020-07-23 13:55:45,519 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Request slot with profile ResourceProfile{UNKNOWN} for job
99a030d0e3f428490a501c0132f27a56 with allocation id
b78837a29b4032924ac25be70ed21a3c.


2020-07-23 13:58:18,037 WARN
org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
hostname could be resolved for the IP address 10.47.96.2, using IP
address
as host name. Local input split assignment (such as for HDFS files) may
be
impacted.
2020-07-23 13:58:22,192 WARN
org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
hostname could be resolved for the IP address 10.34.64.14, using IP
address
as host name. Local input split assignment (such as for HDFS files) may
be
impacted.
2020-07-23 13:58:22,358 WARN
org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
hostname could be resolved for the IP address 10.34.128.9, using IP
address
as host name. Local input split assignment (such as for HDFS files) may
be
impacted.
2020-07-23 13:58:24,562 WARN
org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
hostname could be resolved for the IP address 10.32.160.6, using IP
address
as host name. Local input split assignment (such as for HDFS files) may
be
impacted.
2020-07-23 13:58:25,487 WARN
org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
hostname could be resolved for the IP address 10.38.64.7, using IP
address
as host name. Local input split assignment (such as for HDFS files) may
be
impacted.
2020-07-23 13:58:27,636 WARN
org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
hostname could be resolved for the IP address 10.42.160.6, using IP
address
as host name. Local input split assignment (such as for HDFS files) may
be
impacted.
2020-07-23 13:58:27,767 WARN
org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
hostname could be resolved for the IP address 10.43.64.12, using IP
address
as host name. Local input split assignment (such as for HDFS files) may
be
impacted.
2020-07-23 13:58:29,651 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
The heartbeat of JobManager with id 456a18b6c404cb11a359718e16de1c6b
timed
out.
2020-07-23 13:58:29,651 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Disconnect job manager 00000000000000000000000000000000
@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
99a030d0e3f428490a501c0132f27a56 from the resource manager.
2020-07-23 13:58:29,854 WARN
org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
hostname could be resolved for the IP address 10.39.0.8, using IP address
as host name. Local input split assignment (such as for HDFS files) may
be
impacted.
2020-07-23 13:58:33,623 WARN
org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
hostname could be resolved for the IP address 10.35.0.10, using IP
address
as host name. Local input split assignment (such as for HDFS files) may
be
impacted.
2020-07-23 13:58:35,756 WARN
org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
hostname could be resolved for the IP address 10.36.32.8, using IP
address
as host name. Local input split assignment (such as for HDFS files) may
be
impacted.
2020-07-23 13:58:36,694 WARN
org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No
hostname could be resolved for the IP address 10.42.128.6, using IP
address
as host name. Local input split assignment (such as for HDFS files) may
be
impacted.


2020-07-23 14:01:17,814 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Close ResourceManager connection
83b1ff14900abfd54418e7fa3efb3f8a: The heartbeat of JobManager with id
456a18b6c404cb11a359718e16de1c6b timed out..
2020-07-23 14:01:17,815 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Connecting to ResourceManager
akka.tcp://flink@flink-jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
2020-07-23 14:01:17,816 INFO
org.apache.flink.runtime.jobmaster.JobMaster
[] - Resolved ResourceManager address, beginning
registration
2020-07-23 14:01:17,816 INFO
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] -
Registering job manager 00000000000000000000000000000000
@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job
99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:17,836 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
Source:
host_relation -> Timestamps/Watermarks -> Map (1/1)
(302ca9640e2d209a543d843f2996ccd2) switched from SCHEDULED to FAILED on
not
deployed.

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate the required slot within slot request timeout. Please
make sure that the cluster has enough resources.
at

org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at

akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.11.1.jar:1.11.1]
at

akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.11.1.jar:1.11.1]
Caused by: java.util.concurrent.CompletionException:
java.util.concurrent.TimeoutException
at

java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
~[?:1.8.0_242]
... 25 more
Caused by: java.util.concurrent.TimeoutException
... 23 more
2020-07-23 14:01:17,848 INFO

org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
[] - Calculating tasks to restart to recover the failed task
cbc357ccb763df2852fee8c4fc7d55f2_0.
2020-07-23 14:01:17,910 INFO

org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
[] - 902 tasks should be restarted to recover the failed task
cbc357ccb763df2852fee8c4fc7d55f2_0.
2020-07-23 14:01:17,913 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job
JobTest (99a030d0e3f428490a501c0132f27a56) switched from state RUNNING to
FAILING.
org.apache.flink.runtime.JobException: Recovery is suppressed by
NoRestartBackoffTimeStrategy
at

org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_242]
at

java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_242]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
~[flink-dist_2.11-1.11.1.jar:1.11.1]
at

org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
... 45 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_242]
... 25 more
Caused by: java.util.concurrent.TimeoutException
... 23 more



2020-07-23 14:01:18,109 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution 1809eb912d69854f2babedeaf879df6a.
2020-07-23 14:01:18,110 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job JobTest (99a030d0e3f428490a501c0132f27a56) switched from state FAILING to FAILED.
org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:192) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:503) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1710) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1086) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:435) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:422) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
at org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:168) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:726) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1120) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_242]
at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1036) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.11.1.jar:1.11.1]
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.11.1.jar:1.11.1]
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:441) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
... 45 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607) ~[?:1.8.0_242]
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591) ~[?:1.8.0_242]
... 25 more
Caused by: java.util.concurrent.TimeoutException
... 23 more
2020-07-23 14:01:18,114 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Stopping checkpoint coordinator for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,117 INFO  org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore [] - Shutting down
2020-07-23 14:01:18,118 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Discarding the results produced by task execution 302ca9640e2d209a543d843f2996ccd2.
2020-07-23 14:01:18,120 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{15fd2a9565c2b080748c1d1592b1cbbc}] timed out.
2020-07-23 14:01:18,120 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{8cd72cc16f0e319d915a9a096a1096d7}] timed out.
2020-07-23 14:01:18,120 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{e7e422409acebdb385014a9634af6a90}] timed out.
2020-07-23 14:01:18,121 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{cef1af73546ca1fc27ca7a3322e9e815}] timed out.
2020-07-23 14:01:18,121 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{108fe0b3086567ad79275eccef2fdaf8}] timed out.
2020-07-23 14:01:18,121 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{265e67985eab7a6dc08024e53bf2708d}] timed out.
2020-07-23 14:01:18,122 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [SlotRequestId{7087497a17c441f1a1d6fefcbc7cd0ea}] timed out.
2020-07-23 14:01:18,122 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Pending slot request [


2020-07-23 14:01:18,151 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,157 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,157 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,157 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 99a030d0e3f428490a501c0132f27a56 reached globally terminal state FAILED.
2020-07-23 14:01:18,162 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56.
2020-07-23 14:01:18,162 INFO  org.apache.flink.runtime.jobmaster.JobMaster [] - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2020-07-23 14:01:18,225 INFO  org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster for job JobTest(99a030d0e3f428490a501c0132f27a56).
2020-07-23 14:01:18,381 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Suspending SlotPool.
2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.jobmaster.JobMaster [] - Close ResourceManager connection 83b1ff14900abfd54418e7fa3efb3f8a: JobManager is shutting down..
2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     [] - Stopping SlotPool.
2020-07-23 14:01:18,382 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job 99a030d0e3f428490a501c0132f27a56 from the resource manager.


On 07/23/2020 13:26, Yang Wang <[email protected]> wrote:
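Glad to hear your problem is solved, but I don't think the root cause is related to adding taskmanager-query-state-service.yaml.
On my side everything works without creating this service, and nslookup {tm_ip_address} reverse-resolves to a hostname correctly.

Note that this check does not resolve a hostname; it is a reverse lookup that starts from the IP address.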
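
To make the forward/reverse distinction concrete, here is a minimal Python sketch of the same check that nslookup {tm_ip_address} performs. The helper names `ptr_name` and `reverse_lookup` are illustrative only and are not part of Flink or Kubernetes; the sketch assumes it is run from a pod inside the cluster so that it uses the cluster DNS.

```python
import ipaddress
import socket


def ptr_name(ip: str) -> str:
    """Return the DNS name that a reverse (PTR) lookup queries for `ip`."""
    return ipaddress.ip_address(ip).reverse_pointer


def reverse_lookup(ip: str):
    """Return the hostname registered for `ip`, or None if none resolves.

    This asks the resolver configured for the current host or pod,
    which inside a Kubernetes pod is normally the cluster DNS.
    """
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        # socket.herror / socket.gaierror are both OSError subclasses
        # and signal that the reverse lookup failed.
        return None
```

Calling `reverse_lookup` with a TM pod IP and getting `None` back corresponds to the NXDOMAIN answer from nslookup; `ptr_name` only shows which `in-addr.arpa` name is being queried.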


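To answer your two questions:
1. It is not required. I verified on my side that the cluster runs jobs fine without creating it. This holds whether the Rest
service is exposed as ClusterIP, NodePort, or LoadBalancer.
2. If taskmanager.bind-host is not configured,
the two JIRAs [Flink-15911][Flink-15154] do not affect the address the TM uses when registering with the RM.

If you want to find the root cause, you would probably need to provide the complete JM/TM logs so they can be analyzed.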


Best,
Yang

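SmileSmile <[email protected]> wrote on Thu, Jul 23, 2020 at 11:30 AM:


Hi Yang Wang

I just tested in the test environment: nslookup fails for the TaskManager but works for the JM. The difference between the two is whether a service exists.

Solution: I added taskmanager-query-state-service.yaml to the cluster (an optional service according to the official docs [1]). After that, the "No
hostname could be resolved for ip address" messages stopped. With NodePort changed to ClusterIP, the job submits successfully and no longer times out, so the problem is solved.


1. Given the above, is this configuration file mandatory?

2. Among the 1.11 changes I found [Flink-15911][Flink-15154],
which support separately configuring the network interface used for local bind/listen and the externally accessible address and port. Is it this change
that makes the JM reverse-resolve the service from the IP reported by the TM?


Best!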


[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html


On 07/23/2020 10:11, Yang Wang <[email protected]> wrote:
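What I mean is: while the Flink job is running, use the command below to start a busybox pod in the cluster,
run nslookup {ip_address} inside it, and see whether the address resolves. If it does not, the problem is most likely
with coredns.

kubectl run -i -t busybox --image=busybox --restart=Never

You also need to confirm that the cluster's coredns pods are healthy; they are usually deployed in the kube-system namespace.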



Best,
Yang


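SmileSmile <[email protected]> wrote on Wed, Jul 22, 2020 at 7:57 PM:


Hi, Yang Wang!

I'm glad to get your reply. It helped a lot and pointed me in the right direction. Let me add some more information, in the hope that you can help me narrow down the root cause.

Where the JM reports "No hostname could be resolved for ip address xxxxx",
the IP in the message is the internal IP that k8s assigned to the Flink pod, not the IP of the host machine. Where might the problem be?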

Best!



On 07/22/2020 18:18, Yang Wang <[email protected]> wrote:
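If your logs keep printing "No hostname could be resolved for the IP
address", the cluster's coredns is probably at fault:
the reverse lookup from IP address to hostname finds nothing. You can start a busybox pod to verify whether this IP really cannot be resolved;
coredns may be the problem.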


Best,
Yang

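Congxian Qiu <[email protected]> wrote on Tue, Jul 21, 2020 at 7:29 PM:

Hi
I'm not sure whether the complete pod logs are visible in a k8s environment, similar to the NM logs on Yarn. If they are, try checking this pod's
complete logs for anything notable.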
Best,
Congxian


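SmileSmile <[email protected]> wrote on Tue, Jul 21, 2020 at 3:19 PM:

Hi, Congxian

Because this is a test environment, HA is not configured. What I see so far is that the JM prints a large number of "no hostname could be
resolved" messages, the JM becomes unreachable, and job submission fails.
Setting the JM memory to 10g gives the same result (jobmanager.memory.process.size: 10240m).

After rolling back to 1.10 in the same environment, the problem does not occur and the errors above are not printed.


Are there other ways to troubleshoot this?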

Best!





On 07/16/2020 13:17, Congxian Qiu wrote:
Hi
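If there are no exceptions and GC looks normal, you could look at the pod's logs; if HA is enabled, also check the zk
logs. I once saw a similar symptom in a Yarn environment that turned out to have a different cause,
which I found by reading the NM logs and the zk logs.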

Best,
Congxian


SmileSmile <[email protected]> wrote on Wed, Jul 15, 2020 at 5:20 PM:

Hi Roc

This symptom does not occur in 1.10.1; it only appears in 1.11. What would be a good way to investigate it?




On 07/15/2020 17:16, Roc Marshal wrote:
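Hi, SmileSmile.
I have run into a similar host resolution problem before. You could investigate it from the angle of the k8s pod/node network mapping.
I hope this helps.

Best regards,
Roc Marshal

At 2020-07-15 17:04:18, "SmileSmile" <[email protected]> wrote: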

Hi

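I'm using Flink 1.11, deployed as a kubernetes session cluster, with 30 TMs of 4 slots each and a
job parallelism of 120. When I submit the job, a large number of "No hostname could be resolved for the IP
address" messages appear, the JM times out, and the job submission fails. The web UI also hangs and becomes unresponsive.

Submitting wordCount with a parallelism of only 1 also prints them: a few "no hostname" log lines appear and then the job
submits normally. Once the parallelism goes up, it times out.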
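
As a side note on the numbers in this report (30 TMs, 4 slots each, parallelism 120), the short Python sketch below works out the slot headroom. It assumes that all operators share the default slot sharing group, so that the job needs as many slots as its parallelism; the function names are illustrative only.

```python
def slots_available(task_managers: int, slots_per_tm: int) -> int:
    """Total slots offered by the registered TaskManagers."""
    return task_managers * slots_per_tm


def spare_slots(task_managers: int, slots_per_tm: int, job_parallelism: int) -> int:
    """Slots left over after the job is scheduled (negative means a shortfall).

    Assumption: default slot sharing, so the job needs `job_parallelism` slots.
    """
    return slots_available(task_managers, slots_per_tm) - job_parallelism


# Setup from this thread: 30 TMs x 4 slots, parallelism 120.
print(spare_slots(30, 4, 120))
```

Under that assumption the setup has no spare slots, so the job can only start once every one of the 30 TMs has registered and offered all of its slots within the slot request timeout.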


Part of the log follows:

2020-07-15 16:58:46,460 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.32.160.7, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-15 16:58:46,460 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.44.224.7, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.
2020-07-15 16:58:46,461 WARN  org.apache.flink.runtime.taskmanager.TaskManagerLocation     [] - No hostname could be resolved for the IP address 10.40.32.9, using IP address as host name. Local input split assignment (such as for HDFS files) may be impacted.

2020-07-15 16:59:10,236 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - The heartbeat of JobManager with id 69a0d460de468888a9f41c770d963c0a timed out.
2020-07-15 16:59:10,236 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Disconnect job manager 00000000000000000000000000000000@akka.tcp://flink@flink-jobmanager:6123/user/rpc/jobmanager_2 for job e1554c737e37ed79688a15c746b6e9ef from the resource manager.
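
To see which pod IPs fail the reverse lookup, a JM log like the one above can be summarized with a small script. The Python sketch below is only one possible approach; `unresolved_ips` and `SAMPLE` are illustrative names, and the pattern is written to tolerate line breaks inserted by a mail client.

```python
import re
from collections import Counter

# Matches the warning text quoted in this thread. \s+ between words lets
# the pattern match even when the log line was re-wrapped.
UNRESOLVED = re.compile(
    r"No\s+hostname\s+could\s+be\s+resolved\s+for\s+the\s+IP\s+address\s+"
    r"(\d{1,3}(?:\.\d{1,3}){3})"
)


def unresolved_ips(log_text: str) -> Counter:
    """Count 'No hostname could be resolved' warnings per IP address."""
    return Counter(UNRESOLVED.findall(log_text))


# Excerpt taken from the log in this thread (first entry still wrapped).
SAMPLE = """
2020-07-15 16:58:46,460 WARN TaskManagerLocation [] - No
hostname could be resolved for the IP address 10.32.160.7,
using IP address as host name.
2020-07-15 16:58:46,460 WARN TaskManagerLocation [] - No hostname could be resolved for the IP address 10.44.224.7, using IP address as host name.
2020-07-15 16:58:46,461 WARN TaskManagerLocation [] - No hostname could be resolved for the IP address 10.40.32.9, using IP address as host name.
"""

for ip, count in sorted(unresolved_ips(SAMPLE).items()):
    print(ip, count)
```

Each IP reported this way can then be checked with nslookup from a busybox pod, as suggested elsewhere in the thread.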


How should I deal with this?


Best!
