I usually allocate resources on the order of 80G or 100G...

guanxianchun <[email protected]> wrote on Wed, Oct 28, 2020 at 5:02 PM:
> Flink version: flink-1.11
> taskmanager memory: 8G
> jobmanager memory: 2G
> akka.ask.timeout:20s
> akka.retry-gate-closed-for: 5000
> client.timeout:600s
>
> After running for a while, the job reports "the remote task manager was lost". The error output is as follows:
> 2020-10-28 00:25:30,608 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 411 for job 031e5f122711786fcc11ee6eb47291fa (2703770 bytes in 336 ms).
> 2020-10-28 00:27:30,273 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 412 (type=CHECKPOINT) @ 1603816050239 for job 031e5f122711786fcc11ee6eb47291fa.
> 2020-10-28 00:27:30,776 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 412 for job 031e5f122711786fcc11ee6eb47291fa (3466688 bytes in 509 ms).
> 2020-10-28 00:29:30,246 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 413 (type=CHECKPOINT) @ 1603816170239 for job 031e5f122711786fcc11ee6eb47291fa.
> 2020-10-28 00:29:30,597 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 413 for job 031e5f122711786fcc11ee6eb47291fa (2752681 bytes in 334 ms).
> 2020-10-28 00:29:47,353 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://[email protected]:13912] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
> 2020-10-28 00:29:47,353 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://[email protected]:31260] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
> 2020-10-28 00:29:47,377 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess -> async wait operator -> Map (1/3) (f84731e57528b326ad15ddc17821d1b8) switched from RUNNING to FAILED on org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@538198b8.
> org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'hadoop01.dev.test.cn/192.168.1.21:7527'. This might indicate that the remote task manager was lost.
>     at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:144) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:81) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.io.network.netty.NettyMessageClientDecoderDelegate.channelInactive(NettyMessageClientDecoderDelegate.java:97) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:331) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_131]
> 2020-10-28 00:29:47,442 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - Calculating tasks to restart to recover the failed task abf129c3bc11e5b145c2f3103110a0b2_0.
> 2020-10-28 00:29:47,443 INFO org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy [] - 19 tasks should be restarted to recover the failed task abf129c3bc11e5b145c2f3103110a0b2_0.
> 2020-10-28 00:29:47,444 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job static_order_gmv_by_paydate (031e5f122711786fcc11ee6eb47291fa) switched from state RUNNING to RESTARTING.
> 2020-10-28 00:29:47,445 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: Unnamed (1/1) (c9d8de20cf8d58d3cd5e9f2dfadd7b70) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,447 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Flat Map -> Timestamps/Watermarks (2/3) (828066cde4cda22eb4756366eafac229) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,447 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Flat Map -> Timestamps/Watermarks (1/3) (ae5e40830a57bbd118db2f8ee86a00ae) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,447 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess -> async wait operator -> Map (2/3) (70eb6b6d5a363910f8fd808024d68b8a) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,447 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess -> async wait operator -> Map (3/3) (a42963633bf0a142c082ec0e424666b3) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Flat Map -> Timestamps/Watermarks (3/3) (591b6fa2ad487cc2fe91cb9ac5a0d19e) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess (1/3) (9a35a07b539502ec2d23ec35d3d507db) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess (2/3) (82734fa6851b2dcd769b34f7d8d1afaa) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess (3/3) (f13b2ef5feba6b65ad276cf87bdf2218) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Co-Flat Map (2/3) (6442e15db194a591c32a821e18198686) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Co-Flat Map (1/3) (6961b6cff72d1c41d8345944d246b433) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Co-Flat Map (3/3) (41edc64886544d8a542b23074c99f614) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(TumblingEventTimeWindows(120000), EventTimeTrigger, ViewAggregateFunction, ViewSumWindowFunction) (1/3) (e9bd1a3fb4f3d0786831a439189e6240) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess (1/3) (057233f7fa678b0a54e5c3d682caab24) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess (2/3) (11cde122ba8a22ef37269c8cd051e079) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,448 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(TumblingEventTimeWindows(120000), EventTimeTrigger, ViewAggregateFunction, ViewSumWindowFunction) (3/3) (40b1bb8ce62b6b2062dc68bd63c2f60a) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,449 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(TumblingEventTimeWindows(120000), EventTimeTrigger, ViewAggregateFunction, ViewSumWindowFunction) (2/3) (88e1242700ba1d5a9cba5c466f51cac2) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,449 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess (3/3) (e67f3c240663d5949872fa5988568e40) switched from RUNNING to CANCELING.
> 2020-10-28 00:29:47,452 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(TumblingEventTimeWindows(120000), EventTimeTrigger, ViewAggregateFunction, ViewSumWindowFunction) (1/3) (e9bd1a3fb4f3d0786831a439189e6240) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:47,452 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution e9bd1a3fb4f3d0786831a439189e6240.
> 2020-10-28 00:29:47,457 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution e9bd1a3fb4f3d0786831a439189e6240.
> 2020-10-28 00:29:47,459 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess (1/3) (057233f7fa678b0a54e5c3d682caab24) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:47,459 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 057233f7fa678b0a54e5c3d682caab24.
> 2020-10-28 00:29:47,460 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 057233f7fa678b0a54e5c3d682caab24.
> 2020-10-28 00:29:47,460 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Co-Flat Map (1/3) (6961b6cff72d1c41d8345944d246b433) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:47,460 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 6961b6cff72d1c41d8345944d246b433.
> 2020-10-28 00:29:47,460 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 6961b6cff72d1c41d8345944d246b433.
> 2020-10-28 00:29:47,461 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess (1/3) (9a35a07b539502ec2d23ec35d3d507db) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:47,461 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 9a35a07b539502ec2d23ec35d3d507db.
> 2020-10-28 00:29:47,461 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 9a35a07b539502ec2d23ec35d3d507db.
> 2020-10-28 00:29:47,517 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Flat Map -> Timestamps/Watermarks (3/3) (591b6fa2ad487cc2fe91cb9ac5a0d19e) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:47,566 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Sink: Unnamed (1/1) (c9d8de20cf8d58d3cd5e9f2dfadd7b70) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:47,567 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess (3/3) (f13b2ef5feba6b65ad276cf87bdf2218) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:47,568 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Co-Flat Map (3/3) (41edc64886544d8a542b23074c99f614) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:47,568 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess (3/3) (e67f3c240663d5949872fa5988568e40) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:47,569 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(TumblingEventTimeWindows(120000), EventTimeTrigger, ViewAggregateFunction, ViewSumWindowFunction) (3/3) (40b1bb8ce62b6b2062dc68bd63c2f60a) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:47,570 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Flat Map -> Timestamps/Watermarks (1/3) (ae5e40830a57bbd118db2f8ee86a00ae) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:47,594 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess -> async wait operator -> Map (3/3) (a42963633bf0a142c082ec0e424666b3) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:50,845 INFO org.apache.flink.yarn.YarnResourceManager [] - Closing TaskExecutor connection container_1591067037248_153639_01_000003 because: Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal
>
> 2020-10-28 00:29:50,846 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Flat Map -> Timestamps/Watermarks (2/3) (828066cde4cda22eb4756366eafac229) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:50,846 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 828066cde4cda22eb4756366eafac229.
> 2020-10-28 00:29:50,846 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 828066cde4cda22eb4756366eafac229.
> 2020-10-28 00:29:50,846 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess -> async wait operator -> Map (2/3) (70eb6b6d5a363910f8fd808024d68b8a) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 70eb6b6d5a363910f8fd808024d68b8a.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 70eb6b6d5a363910f8fd808024d68b8a.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess (2/3) (11cde122ba8a22ef37269c8cd051e079) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 11cde122ba8a22ef37269c8cd051e079.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 11cde122ba8a22ef37269c8cd051e079.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Window(TumblingEventTimeWindows(120000), EventTimeTrigger, ViewAggregateFunction, ViewSumWindowFunction) (2/3) (88e1242700ba1d5a9cba5c466f51cac2) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 88e1242700ba1d5a9cba5c466f51cac2.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 88e1242700ba1d5a9cba5c466f51cac2.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Co-Flat Map (2/3) (6442e15db194a591c32a821e18198686) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 6442e15db194a591c32a821e18198686.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 6442e15db194a591c32a821e18198686.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - KeyedProcess (2/3) (82734fa6851b2dcd769b34f7d8d1afaa) switched from CANCELING to CANCELED.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 82734fa6851b2dcd769b34f7d8d1afaa.
> 2020-10-28 00:29:50,847 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding the results produced by task execution 82734fa6851b2dcd769b34f7d8d1afaa.
> 2020-10-28 00:29:50,850 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job static_order_gmv_by_paydate (031e5f122711786fcc11ee6eb47291fa) switched from state RESTARTING to RUNNING.
> 2020-10-28 00:29:50,851 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore [] - Recovering checkpoints from ZooKeeper.
>
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/
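
A note on the "Exit code is 137" line in the quoted log: by convention, an exit status above 128 means the process was terminated by a signal, with signal number = status - 128. So 137 corresponds to signal 9 (SIGKILL), which agrees with the "Killed by external signal" message. On YARN this is often a container being killed for exceeding its memory allocation, although the log alone does not prove that. Below is a minimal Python sketch of the decoding; `describe_exit_code` is a hypothetical helper written for illustration and is not part of Flink or YARN.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a POSIX-style process exit status (hypothetical helper)."""
    if code == 0:
        return "exited normally"
    if code > 128:
        # Statuses above 128 conventionally encode "killed by signal (code - 128)".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with error status {code}"


# The status reported for the lost TaskManager container in the log above.
print(describe_exit_code(137))
```

If the container really was killed for memory reasons, the NodeManager log on the affected host would normally say so explicitly, so that is the place to confirm it before raising the TaskManager memory.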
