Suggestion: expose Flink ROWKIND as metadata

2023-07-14 Post by casel.chen
Could the Flink community consider exposing ROWKIND as a metadata column, similar to offset and partition in the
kafka connector, so that users can work with this ROWKIND metadata directly, for example filtering on specific
ROWKIND values or adding a ROWKIND field to the output data?

Re: flink on k8s job status monitoring question

2023-07-14 Post by casel.chen
You can check the history server.
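
As a minimal sketch, assuming a History Server has been set up (jobmanager.archive.fs.dir on the cluster, historyserver.archive.fs.dir on the History Server) and is reachable at the placeholder address below, the final state of an archived job (FINISHED, FAILED, CANCELED) can still be read from its REST API after the pod has exited:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class HistoryServerCheck {
        public static void main(String[] args) throws Exception {
            // Placeholder address; 8082 is the default History Server port.
            String historyServer = "http://flink-history-server:8082";

            // /jobs/overview lists archived jobs; each entry carries "jid",
            // "name" and the final "state".
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(historyServer + "/jobs/overview"))
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            System.out.println(response.body());
        }
    }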

At 2023-07-14 18:36:42, "阿华田" wrote:
>
>
>Hello everyone, we run flink on K8S in native mode and have hit a problem with real-time job status monitoring:
>once the pod exits, we cannot tell whether the flink job Finished normally or failed abnormally. Any suggestions?
>阿华田
>a15733178...@163.com
>


flink on k8s job status monitoring question

2023-07-14 Post by 阿华田


Hello everyone, we run flink on K8S in native mode and have hit a problem with real-time job status monitoring:
once the pod exits, we cannot tell whether the flink job Finished normally or failed abnormally. Any suggestions?
阿华田
a15733178...@163.com



Re: Re: PartitionNotFoundException restart loop

2023-07-14 Post by Shammon FY
Hi,

I don't think increasing it to 3 minutes is an appropriate approach; it would increase job recovery time. I would suggest instead tracking down why the upstream task takes so long to be deployed and started.

Best,
Shammon FY
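
For reference, a minimal sketch of how the option discussed in this thread could be raised to the 3 minutes mentioned below, for a local experiment only. On a real cluster the option is normally set in the TaskManager's flink-conf.yaml, and as noted above a larger backoff also delays failure detection and job recovery:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RequestBackoffSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Maximum backoff a downstream task waits for the upstream result
            // partition to become available; 180000 ms = 3 minutes.
            conf.setString("taskmanager.network.request-backoff.max", "180000");

            // Only effective for the local/MiniCluster started from this process.
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);

            env.fromElements(1, 2, 3).print();
            env.execute("request-backoff-sketch");
        }
    }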


On Fri, Jul 14, 2023 at 2:25 PM zhan...@eastcom-sw.com <zhan...@eastcom-sw.com> wrote:

> hi, last time, after increasing `taskmanager.network.request-backoff.max` from the default 10s to 30s, the job ran for 5 days and then hit the PartitionNotFoundException restart loop again.
> From the logs, the job is automatically restarted after three consecutive checkpoint timeout failures (the Checkpointed Data Size keeps growing, reaching tens to hundreds of MB even when no data is currently being processed), and one operator keeps failing and restarting.
>
> Below is the failure log for the whole process. Would increasing `taskmanager.network.request-backoff.max` further to 3 minutes avoid the PartitionNotFoundException?
>
> 2023-07-12 11:07:49,490 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint 19177 of job 3b800d54fb6a002be7feadb1a8b6894e expired before completing.
> 2023-07-12 11:07:49,490 WARN  org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to trigger or complete checkpoint 19177 for job 3b800d54fb6a002be7feadb1a8b6894e. (3 consecutive failed attempts so far)
> org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint expired before completing.
> at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2216) [flink-dist-1.17.1.jar:1.17.1]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_77]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_77]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_77]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_77]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_77]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_77]
> at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
> 2023-07-12 11:07:49,490 INFO  org.apache.flink.runtime.checkpoint.CheckpointRequestDecider [] - checkpoint request time in queue: 2280006
> 2023-07-12 11:07:49,492 INFO  org.apache.flink.runtime.jobmaster.JobMaster [] - Trying to recover from a global failure.
> org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold. The latest checkpoint failed due to Checkpoint expired before completing., view the Checkpoint History tab or the Job Manager log to find out why continuous checkpoints failed.
> at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.checkFailureAgainstCounter(CheckpointFailureManager.java:212) ~[flink-dist-1.17.1.jar:1.17.1]
> at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:169) ~[flink-dist-1.17.1.jar:1.17.1]
> at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:122) ~[flink-dist-1.17.1.jar:1.17.1]
> at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2155) ~[flink-dist-1.17.1.jar:1.17.1]
> at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:2134) ~[flink-dist-1.17.1.jar:1.17.1]
> at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$700(CheckpointCoordinator.java:101) ~[flink-dist-1.17.1.jar:1.17.1]
> at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:2216) ~[flink-dist-1.17.1.jar:1.17.1]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_77]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_77]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_77]
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_77]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_77]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_77]
> at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_77]
> 2023-07-12 11:07:49,492 INFO  org.apache.flink.runtime.jobmaster.JobMaster [] - 51 tasks will be restarted to recover from a global failure.
> 2023-07-12 11:07:49,492 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job Phase 2 Data Warehouse Processing (3b800d54fb6a002be7feadb1a8b6894e) switched from state RUNNING to RESTARTING.
>
> 2023-07-12 11:07:50,007 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Group Windowing (5s) (16/16) (ce053119a37c9c5ca107f7386cd9fd8f_d952d2a6aebfb900c453884c57f96b82_15_0) switched from CANCELING to CANCELED.
> 2023-07-12 11:07:50,007 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Clearing