Re: Memory usage UI

2021-07-01 Thread Sudharsan R
Hi Xintong, Thanks very much for the response. Let me check out the new UI on flink 1.12. The reason I asked this question is because our flink cluster on k8s shows a container_working_set_bytes(used by OOMkiller) to be > 3Gb. I assume that the used(heap, non-heap) values on the UI are correct.

Re: [DISCUSS] Better user experience in the WindowAggregate upon Changelog (contains update message)

2021-07-01 Thread 刘建刚
Thanks for the discussion, JING ZHANG. I like the first proposal since it is simple and consistent with dataStream API. It is helpful to add more docs about the special late case in WindowAggregate. Also, I expect the more flexible emit strategies later. Jark Wu 于2021年7月2日周五 上午10:33写道: > Sorry,

Re: How to calculate how long an event stays in flink?

2021-07-01 Thread JING ZHANG
Hi Xiuming, +1 on your idea. BTW, Flink also provides a debug tool to track the latency of records travelling through the system[1]. But you should note the following issue if enable the latency tracking. (1) It's a tool for debugging purposes because enabling latency metrics can significantly

Re: [DISCUSS] Better user experience in the WindowAggregate upon Changelog (contains update message)

2021-07-01 Thread Jark Wu
Sorry, I made a typo above. I mean I prefer proposal (1) that only needs to set `table.exec.emit.allow-lateness` to handle late events. `table.exec.emit.late-fire.delay` can be optional which is 0s by default. `table.exec.state.ttl` will not affect window state anymore, so window state is still

Re: Memory usage UI

2021-07-01 Thread Xintong Song
Hi Sudharsan, The non-heap max is decided by JVM automatically and is not controlled by Flink. Moreover, it doesn't mean Flink will use up to that size of non-heap memory. These metrics are fetched directly from JVM and do not correspond well with Flink's memory configurations, which very often

回复:flink-1.13.1 ddl kafka消费JSON数据 (ObjectNode) jsonNode错误

2021-07-01 Thread kcz
大佬们,帮看一下,为什么那里会出现类型转换异常了。 -- 原始邮件 -- 发件人: kcz <573693...@qq.com 发送时间: 2021年7月1日 22:49 收件人: user-zh

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Lu Niu
Another side question, Shall we add metric to cover the complete restarting time (phase 1 + phase 2)? Current metric jm.restartingTime only covers phase 1. Thanks! Best Lu On Thu, Jul 1, 2021 at 12:09 PM Lu Niu wrote: > Thanks TIll and Yang for help! Also Thanks Till for a quick fix! > > I did

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Lu Niu
Thanks TIll and Yang for help! Also Thanks Till for a quick fix! I did another test yesterday. In this test, I intentionally throw exception from the source operator: ``` if (runtimeContext.getIndexOfThisSubtask() == 1 && errorFrenquecyInMin > 0 && System.currentTimeMillis() -

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Lu Niu
Thanks TIll and Yang for help! Also Thanks Till for a quick fix! I did another test yesterday. In this test, I intentionally throw exception from the source operator: ``` if (runtimeContext.getIndexOfThisSubtask() == 1 && errorFrenquecyInMin > 0 && System.currentTimeMillis() -

Memory usage UI

2021-07-01 Thread Sudharsan R
Hi, On my flink setup, I have taskmanager.memory.process.size set to 2536M. I expect all the memory components shown on the UI to add up to this number. However, I don't see this. I have flink managed memory: 811Mb JVM heap max: 886Mb JVM non-heap max: 744Mb Direct memory: 204Mb This adds

Using Flink's Kubernetes API inside Java

2021-07-01 Thread Alexis Sarda-Espinosa
Hello everyone, I'm testing a custom Kubernetes operator that should fulfill some specific requirements I have for Flink. I know of this WIP project: https://github.com/wangyang0918/flink-native-k8s-operator I can see that it uses some classes that aren't publicly documented, and I believe it

Re: Jobmanagers are in a crash loop after upgrade from 1.12.2 to 1.13.1

2021-07-01 Thread Austin Cawley-Edwards
Hi Shilpa, I've confirmed that "recovered" jobs are not compatible between minor versions of Flink (e.g., between 1.12 and 1.13). I believe the issue is that the session cluster was upgraded to 1.13 without first stopping the jobs running on it. If this is the case, the workaround is to stop

退订

2021-07-01 Thread 高耀军
退订 | | 高耀军 | | 18221112...@163.com | 签名由网易邮箱大师定制

flink-1.13.1 ddl kafka????JSON???? (ObjectNode) jsonNode????

2021-07-01 Thread kcz
:1.13.1 : Caused by: java.lang.ClassCastException: org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.TextNode cannot be cast to org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode at

退订

2021-07-01 Thread 张保淇
退订

Re: Recommended way to submit a SQL job via code without getting tied to a Flink version?

2021-07-01 Thread Sonam Mandal
Hi Stephan, Thanks for the detailed explanation! This really helps understand all this better. Appreciate your help! Regards, Sonam From: Stephan Ewen Sent: Wednesday, June 30, 2021 3:56:22 AM To: Sonam Mandal ; user@flink.apache.org Cc:

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Till Rohrmann
A quick addition, I think with FLINK-23202 it should now also be possible to improve the heartbeat mechanism in the general case. We can leverage the unreachability exception thrown if a remote target is no longer reachable to mark an heartbeat target as no longer reachable [1]. This can then be

Re: [DISCUSS] Better user experience in the WindowAggregate upon Changelog (contains update message)

2021-07-01 Thread Jark Wu
Thanks Jing for bringing up this topic, The emit strategy configs are annotated as Experiential and not public on documentations. However, I see this is a very useful feature which many users are looking for. I have posted these configs for many questions like "how to handle late events in SQL".

Re: Jobmanagers are in a crash loop after upgrade from 1.12.2 to 1.13.1

2021-07-01 Thread Shilpa Shankar
Hi Zhu, Does is mean our upgrades are going to fail and the jobs are not backward compatible? I did verify the job itself is built using 1.13.0. Is there a workaround for this? Thanks, Shilpa On Wed, Jun 30, 2021 at 11:14 PM Zhu Zhu wrote: > Hi Shilpa, > > JobType was introduced in 1.13. So

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Yang Wang
Since you are deploying Flink workloads on Yarn, the Flink ResourceManager should get the container completion event after the heartbeat of Yarn NM->Yarn RM->Flink RM, which is 8 seconds by default. And Flink ResourceManager will release the dead TaskManager container once received the completion

Re: Exception in snapshotState suppresses subsequent checkpoints

2021-07-01 Thread tao xiao
Thanks for the pointer. let me try upgrading the flink On Thu, Jul 1, 2021 at 5:29 PM Yun Tang wrote: > Hi Tao, > > I run your program with Flink-1.12.1 and found the problem you described > really existed. And things would go normal if switching to Flink-1.12.2 > version. > > After dig into

Re: Exception in snapshotState suppresses subsequent checkpoints

2021-07-01 Thread Yun Tang
Hi Tao, I run your program with Flink-1.12.1 and found the problem you described really existed. And things would go normal if switching to Flink-1.12.2 version. After dig into the root cause, I found this is caused by a fixed bug [1]: If a legacy source task fails outside of the legacy

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Till Rohrmann
The analysis of Gen is correct. Flink currently uses its heartbeat as the primary means to detect dead TaskManagers. This means that Flink will take at least `heartbeat.timeout` time before the system recovers. Even if the cancellation happens fast (e.g. by having configured a low

Re: Exception in snapshotState suppresses subsequent checkpoints

2021-07-01 Thread Matthias Pohl
Hi Tao, it looks like it should work considering that you have a sleep of 1 second before each emission. I'm going to add Roman to this thread. Maybe, he has sees something obvious which I'm missing. Could you run the job with the log level set to debug and provide the logs once more?

Re: specify number of TM; how stream app use state of batch app; orc / parquet file format have different impact on tpcds performance benchmark.

2021-07-01 Thread Yangze Guo
> 1. how to specify the number of TaskManager? > In batch mode, I tried to use (Max Parallelism / (cores per tm)), but it > does not work. Number of TaskManager is muchlarger than (Max Parallelism / > cores per tm). It not the cores per tm, but the number of slots per tm. Please refer to

回复:退订

2021-07-01 Thread 谢治平
| | 谢治平 | | 邮箱:xiezhiping...@163.com | 在2021年06月03日 19:06,李朋辉 写道: 退订 | | 李朋辉 | | 邮箱:lipengh...@126.com | 签名由 网易邮箱大师 定制

Re: Regarding state access in UDF

2021-07-01 Thread Kai Fu
Hi Ingo, Thank you for your advice, we've not tried it yet, we just thought it may work that way, but now it seems not then. We'll see how it could match our use case with the AggregateFunction interface. On Thu, Jul 1, 2021 at 1:57 PM Ingo Bürk wrote: > Hi Kai, > > CheckpointedFunction is not

How to calculate how long an event stays in flink?

2021-07-01 Thread xm lian
Hello community, I would like to know how long it takes for an event to flow through the whole Flink pipeline, that consumes from Kafka and sinks to Redis. My current idea is, for each event: 1. calculate a start_time in source (timestamp field of [metadata](