Default Flink S3 FileSource timeout due to large file listing

2023-09-25 by Eleanore Jin
Hello Flink Community, Flink version: 1.16.1, ZooKeeper for HA. My Flink application reads raw Parquet files hosted in S3, applies transformations, and re-writes them to S3 under a different location. Below is my code to read the Parquet files from S3: ``` final Configuration configuration = new
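The code in the post is truncated; for orientation, a minimal sketch of reading Parquet from S3 with the Flink 1.16 FileSource follows, assuming the files contain Avro GenericRecords. The schema, bucket path, and class name are illustrative, not the poster's code.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.AvroParquetReaders;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParquetFromS3 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical record schema; substitute the real one.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\","
                + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");

        FileSource<GenericRecord> source = FileSource
                .forRecordStreamFormat(
                        AvroParquetReaders.forGenericRecord(schema),
                        new Path("s3://my-bucket/input/")) // illustrative location
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "parquet-s3")
           .map(GenericRecord::toString) // transformations go here
           .print();

        env.execute("read-parquet-from-s3");
    }
}
```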

Re: Container is running beyond physical memory limits

2021-02-20 by Eleanore Jin
Hi, here is an analysis of OOM kills that I read a while back; not sure whether it helps you: http://www.whitewood.me/2021/01/02/%E8%AF%A6%E8%A7%A3-Flink-%E5%AE%B9%E5%99%A8%E5%8C%96%E7%8E%AF%E5%A2%83%E4%B8%8B%E7%9A%84-OOM-Killed/ On Thu, Feb 18, 2021 at 9:01 AM lian wrote: > Hi all: > 1. Background: using Flink >

Re: Checkpoint dir is not cleaned up after cancel the job with monitoring API

2020-10-02 by Eleanore Jin
On 9/27/2020 7:02 PM, Eleanore Jin wrote: > > I have noticed this: if I add Thread.sleep(1500); after the patch call > returned 202, then the directory gets cleaned up; meanwhile, it > shows the job-manager pod is in the completed state before getting terminated: > see screenshot:

Re: Checkpoint dir is not cleaned up after cancel the job with monitoring API

2020-09-27 by Eleanore Jin
the job? Is there a way to check whether the cancel has completed, so that stopping the TM and JM can happen afterwards? Thanks a lot! Eleanore On Sun, Sep 27, 2020 at 9:37 AM Eleanore Jin wrote: > Hi Congxian, > I am making a REST call to get the checkpoint config: curl -X GET \ > > http://lo

Re: Checkpoint dir is not cleaned up after cancel the job with monitoring API

2020-09-27 by Eleanore Jin
RETAIN_ON_CANCELLATION` is set, then the > checkpoint will be kept when canceling a job. > > PS the image did not show > > [1] > https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints > Best, > Congxian > > > Eleanore Jin wrote on 2020-09-27

Checkpoint dir is not cleaned up after cancel the job with monitoring API

2020-09-26 by Eleanore Jin
Hi experts, I am running Flink 1.10.2 on Kubernetes as a per-job cluster. Checkpointing is enabled, with interval 3s, minimum pause 1s, timeout 10s. I'm using FsStateBackend; snapshots are persisted to Azure Blob Storage (Microsoft's cloud storage service). The checkpointed state is just the source Kafka topic
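For reference, a minimal sketch of the checkpoint settings described above, using the Flink 1.10-era API; the wasbs:// URI is illustrative.

```java
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSettings {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(3_000);                                 // interval: 3 s
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1_000); // minimum pause: 1 s
        env.getCheckpointConfig().setCheckpointTimeout(10_000);         // timeout: 10 s
        // Illustrative URI; point it at the real Azure Blob Storage container.
        env.setStateBackend(new FsStateBackend(
                "wasbs://checkpoints@myaccount.blob.core.windows.net/flink"));
    }
}
```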

Re: Handling sink failures without consuming Kafka messages

2020-08-28 by Eleanore Jin
barrier message and finish processing the data before the barrier. > So when the chk (checkpoint) mechanism commits the offset, it can guarantee the earlier data has already been written to the sink, which is at-least-once. > > Eleanore Jin wrote on Fri, Aug 28, 2020 at 1:17 AM: > > > Thanks everyone for the answers, > > > > I am using APACHE BEAM with Flink as the RUNNER, and this is the KAFKA CONNECTOR provided by Beam. > Looking at the source, > > it does have state checkpointed: Beam

Re: Handling sink failures without consuming Kafka messages

2020-08-27 by Eleanore Jin
ache.org] > Sent: Thursday, Aug 27, 2020 10:06 > To: user-zh > Subject: Re: Handling sink failures without consuming Kafka messages > > Hi Eleanore, shizk233 has already given a very complete explanation. > > As for the follow-up question you raised, I think that understanding is not quite right. > With checkpointing enabled, even though the Kafka producer does not use two-phase commit, it is still guaranteed that all current data is flushed out when a checkpoint succeeds. If the flush fails, that should cause the checkpoint to fail

Re: Handling sink failures without consuming Kafka messages

2020-08-26 by Eleanore Jin
the data can also be rolled back, which guarantees consistency. > > In that case chk alone is at-least-once, and chk+upsert or chk+2PC is exactly-once. > > 3. Kafka auto commit > The state in a chk snapshot is not only the source offsets but also the state of every operator in the dataflow; the chk mechanism commits offsets only once the entire dataflow has completed the same chk > n. > Kafka auto commit cannot guarantee this global consistency, because auto commit is automatic and does not wait for the same chk n to complete across the whole dataflow. > > Eleanore

Re: Handling sink failures without consuming Kafka messages

2020-08-26 by Eleanore Jin
Hi Benchao, could you explain why, if the sink has no two-phase commit, the semantics are at-least-once? For example, if both the source and sink are Kafka and the sink does not do a two-phase commit, then the checkpointed state is just the source offsets; in that case it seems no different from using Kafka auto commit of offsets. Could you explain in more detail? Thanks! Eleanore On Tue, Aug 25, 2020 at 9:59 PM Benchao Li wrote: >
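A sketch of the distinction this thread turns on, using the Kafka producer API of that era; the broker address, topic, and timeout values are illustrative. EXACTLY_ONCE makes the sink participate in a two-phase commit bound to checkpoints, whereas AT_LEAST_ONCE only flushes pending records when a checkpoint completes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SinkSemantics {
    public static FlinkKafkaProducer<String> exactlyOnceSink() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // illustrative
        props.setProperty("transaction.timeout.ms", "60000");     // needed for EXACTLY_ONCE

        return new FlinkKafkaProducer<>(
                "output-topic",
                (KafkaSerializationSchema<String>) (element, timestamp) ->
                        new ProducerRecord<>("output-topic",
                                element.getBytes(StandardCharsets.UTF_8)),
                props,
                // Two-phase commit tied to checkpoints; with AT_LEAST_ONCE the
                // producer is merely flushed on each completed checkpoint.
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
    }
}
```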

Cannot restore a job from a savepoint after changing source or sink operators

2020-08-10 by Eleanore Jin
Hi all, I am using Beam 2.23.0 with Flink runner 1.8.2. I wanted to experiment with enabling checkpointing and Beam KafkaIO EOS (exactly-once semantics), then adding or removing source/sink operators and restoring the job from a savepoint. I run Kafka and a Flink cluster (1 job manager, 1 task manager) on my laptop. Below are the different scenarios I tried: 1. Adding a source topic after the SAVEPOINT. Before the savepoint: read from

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-06 by Eleanore Jin
reporter > [2]. > https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#availability > > Best, > Yang > > Eleanore Jin wrote on Wed, Aug 5, 2020 at 11:52 PM: >> Hi Yang and Till, >> >> Thanks a lot for the help! I have a similar question as Till

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-05 by Eleanore Jin
set the restartPolicy and >>>> backoffLimit, >>>> this is not a clean and correct way to go. We should terminate the >>>> jobmanager process with a zero exit code in such a situation. >>>> >>>> @Till Rohrmann I just have one concern. Is it a

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-04 by Eleanore Jin
spec.template.spec.restartPolicy = "Never" and >>> spec.backoffLimit = 0. >>> Refer here[1] for more information. >>> >>> Then, when the jobmanager fails for any reason, the K8s job will >>> be marked failed, and K8s will not restart the job

Re: Behavior for flink job running on K8S failed after restart strategy exhausted

2020-08-03 by Eleanore Jin
clusterframework/ApplicationStatus.java#L32 > > On Fri, Jul 31, 2020 at 6:26 PM Eleanore Jin > wrote: > >> Hi Experts, >> >> I have a flink cluster (per job mode) running on kubernetes. The job is >> configured with restart strategy >> >> restart-strate

Behavior for flink job running on K8S failed after restart strategy exhausted

2020-07-31 by Eleanore Jin
Hi Experts, I have a Flink cluster (per-job mode) running on Kubernetes. The job is configured with the restart strategy restart-strategy.fixed-delay.attempts: 3 and restart-strategy.fixed-delay.delay: 10 s. So after 3 retries, the job will be marked as FAILED, hence the pods are not running.
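For reference, the same restart strategy can also be set programmatically; a minimal sketch equivalent to the configuration entries quoted above:

```java
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategySetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
                3,                   // restart-strategy.fixed-delay.attempts
                Time.seconds(10)));  // restart-strategy.fixed-delay.delay
    }
}
```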

Re: Flink Training - why cannot keyBy hour?

2020-07-02 by Eleanore Jin
is unbounded, and we can't intervene to clean up > stale state. > > Regards, > David > > On Wed, Jul 1, 2020 at 2:26 AM Eleanore Jin > wrote: > >> Hi experts, >> >> I am going through Ververica flink training, and when doing the lab with >> wind

Flink Training - why cannot keyBy hour?

2020-06-30 by Eleanore Jin
Hi experts, I am going through the Ververica Flink training, and when doing the windows lab (https://training.ververica.com/exercises/windows), it basically asks you to compute, within an hour, which driver earns the most tips. The logic is to 0. keyBy driverId 1. create a 1-hour window based on
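A sketch of the pattern the lab is after, with a minimal stand-in for the training repo's TaxiFare type (the field names are illustrative): key by the bounded driverId space and let the window assigner carve out the hour, rather than keying by hour, whose key space is unbounded.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class HourlyTips {
    /** Minimal stand-in for the training repo's TaxiFare type. */
    public static class TaxiFare {
        public long driverId;
        public float tip;
        public TaxiFare() {} // POJO needs a no-arg constructor
    }

    // fares: a stream with event-time timestamps/watermarks already assigned.
    public static DataStream<TaxiFare> hourlyTips(DataStream<TaxiFare> fares) {
        return fares
                .keyBy(fare -> fare.driverId)                       // bounded key space
                .window(TumblingEventTimeWindows.of(Time.hours(1))) // the hour lives here
                .sum("tip");                                        // tips per driver per hour
    }
}
```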

Re: run flink on edge vs hub

2020-05-18 by Eleanore Jin
S-ARM-support-for-Flink-td30298.html > [2] > https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/datastream_api.html#data-sources > [3] > https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/queryable_state.html > [4] > https://ci.apache.org/projects/

run flink on edge vs hub

2020-05-17 by Eleanore Jin
Hi Community, Currently we run Flink in 'hub' data centers where data is ingested into the platform via Kafka; the Flink job reads from Kafka, does the transformations, and publishes to another Kafka topic. I would also like to see if the same logic (read input message -> do

Re: Broadcast stream causing GC overhead limit exceeded

2020-05-06 by Eleanore Jin
input from the > other side. > Since the error is not about lack of memory, the buffering in Flink state > might not be the problem here. > > Best, Fabian > > > > > > On Sun, May 3, 2020 at 03:39, Eleanore Jin < > eleanore@gmail.com> wrote: > >> Hi

Export user metrics with Flink Prometheus endpoint

2020-05-05 by Eleanore Jin
Hi all, I just wonder whether it is possible to use the Flink metrics endpoint to let Prometheus scrape user-defined metrics? Context: In addition to Flink metrics, we also collect some application-level metrics using OpenCensus, and we run the OpenCensus agent as a sidecar in the Kubernetes pod to collect
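Yes in principle: metrics registered through Flink's own metric system are exposed by whichever reporter is configured, for example the PrometheusReporter. A minimal sketch of a user-defined counter; the function and metric names are illustrative.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

public class CountingMap extends RichMapFunction<String, String> {
    private transient Counter processed;

    @Override
    public void open(Configuration parameters) {
        // Registered under the operator's metric group; surfaced by the
        // configured reporter (e.g. PrometheusReporter) with scope prefixes.
        processed = getRuntimeContext().getMetricGroup().counter("processedRecords");
    }

    @Override
    public String map(String value) {
        processed.inc();
        return value;
    }
}
```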

Broadcast stream causing GC overhead limit exceeded

2020-05-02 by Eleanore Jin
Hi All, I am using Apache Beam with Flink (1.8.2). In my job, I use a Beam side input (which translates into a Flink non-keyed broadcast stream) to filter the data from the main stream. I have continuously experienced OOM: GC overhead limit exceeded. After doing some experiments, I observed

Flink Task Manager GC overhead limit exceeded

2020-04-29 by Eleanore Jin
Hi All, Currently I am running a Flink job cluster (v1.8.2) on Kubernetes with 4 pods, each pod with parallelism 4. The Flink job reads from a source topic with 96 partitions and does a per-element filter; the filter value comes from a broadcast topic, and it always uses the latest message as the
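A sketch of the broadcast-filter pattern described here, using Flink's broadcast state (the names and the matching rule are illustrative, not the poster's code): the main stream is connected to a broadcast control stream, and each broadcast message overwrites the previous filter value.

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastFilter {
    static final MapStateDescriptor<String, String> FILTER_DESC =
            new MapStateDescriptor<>("latestFilter", Types.STRING, Types.STRING);

    public static DataStream<String> apply(DataStream<String> main,
                                           DataStream<String> control) {
        BroadcastStream<String> broadcast = control.broadcast(FILTER_DESC);
        return main.connect(broadcast).process(
                new BroadcastProcessFunction<String, String, String>() {
                    @Override
                    public void processElement(String value, ReadOnlyContext ctx,
                                               Collector<String> out) throws Exception {
                        String filter = ctx.getBroadcastState(FILTER_DESC).get("filter");
                        // Illustrative rule: pass elements matching the latest filter.
                        if (filter == null || value.contains(filter)) {
                            out.collect(value);
                        }
                    }

                    @Override
                    public void processBroadcastElement(String value, Context ctx,
                                                        Collector<String> out) throws Exception {
                        // Always overwrite with the newest broadcast message.
                        ctx.getBroadcastState(FILTER_DESC).put("filter", value);
                    }
                });
    }
}
```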

Re: how to send back result via job manager to client

2020-04-19 by Eleanore Jin
Hi Kurt, thanks, I will look into it and ask again if I have questions. Best Eleanore On Sun, Apr 19, 2020 at 7:18 PM Kurt Young wrote: > You can take a look at this JIRA: https://issues.apache.org/jira/browse/FLINK-14807 > > Best, > Kurt > > > On Mon, Apr 20, 2020 at 7:07 AM Eleanore Jin > wrote: > > Hi, > > I just

how to send back result via job manager to client

2020-04-19 by Eleanore Jin
Hi, I just read a piece about using Flink for OLAP ( https://ververica.cn/developers/olap-engine-performance-optimization-and-application-cases/), and one of its points was: [image: image.png] This optimization stems from a characteristic of OLAP: OLAP sends the final computation results to the client, relayed to the Client via the JobManager. I'd like to ask how this is implemented: forwarding the results to the Client through the JobManager. Thanks! Eleanore

Start flink job from the latest checkpoint programmatically

2020-03-12 by Eleanore Jin
Hi All, My Flink application is set up so that a user can start and stop it. The Flink job runs in a job cluster (the application jar is available to Flink upon startup). Stopping a running application means exiting the program; restarting a stopped job means spinning up a new job cluster
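One way to do this programmatically, as a sketch: assuming retained checkpoints and Flink 1.12+ (for the getExecutionEnvironment(Configuration) overload), scan the checkpoint directory for the newest chk-N subdirectory and hand it to execution.savepoint.path. The directory layout and job-id placeholder are illustrative.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestoreFromLatestCheckpoint {
    public static void main(String[] args) throws Exception {
        // Illustrative layout: <checkpoint-dir>/<job-id>/chk-<n>
        File jobDir = new File("/checkpoints/my-job-id");
        File latest = Arrays
                .stream(jobDir.listFiles(f -> f.getName().startsWith("chk-")))
                .max(Comparator.comparingLong(
                        f -> Long.parseLong(f.getName().substring("chk-".length()))))
                .orElseThrow(() -> new IllegalStateException("no retained checkpoint"));

        Configuration conf = new Configuration();
        conf.setString("execution.savepoint.path", latest.getAbsolutePath());
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
        // build the pipeline and env.execute() as usual ...
    }
}
```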

Re: scaling issue Running Flink on Kubernetes

2020-03-11 by Eleanore Jin
is one of the community experts in this area. >>> >>> Thank you~ >>> >>> Xintong Song >>> >>> >>> >>> On Wed, Mar 11, 2020 at 10:37 AM Eleanore Jin >>> wrote: >>> >>>> Hi Xintong,

Re: scaling issue Running Flink on Kubernetes

2020-03-10 by Eleanore Jin
e.g., the existing TMs might have some files already localized, > or some memory buffers already promoted to the JVM tenured area, while the > new TMs have not. > > Thank you~ > > Xintong Song > > > > On Wed, Mar 11, 2020 at 9:25 AM Eleanore Jin > wrote: > >> Hi E

scaling issue Running Flink on Kubernetes

2020-03-10 by Eleanore Jin
Hi Experts, I have my Flink application running on Kubernetes, initially with 1 Job Manager and 2 Task Managers. We also have a custom operator that watches the CRD; when the CRD's replica count changes, it patches the Flink Job Manager deployment's parallelism and max parallelism according to

Re: Is incremental checkpoints needed?

2020-03-10 by Eleanore Jin
On Tue, Mar 10, 2020 at 5:43 PM Eleanore Jin > wrote: >> Hi All, >> >> I am using Apache Beam to construct the pipeline, and this pipeline is >> running with the Flink Runner. >> >> Both source and sink are Kafka topics, and I have enabled Beam Exactly

Is incremental checkpoints needed?

2020-03-10 by Eleanore Jin
Hi All, I am using Apache Beam to construct the pipeline, and this pipeline is running with the Flink Runner. Both source and sink are Kafka topics, and I have enabled Beam exactly-once semantics. I believe how it works in Beam is: the messages will be cached and not processed by the
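For context on the subject line: incremental checkpoints are a property of the RocksDB state backend, enabled via a constructor flag; whether they are needed depends on state size. A minimal sketch (the checkpoint URI is illustrative):

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpoints {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Second argument enables incremental checkpoints: only RocksDB SST
        // files new since the last checkpoint are uploaded.
        env.setStateBackend(new RocksDBStateBackend("s3://bucket/checkpoints", true));
        env.enableCheckpointing(10_000);
    }
}
```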

How to test flink job recover from checkpoint

2020-03-04 by Eleanore Jin
Hi, I have a Flink application with checkpointing enabled, running locally using MiniCluster. I just wonder whether there is a way to simulate a failure and verify that the Flink job restarts from the checkpoint? Thanks a lot! Eleanore
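A common trick for this (a sketch, not from the thread): inject a map function that throws exactly once. With checkpointing and a restart strategy enabled, the job restarts and restores from the last checkpoint; the static flag survives the restart because the MiniCluster runs everything in one JVM.

```java
import org.apache.flink.api.common.functions.MapFunction;

public class FailOnceMap implements MapFunction<Long, Long> {
    // Static so the flag outlives the operator restart within the MiniCluster JVM.
    private static volatile boolean failed = false;

    @Override
    public Long map(Long value) throws Exception {
        if (!failed && value == 42L) { // illustrative trigger condition
            failed = true;
            throw new RuntimeException("injected failure to test checkpoint recovery");
        }
        return value;
    }
}
```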