AW: Kafka SQL Connector: dropping events if more partitions then source tasks

2021-03-01 Thread Jan Oelschlegel
Hi Shengkai,

thanks for this hint. I solves the issue having more consumer tasks than kafka 
partitions.

But the case with dropping events while having less consumer tasks than kafka 
partitions is still there. I think it will be okay in version 1.12 [1]

[1] https://issues.apache.org/jira/browse/FLINK-20041

Best,
Jan

Von: Shengkai Fang 
Gesendet: Samstag, 27. Februar 2021 05:03
An: Jan Oelschlegel 
Cc: Benchao Li ; Arvid Heise ; user 
; Timo Walther 
Betreff: Re: Kafka SQL Connector: dropping events if more partitions then 
source tasks

Hi Jan.

Thanks for your reply. Do you set the option `table.exec.source.idle-timeout`  
and `pipeline.auto-watermark-interval` ? If the 
`pipeline.auto-watermark-interval ` is zero, it will not trigger the detection 
of the idle source.

Best,
Shengkai

Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>> 
于2021年2月26日周五 下午11:09写道:
Hi Shengkai,

i’m using Flink 1.11.2. The problem is if I use a parallelism higher than my 
kafka partition count, the watermarks are not increasing and so the windows are 
never ggot fired.

I suspect that then a source task is not marked as idle and thus the watermark 
is not increased. In any case I have observed how with a larger number of 
source tasks no results are produced.

Best,
Jan
Von: Shengkai Fang mailto:fskm...@gmail.com>>
Gesendet: Freitag, 26. Februar 2021 15:32
An: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Cc: Benchao Li mailto:libenc...@apache.org>>; Arvid Heise 
mailto:ar...@apache.org>>; user 
mailto:user@flink.apache.org>>; Timo Walther 
mailto:twal...@apache.org>>
Betreff: Re: Kafka SQL Connector: dropping events if more partitions then 
source tasks

Hi, Jan.

Could you tell us which Flink version you use? As far as I know, the kafka sql 
connector has implemented `SupportWatermarkPushDown` in Flink-1.12. The 
`SupportWatermarkPushDown` pushes the watermark generator into the source and 
emits the minimum watermark among all the partitions. For more details, you can 
refer to the doc for more details[1].

If you use the version before FLINK-1.12,  I think the best approach to solve 
this problem is to increase source tasks.

Best,
Shengkai

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/kafka/#source-per-partition-watermarks

Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>> 
于2021年2月25日周四 下午4:24写道:
Hi Benchao,

i’m observing this behaviour only for the SQL API. With the Datastream API i 
can take more or less source-tasks then kafka partition count. And FLIP-27 
seems to belong to the Datastream API.

The problem is only on the SQL site.


Best,
Jan

Von: Benchao Li mailto:libenc...@apache.org>>
Gesendet: Donnerstag, 25. Februar 2021 00:04
An: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Cc: Arvid Heise mailto:ar...@apache.org>>; user 
mailto:user@flink.apache.org>>; Timo Walther 
mailto:twal...@apache.org>>
Betreff: Re: Kafka SQL Connector: dropping events if more partitions then 
source tasks

Hi Jan,

What you are observing is correct for the current implementation.

Current watermark generation is based on subtask instead of partition. Hence if 
there are
more than on partition in the same subtask, it's very easy to see more data 
dropped.

AFAIK, FLIP-27 could solve this problem, however the Kafka Connector has not 
been
migrated to FLIP-27 for now.


Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>> 
于2021年2月24日周三 下午10:07写道:
Hi Arvid,

thanks for bringing back this topic.

Yes, I’m running on historic data, but as you mentioned that should not be the 
problem, even there is a event-time skew between partitions.

But maybe this issue with the missing watermark pushdown per partition  is the 
important fact:

https://issues.apache.org/jira/browse/FLINK-20041


Best,
Jan

Von: Arvid Heise mailto:ar...@apache.org>>
Gesendet: Mittwoch, 24. Februar 2021 14:10
An: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Cc: user mailto:user@flink.apache.org>>; Timo Walther 
mailto:twal...@apache.org>>
Betreff: Re: Kafka SQL Connector: dropping events if more partitions then 
source tasks

Hi Jan,

Are you running on historic data? Then your partitions might drift apart 
quickly.

However, I still suspect that this is a bug (Watermark should only be from the 
slowest partition). I'm pulling in Timo who should know more.



On Fri, Feb 19, 2021 at 10:50 AM Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>> 
wrote:
If i increase the watermark, the dropped events getting lower. But why is the 
DataStream API Job still running with 12 hours watermark delay?

By the way, I’m using Flink 1.11. It would be nice if someone could give me 
some advice.

Best,
Jan

Von: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Gesendet: Donnerstag, 18. Februar 2021 09:51
An: Jan Oelschl

Re: Kafka SQL Connector: dropping events if more partitions then source tasks

2021-02-26 Thread Shengkai Fang
Hi Jan.

Thanks for your reply. Do you set the option
`table.exec.source.idle-timeout`  and `pipeline.auto-watermark-interval` ?
If the `pipeline.auto-watermark-interval ` is zero, it will not trigger the
detection of the idle source.

Best,
Shengkai

Jan Oelschlegel  于2021年2月26日周五
下午11:09写道:

> Hi Shengkai,
>
>
>
> i’m using Flink 1.11.2. The problem is if I use a parallelism higher than
> my kafka partition count, the watermarks are not increasing and so the
> windows are never ggot fired.
>
>
>
> I suspect that then a source task is not marked as idle and thus the
> watermark is not increased. In any case I have observed how with a larger
> number of source tasks no results are produced.
>
>
>
> Best,
>
> Jan
>
> *Von:* Shengkai Fang 
> *Gesendet:* Freitag, 26. Februar 2021 15:32
> *An:* Jan Oelschlegel 
> *Cc:* Benchao Li ; Arvid Heise ;
> user ; Timo Walther 
> *Betreff:* Re: Kafka SQL Connector: dropping events if more partitions
> then source tasks
>
>
>
> Hi, Jan.
>
>
>
> Could you tell us which Flink version you use? As far as I know, the kafka
> sql connector has implemented `SupportWatermarkPushDown` in Flink-1.12. The
> `SupportWatermarkPushDown` pushes the watermark generator into the source
> and emits the minimum watermark among all the partitions. For more details,
> you can refer to the doc for more details[1].
>
>
>
> If you use the version before FLINK-1.12,  I think the best approach to
> solve this problem is to increase source tasks.
>
>
>
> Best,
>
> Shengkai
>
>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/kafka/#source-per-partition-watermarks
>
>
>
> Jan Oelschlegel  于2021年2月25日周四 下午4:24
> 写道:
>
> Hi Benchao,
>
>
>
> i’m observing this behaviour only for the SQL API. With the Datastream API
> i can take more or less source-tasks then kafka partition count. And
> FLIP-27 seems to belong to the Datastream API.
>
>
>
> The problem is only on the SQL site.
>
>
>
>
>
> Best,
>
> Jan
>
>
>
> *Von:* Benchao Li 
> *Gesendet:* Donnerstag, 25. Februar 2021 00:04
> *An:* Jan Oelschlegel 
> *Cc:* Arvid Heise ; user ; Timo
> Walther 
> *Betreff:* Re: Kafka SQL Connector: dropping events if more partitions
> then source tasks
>
>
>
> Hi Jan,
>
>
>
> What you are observing is correct for the current implementation.
>
>
>
> Current watermark generation is based on subtask instead of partition.
> Hence if there are
>
> more than on partition in the same subtask, it's very easy to see more
> data dropped.
>
>
>
> AFAIK, FLIP-27 could solve this problem, however the Kafka Connector has
> not been
>
> migrated to FLIP-27 for now.
>
>
>
>
>
> Jan Oelschlegel  于2021年2月24日周三 下午10:07
> 写道:
>
> Hi Arvid,
>
>
>
> thanks for bringing back this topic.
>
>
>
> Yes, I’m running on historic data, but as you mentioned that should not be
> the problem, even there is a event-time skew between partitions.
>
>
>
> But maybe this issue with the missing watermark pushdown per partition  is
> the important fact:
>
>
>
> https://issues.apache.org/jira/browse/FLINK-20041
>
>
>
>
>
> Best,
>
> Jan
>
>
>
> *Von:* Arvid Heise 
> *Gesendet:* Mittwoch, 24. Februar 2021 14:10
> *An:* Jan Oelschlegel 
> *Cc:* user ; Timo Walther 
> *Betreff:* Re: Kafka SQL Connector: dropping events if more partitions
> then source tasks
>
>
>
> Hi Jan,
>
>
>
> Are you running on historic data? Then your partitions might drift apart
> quickly.
>
>
>
> However, I still suspect that this is a bug (Watermark should only be from
> the slowest partition). I'm pulling in Timo who should know more.
>
>
>
>
>
>
>
> On Fri, Feb 19, 2021 at 10:50 AM Jan Oelschlegel <
> oelschle...@integration-factory.de> wrote:
>
> If i increase the watermark, the dropped events getting lower. But why is
> the DataStream API Job still running with 12 hours watermark delay?
>
> By the way, I’m using Flink 1.11. It would be nice if someone could give
> me some advice.
>
>
>
> Best,
>
> Jan
>
>
>
> *Von:* Jan Oelschlegel 
> *Gesendet:* Donnerstag, 18. Februar 2021 09:51
> *An:* Jan Oelschlegel ; user <
> user@flink.apache.org>
> *Betreff:* AW: Kafka SQL Connector: dropping events if more partitions
> then source tasks
>
>
>
> By  using the DataStream API with the same business logic I’m getting no
> dropped events.
>
>
>
> *Von:* Jan Oelschlegel 
> *Gesendet:* Mittwoch, 17. Februar 2021 19:18
> 

AW: Kafka SQL Connector: dropping events if more partitions then source tasks

2021-02-26 Thread Jan Oelschlegel
Hi Shengkai,

i’m using Flink 1.11.2. The problem is if I use a parallelism higher than my 
kafka partition count, the watermarks are not increasing and so the windows are 
never ggot fired.

I suspect that then a source task is not marked as idle and thus the watermark 
is not increased. In any case I have observed how with a larger number of 
source tasks no results are produced.

Best,
Jan
Von: Shengkai Fang 
Gesendet: Freitag, 26. Februar 2021 15:32
An: Jan Oelschlegel 
Cc: Benchao Li ; Arvid Heise ; user 
; Timo Walther 
Betreff: Re: Kafka SQL Connector: dropping events if more partitions then 
source tasks

Hi, Jan.

Could you tell us which Flink version you use? As far as I know, the kafka sql 
connector has implemented `SupportWatermarkPushDown` in Flink-1.12. The 
`SupportWatermarkPushDown` pushes the watermark generator into the source and 
emits the minimum watermark among all the partitions. For more details, you can 
refer to the doc for more details[1].

If you use the version before FLINK-1.12,  I think the best approach to solve 
this problem is to increase source tasks.

Best,
Shengkai

[1] 
https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/kafka/#source-per-partition-watermarks

Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>> 
于2021年2月25日周四 下午4:24写道:
Hi Benchao,

i’m observing this behaviour only for the SQL API. With the Datastream API i 
can take more or less source-tasks then kafka partition count. And FLIP-27 
seems to belong to the Datastream API.

The problem is only on the SQL site.


Best,
Jan

Von: Benchao Li mailto:libenc...@apache.org>>
Gesendet: Donnerstag, 25. Februar 2021 00:04
An: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Cc: Arvid Heise mailto:ar...@apache.org>>; user 
mailto:user@flink.apache.org>>; Timo Walther 
mailto:twal...@apache.org>>
Betreff: Re: Kafka SQL Connector: dropping events if more partitions then 
source tasks

Hi Jan,

What you are observing is correct for the current implementation.

Current watermark generation is based on subtask instead of partition. Hence if 
there are
more than on partition in the same subtask, it's very easy to see more data 
dropped.

AFAIK, FLIP-27 could solve this problem, however the Kafka Connector has not 
been
migrated to FLIP-27 for now.


Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>> 
于2021年2月24日周三 下午10:07写道:
Hi Arvid,

thanks for bringing back this topic.

Yes, I’m running on historic data, but as you mentioned that should not be the 
problem, even there is a event-time skew between partitions.

But maybe this issue with the missing watermark pushdown per partition  is the 
important fact:

https://issues.apache.org/jira/browse/FLINK-20041


Best,
Jan

Von: Arvid Heise mailto:ar...@apache.org>>
Gesendet: Mittwoch, 24. Februar 2021 14:10
An: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Cc: user mailto:user@flink.apache.org>>; Timo Walther 
mailto:twal...@apache.org>>
Betreff: Re: Kafka SQL Connector: dropping events if more partitions then 
source tasks

Hi Jan,

Are you running on historic data? Then your partitions might drift apart 
quickly.

However, I still suspect that this is a bug (Watermark should only be from the 
slowest partition). I'm pulling in Timo who should know more.



On Fri, Feb 19, 2021 at 10:50 AM Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>> 
wrote:
If i increase the watermark, the dropped events getting lower. But why is the 
DataStream API Job still running with 12 hours watermark delay?

By the way, I’m using Flink 1.11. It would be nice if someone could give me 
some advice.

Best,
Jan

Von: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Gesendet: Donnerstag, 18. Februar 2021 09:51
An: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>;
 user mailto:user@flink.apache.org>>
Betreff: AW: Kafka SQL Connector: dropping events if more partitions then 
source tasks

By  using the DataStream API with the same business logic I’m getting no 
dropped events.

Von: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Gesendet: Mittwoch, 17. Februar 2021 19:18
An: user mailto:user@flink.apache.org>>
Betreff: Kafka SQL Connector: dropping events if more partitions then source 
tasks

Hi,

i have a question regarding FlinkSQL connector for Kafka. I have 3 Kafka 
partitions and 1 Kafka SQL source connector (Parallelism 1). The data within 
the Kafka parttitons are sorted based on a event-time field, which is also my 
event-time in Flink. My Watermark is generated with a delay of 12 hours

WATERMARK FOR eventtime as eventtime - INTERVAL '12' HOUR


But the problem is that I see dropping events due arriving late in Prometheus.  
But with parallelism of 3  there are no drops.

Do I always have to have as much source-tasks as I have Kafka partitions?



Best,
Jan
HINWEIS: 

Re: Kafka SQL Connector: dropping events if more partitions then source tasks

2021-02-26 Thread Shengkai Fang
Hi, Jan.

Could you tell us which Flink version you use? As far as I know, the kafka
sql connector has implemented `SupportWatermarkPushDown` in Flink-1.12. The
`SupportWatermarkPushDown` pushes the watermark generator into the source
and emits the minimum watermark among all the partitions. For more details,
you can refer to the doc for more details[1].

If you use the version before FLINK-1.12,  I think the best approach to
solve this problem is to increase source tasks.

Best,
Shengkai

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/connectors/table/kafka/#source-per-partition-watermarks

Jan Oelschlegel  于2021年2月25日周四 下午4:24写道:

> Hi Benchao,
>
>
>
> i’m observing this behaviour only for the SQL API. With the Datastream API
> i can take more or less source-tasks then kafka partition count. And
> FLIP-27 seems to belong to the Datastream API.
>
>
>
> The problem is only on the SQL site.
>
>
>
>
>
> Best,
>
> Jan
>
>
>
> *Von:* Benchao Li 
> *Gesendet:* Donnerstag, 25. Februar 2021 00:04
> *An:* Jan Oelschlegel 
> *Cc:* Arvid Heise ; user ; Timo
> Walther 
> *Betreff:* Re: Kafka SQL Connector: dropping events if more partitions
> then source tasks
>
>
>
> Hi Jan,
>
>
>
> What you are observing is correct for the current implementation.
>
>
>
> Current watermark generation is based on subtask instead of partition.
> Hence if there are
>
> more than on partition in the same subtask, it's very easy to see more
> data dropped.
>
>
>
> AFAIK, FLIP-27 could solve this problem, however the Kafka Connector has
> not been
>
> migrated to FLIP-27 for now.
>
>
>
>
>
> Jan Oelschlegel  于2021年2月24日周三 下午10:07
> 写道:
>
> Hi Arvid,
>
>
>
> thanks for bringing back this topic.
>
>
>
> Yes, I’m running on historic data, but as you mentioned that should not be
> the problem, even there is a event-time skew between partitions.
>
>
>
> But maybe this issue with the missing watermark pushdown per partition  is
> the important fact:
>
>
>
> https://issues.apache.org/jira/browse/FLINK-20041
>
>
>
>
>
> Best,
>
> Jan
>
>
>
> *Von:* Arvid Heise 
> *Gesendet:* Mittwoch, 24. Februar 2021 14:10
> *An:* Jan Oelschlegel 
> *Cc:* user ; Timo Walther 
> *Betreff:* Re: Kafka SQL Connector: dropping events if more partitions
> then source tasks
>
>
>
> Hi Jan,
>
>
>
> Are you running on historic data? Then your partitions might drift apart
> quickly.
>
>
>
> However, I still suspect that this is a bug (Watermark should only be from
> the slowest partition). I'm pulling in Timo who should know more.
>
>
>
>
>
>
>
> On Fri, Feb 19, 2021 at 10:50 AM Jan Oelschlegel <
> oelschle...@integration-factory.de> wrote:
>
> If i increase the watermark, the dropped events getting lower. But why is
> the DataStream API Job still running with 12 hours watermark delay?
>
> By the way, I’m using Flink 1.11. It would be nice if someone could give
> me some advice.
>
>
>
> Best,
>
> Jan
>
>
>
> *Von:* Jan Oelschlegel 
> *Gesendet:* Donnerstag, 18. Februar 2021 09:51
> *An:* Jan Oelschlegel ; user <
> user@flink.apache.org>
> *Betreff:* AW: Kafka SQL Connector: dropping events if more partitions
> then source tasks
>
>
>
> By  using the DataStream API with the same business logic I’m getting no
> dropped events.
>
>
>
> *Von:* Jan Oelschlegel 
> *Gesendet:* Mittwoch, 17. Februar 2021 19:18
> *An:* user 
> *Betreff:* Kafka SQL Connector: dropping events if more partitions then
> source tasks
>
>
>
> Hi,
>
>
>
> i have a question regarding FlinkSQL connector for Kafka. I have 3 Kafka
> partitions and 1 Kafka SQL source connector (Parallelism 1). The data
> within the Kafka parttitons are sorted based on a event-time field, which
> is also my event-time in Flink. My Watermark is generated with a delay of
> 12 hours
>
>
>
> WATERMARK FOR eventtime as eventtime - INTERVAL '12' HOUR
>
>
>
>
>
> But the problem is that I see dropping events due arriving late in
> Prometheus.  But with parallelism of 3  there are no drops.
>
>
>
> Do I always have to have as much source-tasks as I have Kafka partitions?
>
>
>
>
>
>
>
> Best,
>
> Jan
>
> HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten
> bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten
> zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten
> haben, bitte ich um Ihre Mitteilung per E-Mail oder unter der oben
> angegebenen Telefonnummer.
>
>

AW: Kafka SQL Connector: dropping events if more partitions then source tasks

2021-02-25 Thread Jan Oelschlegel
Hi Benchao,

i’m observing this behaviour only for the SQL API. With the Datastream API i 
can take more or less source-tasks then kafka partition count. And FLIP-27 
seems to belong to the Datastream API.

The problem is only on the SQL site.


Best,
Jan

Von: Benchao Li 
Gesendet: Donnerstag, 25. Februar 2021 00:04
An: Jan Oelschlegel 
Cc: Arvid Heise ; user ; Timo Walther 

Betreff: Re: Kafka SQL Connector: dropping events if more partitions then 
source tasks

Hi Jan,

What you are observing is correct for the current implementation.

Current watermark generation is based on subtask instead of partition. Hence if 
there are
more than on partition in the same subtask, it's very easy to see more data 
dropped.

AFAIK, FLIP-27 could solve this problem, however the Kafka Connector has not 
been
migrated to FLIP-27 for now.


Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>> 
于2021年2月24日周三 下午10:07写道:
Hi Arvid,

thanks for bringing back this topic.

Yes, I’m running on historic data, but as you mentioned that should not be the 
problem, even there is a event-time skew between partitions.

But maybe this issue with the missing watermark pushdown per partition  is the 
important fact:

https://issues.apache.org/jira/browse/FLINK-20041


Best,
Jan

Von: Arvid Heise mailto:ar...@apache.org>>
Gesendet: Mittwoch, 24. Februar 2021 14:10
An: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Cc: user mailto:user@flink.apache.org>>; Timo Walther 
mailto:twal...@apache.org>>
Betreff: Re: Kafka SQL Connector: dropping events if more partitions then 
source tasks

Hi Jan,

Are you running on historic data? Then your partitions might drift apart 
quickly.

However, I still suspect that this is a bug (Watermark should only be from the 
slowest partition). I'm pulling in Timo who should know more.



On Fri, Feb 19, 2021 at 10:50 AM Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>> 
wrote:
If i increase the watermark, the dropped events getting lower. But why is the 
DataStream API Job still running with 12 hours watermark delay?

By the way, I’m using Flink 1.11. It would be nice if someone could give me 
some advice.

Best,
Jan

Von: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Gesendet: Donnerstag, 18. Februar 2021 09:51
An: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>;
 user mailto:user@flink.apache.org>>
Betreff: AW: Kafka SQL Connector: dropping events if more partitions then 
source tasks

By  using the DataStream API with the same business logic I’m getting no 
dropped events.

Von: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Gesendet: Mittwoch, 17. Februar 2021 19:18
An: user mailto:user@flink.apache.org>>
Betreff: Kafka SQL Connector: dropping events if more partitions then source 
tasks

Hi,

i have a question regarding FlinkSQL connector for Kafka. I have 3 Kafka 
partitions and 1 Kafka SQL source connector (Parallelism 1). The data within 
the Kafka parttitons are sorted based on a event-time field, which is also my 
event-time in Flink. My Watermark is generated with a delay of 12 hours

WATERMARK FOR eventtime as eventtime - INTERVAL '12' HOUR


But the problem is that I see dropping events due arriving late in Prometheus.  
But with parallelism of 3  there are no drops.

Do I always have to have as much source-tasks as I have Kafka partitions?



Best,
Jan
HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten 
bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten 
zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten haben, 
bitte ich um Ihre Mitteilung per E-Mail oder unter der oben angegebenen 
Telefonnummer.
HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten 
bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten 
zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten haben, 
bitte ich um Ihre Mitteilung per E-Mail oder unter der oben angegebenen 
Telefonnummer.
HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten 
bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten 
zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten haben, 
bitte ich um Ihre Mitteilung per E-Mail oder unter der oben angegebenen 
Telefonnummer.


--

Best,
Benchao Li
HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten 
bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten 
zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten haben, 
bitte ich um Ihre Mitteilung per E-Mail oder unter der oben angegebenen 
Telefonnummer.


Re: Kafka SQL Connector: dropping events if more partitions then source tasks

2021-02-24 Thread Benchao Li
Hi Jan,

What you are observing is correct for the current implementation.

Current watermark generation is based on subtask instead of partition.
Hence if there are
more than on partition in the same subtask, it's very easy to see more data
dropped.

AFAIK, FLIP-27 could solve this problem, however the Kafka Connector has
not been
migrated to FLIP-27 for now.


Jan Oelschlegel  于2021年2月24日周三
下午10:07写道:

> Hi Arvid,
>
>
>
> thanks for bringing back this topic.
>
>
>
> Yes, I’m running on historic data, but as you mentioned that should not be
> the problem, even there is a event-time skew between partitions.
>
>
>
> But maybe this issue with the missing watermark pushdown per partition  is
> the important fact:
>
>
>
> https://issues.apache.org/jira/browse/FLINK-20041
>
>
>
>
>
> Best,
>
> Jan
>
>
>
> *Von:* Arvid Heise 
> *Gesendet:* Mittwoch, 24. Februar 2021 14:10
> *An:* Jan Oelschlegel 
> *Cc:* user ; Timo Walther 
> *Betreff:* Re: Kafka SQL Connector: dropping events if more partitions
> then source tasks
>
>
>
> Hi Jan,
>
>
>
> Are you running on historic data? Then your partitions might drift apart
> quickly.
>
>
>
> However, I still suspect that this is a bug (Watermark should only be from
> the slowest partition). I'm pulling in Timo who should know more.
>
>
>
>
>
>
>
> On Fri, Feb 19, 2021 at 10:50 AM Jan Oelschlegel <
> oelschle...@integration-factory.de> wrote:
>
> If i increase the watermark, the dropped events getting lower. But why is
> the DataStream API Job still running with 12 hours watermark delay?
>
> By the way, I’m using Flink 1.11. It would be nice if someone could give
> me some advice.
>
>
>
> Best,
>
> Jan
>
>
>
> *Von:* Jan Oelschlegel 
> *Gesendet:* Donnerstag, 18. Februar 2021 09:51
> *An:* Jan Oelschlegel ; user <
> user@flink.apache.org>
> *Betreff:* AW: Kafka SQL Connector: dropping events if more partitions
> then source tasks
>
>
>
> By  using the DataStream API with the same business logic I’m getting no
> dropped events.
>
>
>
> *Von:* Jan Oelschlegel 
> *Gesendet:* Mittwoch, 17. Februar 2021 19:18
> *An:* user 
> *Betreff:* Kafka SQL Connector: dropping events if more partitions then
> source tasks
>
>
>
> Hi,
>
>
>
> i have a question regarding FlinkSQL connector for Kafka. I have 3 Kafka
> partitions and 1 Kafka SQL source connector (Parallelism 1). The data
> within the Kafka parttitons are sorted based on a event-time field, which
> is also my event-time in Flink. My Watermark is generated with a delay of
> 12 hours
>
>
>
> WATERMARK FOR eventtime as eventtime - INTERVAL '12' HOUR
>
>
>
>
>
> But the problem is that I see dropping events due arriving late in
> Prometheus.  But with parallelism of 3  there are no drops.
>
>
>
> Do I always have to have as much source-tasks as I have Kafka partitions?
>
>
>
>
>
>
>
> Best,
>
> Jan
>
> HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten
> bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten
> zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten
> haben, bitte ich um Ihre Mitteilung per E-Mail oder unter der oben
> angegebenen Telefonnummer.
>
> HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten
> bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten
> zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten
> haben, bitte ich um Ihre Mitteilung per E-Mail oder unter der oben
> angegebenen Telefonnummer.
>
> HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten
> bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten
> zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten
> haben, bitte ich um Ihre Mitteilung per E-Mail oder unter der oben
> angegebenen Telefonnummer.
>


-- 

Best,
Benchao Li


AW: Kafka SQL Connector: dropping events if more partitions then source tasks

2021-02-24 Thread Jan Oelschlegel
Hi Arvid,

thanks for bringing back this topic.

Yes, I’m running on historic data, but as you mentioned that should not be the 
problem, even there is a event-time skew between partitions.

But maybe this issue with the missing watermark pushdown per partition  is the 
important fact:

https://issues.apache.org/jira/browse/FLINK-20041


Best,
Jan

Von: Arvid Heise 
Gesendet: Mittwoch, 24. Februar 2021 14:10
An: Jan Oelschlegel 
Cc: user ; Timo Walther 
Betreff: Re: Kafka SQL Connector: dropping events if more partitions then 
source tasks

Hi Jan,

Are you running on historic data? Then your partitions might drift apart 
quickly.

However, I still suspect that this is a bug (Watermark should only be from the 
slowest partition). I'm pulling in Timo who should know more.



On Fri, Feb 19, 2021 at 10:50 AM Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>> 
wrote:
If i increase the watermark, the dropped events getting lower. But why is the 
DataStream API Job still running with 12 hours watermark delay?

By the way, I’m using Flink 1.11. It would be nice if someone could give me 
some advice.

Best,
Jan

Von: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Gesendet: Donnerstag, 18. Februar 2021 09:51
An: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>;
 user mailto:user@flink.apache.org>>
Betreff: AW: Kafka SQL Connector: dropping events if more partitions then 
source tasks

By  using the DataStream API with the same business logic I’m getting no 
dropped events.

Von: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Gesendet: Mittwoch, 17. Februar 2021 19:18
An: user mailto:user@flink.apache.org>>
Betreff: Kafka SQL Connector: dropping events if more partitions then source 
tasks

Hi,

i have a question regarding FlinkSQL connector for Kafka. I have 3 Kafka 
partitions and 1 Kafka SQL source connector (Parallelism 1). The data within 
the Kafka parttitons are sorted based on a event-time field, which is also my 
event-time in Flink. My Watermark is generated with a delay of 12 hours

WATERMARK FOR eventtime as eventtime - INTERVAL '12' HOUR


But the problem is that I see dropping events due arriving late in Prometheus.  
But with parallelism of 3  there are no drops.

Do I always have to have as much source-tasks as I have Kafka partitions?



Best,
Jan
HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten 
bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten 
zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten haben, 
bitte ich um Ihre Mitteilung per E-Mail oder unter der oben angegebenen 
Telefonnummer.
HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten 
bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten 
zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten haben, 
bitte ich um Ihre Mitteilung per E-Mail oder unter der oben angegebenen 
Telefonnummer.
HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten 
bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten 
zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten haben, 
bitte ich um Ihre Mitteilung per E-Mail oder unter der oben angegebenen 
Telefonnummer.


Re: Kafka SQL Connector: dropping events if more partitions then source tasks

2021-02-24 Thread Arvid Heise
Hi Jan,

Are you running on historic data? Then your partitions might drift apart
quickly.

However, I still suspect that this is a bug (Watermark should only be from
the slowest partition). I'm pulling in Timo who should know more.



On Fri, Feb 19, 2021 at 10:50 AM Jan Oelschlegel <
oelschle...@integration-factory.de> wrote:

> If i increase the watermark, the dropped events getting lower. But why is
> the DataStream API Job still running with 12 hours watermark delay?
>
> By the way, I’m using Flink 1.11. It would be nice if someone could give
> me some advice.
>
>
>
> Best,
>
> Jan
>
>
>
> *Von:* Jan Oelschlegel 
> *Gesendet:* Donnerstag, 18. Februar 2021 09:51
> *An:* Jan Oelschlegel ; user <
> user@flink.apache.org>
> *Betreff:* AW: Kafka SQL Connector: dropping events if more partitions
> then source tasks
>
>
>
> By  using the DataStream API with the same business logic I’m getting no
> dropped events.
>
>
>
> *Von:* Jan Oelschlegel 
> *Gesendet:* Mittwoch, 17. Februar 2021 19:18
> *An:* user 
> *Betreff:* Kafka SQL Connector: dropping events if more partitions then
> source tasks
>
>
>
> Hi,
>
>
>
> i have a question regarding FlinkSQL connector for Kafka. I have 3 Kafka
> partitions and 1 Kafka SQL source connector (Parallelism 1). The data
> within the Kafka parttitons are sorted based on a event-time field, which
> is also my event-time in Flink. My Watermark is generated with a delay of
> 12 hours
>
>
>
> WATERMARK FOR eventtime as eventtime - INTERVAL '12' HOUR
>
>
>
>
>
> But the problem is that I see dropping events due arriving late in
> Prometheus.  But with parallelism of 3  there are no drops.
>
>
>
> Do I always have to have as much source-tasks as I have Kafka partitions?
>
>
>
>
>
>
>
> Best,
>
> Jan
>
> HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten
> bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten
> zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten
> haben, bitte ich um Ihre Mitteilung per E-Mail oder unter der oben
> angegebenen Telefonnummer.
> HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten
> bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten
> zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten
> haben, bitte ich um Ihre Mitteilung per E-Mail oder unter der oben
> angegebenen Telefonnummer.
>


AW: Kafka SQL Connector: dropping events if more partitions then source tasks

2021-02-19 Thread Jan Oelschlegel
If i increase the watermark, the dropped events getting lower. But why is the 
DataStream API Job still running with 12 hours watermark delay?

By the way, I'm using Flink 1.11. It would be nice if someone could give me 
some advice.

Best,
Jan

Von: Jan Oelschlegel 
Gesendet: Donnerstag, 18. Februar 2021 09:51
An: Jan Oelschlegel ; user 

Betreff: AW: Kafka SQL Connector: dropping events if more partitions then 
source tasks

By  using the DataStream API with the same business logic I'm getting no 
dropped events.

Von: Jan Oelschlegel 
mailto:oelschle...@integration-factory.de>>
Gesendet: Mittwoch, 17. Februar 2021 19:18
An: user mailto:user@flink.apache.org>>
Betreff: Kafka SQL Connector: dropping events if more partitions then source 
tasks

Hi,

i have a question regarding FlinkSQL connector for Kafka. I have 3 Kafka 
partitions and 1 Kafka SQL source connector (Parallelism 1). The data within 
the Kafka parttitons are sorted based on a event-time field, which is also my 
event-time in Flink. My Watermark is generated with a delay of 12 hours

WATERMARK FOR eventtime as eventtime - INTERVAL '12' HOUR


But the problem is that I see dropping events due arriving late in Prometheus.  
But with parallelism of 3  there are no drops.

Do I always have to have as much source-tasks as I have Kafka partitions?



Best,
Jan
HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten 
bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten 
zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten haben, 
bitte ich um Ihre Mitteilung per E-Mail oder unter der oben angegebenen 
Telefonnummer.
HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten 
bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten 
zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten haben, 
bitte ich um Ihre Mitteilung per E-Mail oder unter der oben angegebenen 
Telefonnummer.


AW: Kafka SQL Connector: dropping events if more partitions then source tasks

2021-02-18 Thread Jan Oelschlegel
By  using the DataStream API with the same business logic I'm getting no 
dropped events.

Von: Jan Oelschlegel 
Gesendet: Mittwoch, 17. Februar 2021 19:18
An: user 
Betreff: Kafka SQL Connector: dropping events if more partitions then source 
tasks

Hi,

i have a question regarding FlinkSQL connector for Kafka. I have 3 Kafka 
partitions and 1 Kafka SQL source connector (Parallelism 1). The data within 
the Kafka parttitons are sorted based on a event-time field, which is also my 
event-time in Flink. My Watermark is generated with a delay of 12 hours

WATERMARK FOR eventtime as eventtime - INTERVAL '12' HOUR


But the problem is that I see dropping events due arriving late in Prometheus.  
But with parallelism of 3  there are no drops.

Do I always have to have as much source-tasks as I have Kafka partitions?



Best,
Jan
HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten 
bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten 
zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten haben, 
bitte ich um Ihre Mitteilung per E-Mail oder unter der oben angegebenen 
Telefonnummer.
HINWEIS: Dies ist eine vertrauliche Nachricht und nur für den Adressaten 
bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten 
zugänglich zu machen. Sollten Sie diese Nachricht irrtümlich erhalten haben, 
bitte ich um Ihre Mitteilung per E-Mail oder unter der oben angegebenen 
Telefonnummer.


Kafka SQL Connector: dropping events if more partitions then source tasks

2021-02-17 Thread Jan Oelschlegel
Hi,

i have a question regarding FlinkSQL connector for Kafka. I have 3 Kafka 
partitions and 1 Kafka SQL source connector (Parallelism 1). The data within 
the Kafka parttitons are sorted based on a event-time field, which is also my 
event-time in Flink. My Watermark is generated with a delay of 12 hours

WATERMARK FOR eventtime as eventtime - INTERVAL '12' HOUR


But the problem is that I see dropping events due arriving late in Prometheus.  
But with parallelism of 3  there are no drops.

Do I always have to have as much source-tasks as I have Kafka partitions?



Best,
Jan
HINWEIS: Dies ist eine vertrauliche Nachricht und nur f?r den Adressaten 
bestimmt. Es ist nicht erlaubt, diese Nachricht zu kopieren oder Dritten 
zug?nglich zu machen. Sollten Sie diese Nachricht irrt?mlich erhalten haben, 
bitte ich um Ihre Mitteilung per E-Mail oder unter der oben angegebenen 
Telefonnummer.