Re: [ANNOUNCE] Apache Flink CDC 3.1.0 released

2024-05-17 Thread Muhammet Orazov via user

Amazing, congrats!

Thanks for your efforts!

Best,
Muhammet

On 2024-05-17 09:32, Qingsheng Ren wrote:

The Apache Flink community is very happy to announce the release of
Apache Flink CDC 3.1.0.

Apache Flink CDC is a distributed data integration tool for real-time
and batch data. It brings simplicity and elegance to data integration,
using YAML to describe the data movement and transformations in a data
pipeline.

Please check out the release blog post for an overview of the release:
https://flink.apache.org/2024/05/17/apache-flink-cdc-3.1.0-release-announcement/

The release is available for download at:
https://flink.apache.org/downloads.html

Maven artifacts for Flink CDC can be found at:
https://search.maven.org/search?q=g:org.apache.flink%20cdc

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12354387

We would like to thank all contributors of the Apache Flink community
who made this release possible!

Regards,
Qingsheng Ren


Re: Checkpointing

2024-05-08 Thread Muhammet Orazov via user

Hey Jacob,

If you understand how the Kafka offset is managed in a checkpoint,
then you can map that notion to other Flink sources.

I would suggest reading the Data Sources[1] documentation and FLIP-27[5].
Each source defines a `Split`; it is then the `SourceReaderBase`[2]
class's responsibility to maintain the states of unfinished splits
across checkpoints.

A split can then be anything, for example (see the sketch after this
list):
- A filename and an offset into that file[3]
- A Kafka topic partition and its offset[4]
- A JDBC query with a start and end range
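
For the first case, here is a minimal sketch of what such a split could
look like under the FLIP-27 interfaces. The class name `MyFileSplit` is
hypothetical; the real file connector's split[3] carries more fields,
such as host information and length:

    import org.apache.flink.api.connector.source.SourceSplit;

    // Hypothetical split: a file path plus the byte offset consumed so far.
    // If this split ends up in a checkpoint, recovery resumes reading the
    // file at `offset` instead of starting from the beginning.
    public class MyFileSplit implements SourceSplit {
        private final String path;
        private final long offset;

        public MyFileSplit(String path, long offset) {
            this.path = path;
            this.offset = offset;
        }

        @Override
        public String splitId() {
            return path; // must uniquely identify the split
        }

        public String path() { return path; }
        public long offset() { return offset; }
    }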

In general, each source defines such a split so that on failure
recovery the source reader knows where to resume processing data.
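
To make that contract concrete, here is a skeleton of a hand-written
reader for the hypothetical split above (the names are made up; real
connectors usually extend `SourceReaderBase`[2], which implements most
of this for you). The checkpoint-relevant hooks are `snapshotState`
and `addSplits`:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    import org.apache.flink.api.connector.source.ReaderOutput;
    import org.apache.flink.api.connector.source.SourceReader;
    import org.apache.flink.core.io.InputStatus;

    public class MyFileSourceReader implements SourceReader<String, MyFileSplit> {

        // Splits currently assigned to this reader, with up-to-date offsets.
        private final List<MyFileSplit> assignedSplits = new ArrayList<>();
        private boolean noMoreSplits = false;

        @Override
        public void start() {}

        @Override
        public InputStatus pollNext(ReaderOutput<String> output) {
            // A real reader would read the next record from the current split,
            // emit it via output.collect(...), and advance the split's offset.
            // Elided here.
            return noMoreSplits ? InputStatus.END_OF_INPUT
                                : InputStatus.NOTHING_AVAILABLE;
        }

        @Override
        public List<MyFileSplit> snapshotState(long checkpointId) {
            // The checkpoint hook: return the unfinished splits with their
            // current offsets, and Flink stores them in the checkpoint
            // alongside the rest of the job state.
            return new ArrayList<>(assignedSplits);
        }

        @Override
        public void addSplits(List<MyFileSplit> splits) {
            // Called both for fresh assignments from the enumerator and for
            // splits restored from a checkpoint after a failure; reading
            // resumes at each split's stored offset.
            assignedSplits.addAll(splits);
        }

        @Override
        public void notifyNoMoreSplits() {
            noMoreSplits = true;
        }

        @Override
        public CompletableFuture<Void> isAvailable() {
            return CompletableFuture.completedFuture(null);
        }

        @Override
        public void notifyCheckpointComplete(long checkpointId) {}

        @Override
        public void close() {}
    }

On restart, Flink re-creates the readers and hands back the splits that
were stored by `snapshotState` via `addSplits`, which is exactly how
the Kafka source resumes from its checkpointed offsets.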

Hope these are helpful, and hopefully I didn't mix anything up :)

Best,
Muhammet

[1]: 
https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/datastream/sources/
[2]: 
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-base/src/main/java/org/apache/flink/connector/base/source/reader/SourceReaderBase.java#L300-L304
[3]: 
https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-files/src/main/java/org/apache/flink/connector/file/src/FileSourceSplit.java
[4]: 
https://github.com/apache/flink-connector-kafka/blob/main/flink-connector-kafka/src/main/java/org/apache/flink/connector/kafka/source/split/KafkaPartitionSplit.java
[5]: 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface


On 2024-05-08 10:56, Jacob Rollings wrote:

Hello,

I'm curious about how Flink checkpointing would aid in recovering data
if the data source is not Kafka but another system. I understand that
checkpoint snapshots are taken at regular time intervals.

What happens to the data read after the previous successful checkpoint
if the system crashes before the next checkpoint is taken, especially
for data sources that are not Kafka and don't support offset
management?

-
JB