> Could the table/database sync with schema evolution (without Flink job restarts!) potentially work with the Iceberg sink?
Making this work would be a good addition to the Iceberg-Flink connector. It is definitely doable, but not a single-PR-sized task. If you want to try your hand at it, I will try to find time to review your plans/code, so your code could be incorporated into the upcoming releases.

Thanks, Peter

On Fri, May 24, 2024, 17:07 Andrew Otto <o...@wikimedia.org> wrote:

> > What is not is the automatic syncing of entire databases, with schema
> > evolution and detection of new (and dropped?) tables. :)
>
> Wait. Is it?
>
> > Flink CDC supports synchronizing all tables of source database instance
> > to downstream in one job by configuring the captured database list and
> > table list.
>
> On Fri, May 24, 2024 at 11:04 AM Andrew Otto <o...@wikimedia.org> wrote:
>
>> Indeed, using Flink-CDC to write to Flink Sink Tables, including Iceberg,
>> is supported.
>>
>> What is not is the automatic syncing of entire databases, with schema
>> evolution and detection of new (and dropped?) tables. :)
>>
>> On Fri, May 24, 2024 at 8:58 AM Giannis Polyzos <ipolyzos...@gmail.com>
>> wrote:
>>
>>> https://nightlies.apache.org/flink/flink-cdc-docs-stable/
>>> All these features come from Flink CDC itself. Because Paimon and Flink
>>> CDC are projects native to Flink, there is a strong integration between
>>> them. (I believe it's on the roadmap to support Iceberg as well.)
>>>
>>> On Fri, 24 May 2024 at 3:52 PM, Andrew Otto <o...@wikimedia.org> wrote:
>>>
>>>> > I'm curious if there is any reason for choosing Iceberg instead of
>>>> > Paimon
>>>>
>>>> No technical reason that I'm aware of. We are using it mostly because
>>>> of momentum. We looked at Flink Table Store (before it was Paimon), but
>>>> decided it was too early and the docs were too sparse at the time to
>>>> really consider it.
>>>>
>>>> > Especially for a use case like CDC that iceberg struggles to support.
>>>>
>>>> We aren't doing any CDC right now (for many reasons), but I have never
>>>> seen a feature like Paimon's database sync before. One job to sync and
>>>> evolve an entire database? That is amazing.
>>>>
>>>> If we could do this with Iceberg, we might be able to make an argument
>>>> to product managers to push for CDC.
>>>>
>>>> On Fri, May 24, 2024 at 8:36 AM Giannis Polyzos <ipolyzos...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm curious if there is any reason for choosing Iceberg instead of
>>>>> Paimon (other than: Iceberg is more popular).
>>>>> Especially for a use case like CDC that Iceberg struggles to support.
>>>>>
>>>>> On Fri, 24 May 2024 at 3:22 PM, Andrew Otto <o...@wikimedia.org>
>>>>> wrote:
>>>>>
>>>>>> Interesting, thank you!
>>>>>>
>>>>>> I asked this in the Paimon users group:
>>>>>>
>>>>>> How coupled to Paimon catalogs and tables is the CDC part of Paimon?
>>>>>> RichCdcMultiplexRecord
>>>>>> <https://github.com/apache/paimon/blob/cc7d308d166a945d8d498231ed8e2fc9c7a27fc5/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/RichCdcMultiplexRecord.java>
>>>>>> and related code seem incredibly useful even outside of the context
>>>>>> of the Paimon table format.
>>>>>>
>>>>>> I'm asking because the database sync action
>>>>>> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
>>>>>> feature is amazing. At the Wikimedia Foundation, we are on an all-in
>>>>>> journey with Iceberg. I'm wondering how hard it would be to extract
>>>>>> the CDC logic from Paimon and abstract the Sink bits.
>>>>>>
>>>>>> Could the table/database sync with schema evolution (without Flink
>>>>>> job restarts!) potentially work with the Iceberg sink?
>>>>>>
>>>>>> On Thu, May 23, 2024 at 4:34 PM Péter Váry <peter.vary.apa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> If I understand correctly, Paimon is sending `CdcRecord`-s [1] on
>>>>>>> the wire, which contain not only the data but the schema as well.
>>>>>>> With Iceberg we currently only send the row data, and expect to
>>>>>>> receive the schema on job start. This is more performant than
>>>>>>> sending the schema all the time, but has the obvious issue that it
>>>>>>> is not able to handle schema changes. Another part of the dynamic
>>>>>>> schema synchronization is the update of the Iceberg table schema:
>>>>>>> the schema should be updated for all of the writers and the
>>>>>>> committer, but only a single schema change commit is needed
>>>>>>> (allowed) to the Iceberg table.
>>>>>>>
>>>>>>> This is a very interesting, but non-trivial change.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java
>>>>>>>
>>>>>>> On Thu, May 23, 2024 at 21:59 Andrew Otto <o...@wikimedia.org> wrote:
>>>>>>>
>>>>>>>> Ah I see, so just auto-restarting to pick up new stuff.
>>>>>>>>
>>>>>>>> I'd love to understand how Paimon does this. They have a database
>>>>>>>> sync action
>>>>>>>> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
>>>>>>>> which will sync entire databases, handle schema evolution, and I'm
>>>>>>>> pretty sure (I think I saw this in my local test) also pick up new
>>>>>>>> tables.
>>>>>>>>
>>>>>>>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45
>>>>>>>>
>>>>>>>> I'm sure that the Paimon table format is great, but at Wikimedia
>>>>>>>> Foundation we are on the Iceberg train. Imagine if there was a
>>>>>>>> flink-cdc full database sync to the Flink IcebergSink!
>>>>>>>>
>>>>>>>> On Thu, May 23, 2024 at 3:47 PM Péter Váry <peter.vary.apa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I will ask Marton about the slides.
>>>>>>>>>
>>>>>>>>> The solution was something like this, in a nutshell:
>>>>>>>>> - Make sure that on job start the latest Iceberg schema is read
>>>>>>>>>   from the Iceberg table
>>>>>>>>> - Throw a SuppressRestartsException when data arrives with the
>>>>>>>>>   wrong schema
>>>>>>>>> - Use the Flink Kubernetes Operator to restart your failed jobs by
>>>>>>>>>   setting kubernetes.operator.job.restart.failed
>>>>>>>>>
>>>>>>>>> Thanks, Peter
>>>>>>>>>
>>>>>>>>> On Thu, May 23, 2024, 20:29 Andrew Otto <o...@wikimedia.org> wrote:
>>>>>>>>>
>>>>>>>>>> Wow, I would LOVE to see this talk. If there is no recording,
>>>>>>>>>> perhaps there are slides somewhere?
>>>>>>>>>>
>>>>>>>>>> On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda <sanabria.miranda.car...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>
>>>>>>>>>>> I have found on the Flink Forward website the following
>>>>>>>>>>> presentation: "Self-service ingestion pipelines with evolving
>>>>>>>>>>> schema via Flink and Iceberg
>>>>>>>>>>> <https://www.flink-forward.org/seattle-2023/agenda#self-service-ingestion-pipelines-with-evolving-schema-via-flink-and-iceberg>"
>>>>>>>>>>> by Márton Balassi from the 2023 conference in Seattle, but I
>>>>>>>>>>> cannot find the recording anywhere. I have found the recordings
>>>>>>>>>>> of the other presentations on the Ververica Academy website
>>>>>>>>>>> <https://www.ververica.academy/app>, but not this one.
>>>>>>>>>>>
>>>>>>>>>>> Does anyone know where I can find it? Or at least the slides?
>>>>>>>>>>>
>>>>>>>>>>> We are using Flink with the Iceberg sink connector to write
>>>>>>>>>>> streaming events to Iceberg tables, and we are researching how
>>>>>>>>>>> to handle schema evolution properly. I saw that presentation and
>>>>>>>>>>> I thought it could be of great help to us.
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>
>>>>>>>>>>> Carlos
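
[Editor's note] Péter's restart-based approach earlier in the thread (read the latest Iceberg schema on job start, fail non-recoverably on a schema mismatch, let the Flink Kubernetes Operator restart the job) can be sketched roughly as below. This is a minimal, dependency-free illustration under stated assumptions, not the code from the talk: `SchemaMismatchException` stands in for Flink's `SuppressRestartsException`, and a list of field names stands in for Iceberg's real `Schema` comparison.

```java
import java.util.List;
import java.util.Objects;

// Sketch of the fail-fast guard: the schema is read once from the Iceberg
// table at job start; any record arriving with a different schema fails the
// job non-recoverably, so an external supervisor -- e.g. the Flink Kubernetes
// Operator with kubernetes.operator.job.restart.failed: true -- restarts the
// job, which then picks up the evolved table schema.
public class SchemaGuard {

    // Stand-in for Flink's SuppressRestartsException (hypothetical here).
    static class SchemaMismatchException extends RuntimeException {
        SchemaMismatchException(String msg) { super(msg); }
    }

    private final List<String> tableFields; // read from the Iceberg table on job start

    SchemaGuard(List<String> tableFields) {
        this.tableFields = List.copyOf(tableFields);
    }

    /** Throws if an incoming record's fields differ from the table schema. */
    void check(List<String> recordFields) {
        if (!Objects.equals(tableFields, recordFields)) {
            throw new SchemaMismatchException(
                "record schema " + recordFields + " != table schema " + tableFields);
        }
    }

    public static void main(String[] args) {
        SchemaGuard guard = new SchemaGuard(List.of("id", "name"));
        guard.check(List.of("id", "name")); // matches: record passes through

        boolean failed = false;
        try {
            // Upstream schema evolved: a new "email" column appeared.
            guard.check(List.of("id", "name", "email"));
        } catch (SchemaMismatchException e) {
            failed = true; // in a real job this would fail and restart the job
        }
        System.out.println(failed ? "mismatch detected" : "no mismatch");
    }
}
```

In a real pipeline the `check` call would live inside the sink's record-processing path; the point of the design is that the job never tries to write with a stale schema, and the restart (not the running job) absorbs the schema change.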