> Could the table/database sync with schema evolution (without Flink job restarts!) potentially work with the Iceberg sink?
Making this work would be a good addition to the Iceberg-Flink connector. It is definitely doable, but not a single-PR-sized task. If you want to try your hand at it, I will try to find time to review your plans/code, so your code could be incorporated into the upcoming releases.

Thanks, Peter

On Fri, May 24, 2024, 17:07 Andrew Otto <o...@wikimedia.org> wrote:

> > What is not is the automatic syncing of entire databases, with schema
> > evolution and detection of new (and dropped?) tables. :)
>
> Wait. Is it?
>
> > Flink CDC supports synchronizing all tables of source database instance
> > to downstream in one job by configuring the captured database list and
> > table list.
>
> On Fri, May 24, 2024 at 11:04 AM Andrew Otto <o...@wikimedia.org> wrote:
>
>> Indeed, using Flink-CDC to write to Flink Sink Tables, including Iceberg,
>> is supported.
>>
>> What is not is the automatic syncing of entire databases, with schema
>> evolution and detection of new (and dropped?) tables. :)
>>
>> On Fri, May 24, 2024 at 8:58 AM Giannis Polyzos <ipolyzos...@gmail.com>
>> wrote:
>>
>>> https://nightlies.apache.org/flink/flink-cdc-docs-stable/
>>> All these features come from Flink CDC itself. Because Paimon and Flink
>>> CDC are projects native to Flink, there is a strong integration between
>>> them. (I believe it's on the roadmap to support Iceberg as well.)
>>>
>>> On Fri, 24 May 2024 at 3:52 PM, Andrew Otto <o...@wikimedia.org> wrote:
>>>
>>>> > I'm curious if there is any reason for choosing Iceberg instead of
>>>> > Paimon
>>>>
>>>> No technical reason that I'm aware of. We are using it mostly because
>>>> of momentum. We looked at Flink Table Store (before it was Paimon), but
>>>> decided it was too early and the docs were too sparse at the time to
>>>> really consider it.
>>>>
>>>> > Especially for a use case like CDC that iceberg struggles to support.
>>>>
>>>> We aren't doing any CDC right now (for many reasons), but I have never
>>>> seen a feature like Paimon's database sync before. One job to sync and
>>>> evolve an entire database? That is amazing.
>>>>
>>>> If we could do this with Iceberg, we might be able to make an argument
>>>> to product managers to push for CDC.
>>>>
>>>> On Fri, May 24, 2024 at 8:36 AM Giannis Polyzos <ipolyzos...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm curious if there is any reason for choosing Iceberg instead of
>>>>> Paimon (other than: Iceberg is more popular).
>>>>> Especially for a use case like CDC that Iceberg struggles to support.
>>>>>
>>>>> On Fri, 24 May 2024 at 3:22 PM, Andrew Otto <o...@wikimedia.org>
>>>>> wrote:
>>>>>
>>>>>> Interesting, thank you!
>>>>>>
>>>>>> I asked this in the Paimon users group:
>>>>>>
>>>>>> How coupled to Paimon catalogs and tables is the CDC part of Paimon?
>>>>>> RichCdcMultiplexRecord
>>>>>> <https://github.com/apache/paimon/blob/cc7d308d166a945d8d498231ed8e2fc9c7a27fc5/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/RichCdcMultiplexRecord.java>
>>>>>> and related code seem incredibly useful even outside of the context
>>>>>> of the Paimon table format.
>>>>>>
>>>>>> I'm asking because the database sync action
>>>>>> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
>>>>>> feature is amazing. At the Wikimedia Foundation, we are on an all-in
>>>>>> journey with Iceberg. I'm wondering how hard it would be to extract
>>>>>> the CDC logic from Paimon and abstract the Sink bits.
>>>>>>
>>>>>> Could the table/database sync with schema evolution (without Flink
>>>>>> job restarts!) potentially work with the Iceberg sink?
>>>>>>
>>>>>> On Thu, May 23, 2024 at 4:34 PM Péter Váry <peter.vary.apa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> If I understand correctly, Paimon is sending `CdcRecord`-s [1] on
>>>>>>> the wire, which contain not only the data but the schema as well.
>>>>>>> With Iceberg we currently only send the row data, and expect to
>>>>>>> receive the schema on job start. This is more performant than
>>>>>>> sending the schema all the time, but has the obvious issue that it
>>>>>>> is not able to handle schema changes. Another part of the dynamic
>>>>>>> schema synchronization is the update of the Iceberg table schema:
>>>>>>> the schema should be updated for all of the writers and the
>>>>>>> committer, but only a single schema change commit is needed
>>>>>>> (allowed) to the Iceberg table.
>>>>>>>
>>>>>>> This is a very interesting, but non-trivial change.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/sink/cdc/CdcRecord.java
>>>>>>>
>>>>>>> On Thu, May 23, 2024 at 21:59 Andrew Otto <o...@wikimedia.org> wrote:
>>>>>>>
>>>>>>>> Ah I see, so just auto-restarting to pick up new stuff.
>>>>>>>>
>>>>>>>> I'd love to understand how Paimon does this. They have a database
>>>>>>>> sync action
>>>>>>>> <https://paimon.apache.org/docs/master/flink/cdc-ingestion/mysql-cdc/#synchronizing-databases>
>>>>>>>> which will sync entire databases, handle schema evolution, and I'm
>>>>>>>> pretty sure (I think I saw this in my local test) also pick up new
>>>>>>>> tables.
>>>>>>>>
>>>>>>>> https://github.com/apache/paimon/blob/master/paimon-flink/paimon-flink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/SyncDatabaseActionBase.java#L45
>>>>>>>>
>>>>>>>> I'm sure that the Paimon table format is great, but at Wikimedia
>>>>>>>> Foundation we are on the Iceberg train. Imagine if there was a
>>>>>>>> flink-cdc full database sync to the Flink IcebergSink!
>>>>>>>>
>>>>>>>> On Thu, May 23, 2024 at 3:47 PM Péter Váry <peter.vary.apa...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I will ask Marton about the slides.
>>>>>>>>>
>>>>>>>>> The solution was something like this, in a nutshell:
>>>>>>>>> - Make sure that on job start the latest Iceberg schema is read
>>>>>>>>>   from the Iceberg table
>>>>>>>>> - Throw a SuppressRestartsException when data arrives with the
>>>>>>>>>   wrong schema
>>>>>>>>> - Use the Flink Kubernetes Operator to restart your failed jobs by
>>>>>>>>>   setting kubernetes.operator.job.restart.failed
>>>>>>>>>
>>>>>>>>> Thanks, Peter
>>>>>>>>>
>>>>>>>>> On Thu, May 23, 2024, 20:29 Andrew Otto <o...@wikimedia.org> wrote:
>>>>>>>>>
>>>>>>>>>> Wow, I would LOVE to see this talk. If there is no recording,
>>>>>>>>>> perhaps there are slides somewhere?
>>>>>>>>>>
>>>>>>>>>> On Thu, May 23, 2024 at 11:00 AM Carlos Sanabria Miranda <sanabria.miranda.car...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>
>>>>>>>>>>> I have found on the Flink Forward website the following
>>>>>>>>>>> presentation: "Self-service ingestion pipelines with evolving
>>>>>>>>>>> schema via Flink and Iceberg
>>>>>>>>>>> <https://www.flink-forward.org/seattle-2023/agenda#self-service-ingestion-pipelines-with-evolving-schema-via-flink-and-iceberg>"
>>>>>>>>>>> by Márton Balassi from the 2023 conference in Seattle, but I
>>>>>>>>>>> cannot find the recording anywhere. I have found the recordings
>>>>>>>>>>> of the other presentations on the Ververica Academy website
>>>>>>>>>>> <https://www.ververica.academy/app>, but not this one.
>>>>>>>>>>>
>>>>>>>>>>> Does anyone know where I can find it? Or at least the slides?
>>>>>>>>>>>
>>>>>>>>>>> We are using Flink with the Iceberg sink connector to write
>>>>>>>>>>> streaming events to Iceberg tables, and we are researching how
>>>>>>>>>>> to handle schema evolution properly. I saw that presentation and
>>>>>>>>>>> I thought it could be of great help to us.
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance!
>>>>>>>>>>>
>>>>>>>>>>> Carlos
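
[Editor's note] Péter's restart-based approach earlier in the thread (read the latest Iceberg schema on job start, fail non-recoverably on a schema mismatch, let the Flink Kubernetes Operator restart the job) can be sketched roughly as below. This is a minimal, dependency-free illustration under stated assumptions, not the code from the talk: `SchemaMismatchException` stands in for Flink's `SuppressRestartsException`, and a list of field names stands in for Iceberg's real `Schema` comparison.

```java
import java.util.List;
import java.util.Objects;

// Sketch of the fail-fast guard: the schema is read once from the Iceberg
// table at job start; any record arriving with a different schema fails the
// job non-recoverably, so an external supervisor -- e.g. the Flink Kubernetes
// Operator with kubernetes.operator.job.restart.failed: true -- restarts the
// job, which then picks up the evolved table schema.
public class SchemaGuard {

    // Stand-in for Flink's SuppressRestartsException (hypothetical here).
    static class SchemaMismatchException extends RuntimeException {
        SchemaMismatchException(String msg) { super(msg); }
    }

    private final List<String> tableFields; // read from the Iceberg table on job start

    SchemaGuard(List<String> tableFields) {
        this.tableFields = List.copyOf(tableFields);
    }

    /** Throws if an incoming record's fields differ from the table schema. */
    void check(List<String> recordFields) {
        if (!Objects.equals(tableFields, recordFields)) {
            throw new SchemaMismatchException(
                "record schema " + recordFields + " != table schema " + tableFields);
        }
    }

    public static void main(String[] args) {
        SchemaGuard guard = new SchemaGuard(List.of("id", "name"));
        guard.check(List.of("id", "name")); // matches: record passes through

        boolean failed = false;
        try {
            // Upstream schema evolved: a new "email" column appeared.
            guard.check(List.of("id", "name", "email"));
        } catch (SchemaMismatchException e) {
            failed = true; // in a real job this would fail and restart the job
        }
        System.out.println(failed ? "mismatch detected" : "no mismatch");
    }
}
```

In a real pipeline the `check` call would live inside the sink's record-processing path; the point of the design is that the job never tries to write with a stale schema, and the restart (not the running job) absorbs the schema change.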