Re: RichCdcSinkBuilder with Iceberg catalog?

Giannis Polyzos Fri, 19 Jul 2024 12:05:47 -0700

That's great, it's always great to see how companies leverage these
technologies.


I will leave some PMCs to address this with more accuracy, but to the best
of my knowledge there shouldn’t be any differences between Paimon actions
jar and Flink CDC.

It probably boils down to preference, I guess, i.e. YAML vs. deploying a jar
file.

Best

On Fri, 19 Jul 2024 at 10:01 PM, Andrew Otto <o...@wikimedia.org> wrote:

> > the Paimon / Iceberg snapshot was shipped last week in the master
> branch and will come with the official 0.9 release I believe.
>
> Very cool!  Look forward to trying it.
>
> > Flink CDC added paimon support in the latest release thus you might see
> both projects supporting a variant.
>
> I see, so Paimon's sync actions predate the Flink-CDC Paimon pipeline
> connector?  Are there any major differences? Are there any reasons to use
> Paimon sync action
> <https://paimon.apache.org/docs/0.8/flink/cdc-ingestion/mysql-cdc/> over
> Flink-CDC Paimon pipeline connector?
>
> > Hope this provides some more context
>
> Thanks Giannis!  All that sounds amazing.  I hope to find some time to
> play with Paimon more.  I'm interested in seamless MariaDB -> Hadoop CDC,
> which Paimon seems to be pretty good at.   Paimon's Spark query support may
> make the fact that it is not Iceberg a non-issue at our org. (Spark users
> shouldn't really have to care...I think.)
>
> I'm also interested in the possibility of using CDC to produce state
> change events to Kafka for use outside of a Data Lake.
>
> We'll be hopefully documenting our experiments and findings for the
> Wikimedia Foundation's Data Platform Engineering team here.
>
> Thank you!
>
> On Fri, Jul 19, 2024 at 2:52 PM Giannis Polyzos <ipolyzos...@gmail.com>
> wrote:
>
>> Hi Andrew,
>> the Paimon / Iceberg snapshot was shipped last week in the master branch
>> and will come with the official 0.9 release I believe.
>> As noted in the thread, Paimon has done lots of work in terms of CDC and
>> strong integration with Flink CDC.
>> Flink CDC added Paimon support in the latest release, thus you might see
>> both projects supporting a variant. Flink CDC itself aims to make the
>> process easier via YAML files, and you can use it as well.
>>
>> In terms of Paimon in the context of Flink, the reason it has done so
>> much work on CDC is that data mutation is also what is required for
>> stream processing, i.e. changelog streams.
>> It offers a more cost-efficient way (with some cost/latency trade-off) to
>> replace Flink operations like aggregations or expensive streaming joins
>> (via partial-updates).
>> At the same time, it can replace expensive message queues via the
>> bucketed append, as it provides consistent message queue functionality.
>> Along with that, with the use of deletion vectors it also provides an
>> environment (cheaper storage) for OLAP.
>>
>> Some rough numbers from production use cases:
>> - message queue functionality with 20-30 second latencies (atm there are
>> a few use cases in production with 10-second latencies but there are still
>> a few challenges there to keep the resources low)
>> - OLAP queries: 30-60 seconds data freshness and OLAP queries ~1-5 seconds
>> and of course, all the CDC and stream processing functionality, where
>> amazing work has been done.
>>
>> Overall, the recommended latency for CDC and processing is around
>> the minute level, to also account for small file problems in case you don't
>> have enough data.
>>
>> Hope this provides some more context on the project and helps you see
>> whether it can fit more use cases.
>>
>> On Fri, Jul 19, 2024 at 9:25 PM Andrew Otto <o...@wikimedia.org> wrote:
>>
>>> TIL about XTable.  Cool!
>>>
>>>
>>> On Fri, Jul 19, 2024 at 2:11 PM Kyle Weller <k...@onehouse.ai> wrote:
>>>
>>>> I wonder if Apache XTable <https://xtable.apache.org/> is also a
>>>> viable option to consider? Data could still be written and stored natively
>>>> as Paimon, with the Iceberg manifest files generated asynchronously and
>>>> synced to an Iceberg catalog. It is working great between Iceberg, Hudi, and
>>>> Delta today in production. There may be some code in that project to
>>>> leverage, or adding a Paimon XTable interface would automatically unlock
>>>> omnidirectional translation to all 4 table formats versus a one-by-one
>>>> integration.
>>>>
>>>> On Fri, Jul 19, 2024 at 8:41 AM Andrew Otto <o...@wikimedia.org> wrote:
>>>>
>>>>> > > Another approach is to create a snapshot compatible way for Paimon
>>>>>  to generate Iceberg, which is what we are working on.
>>>>> Hi, just checking in!  How is this going? Thanks!
>>>>>
>>>>> On Mon, Jun 10, 2024 at 9:17 AM Andrew Otto <o...@wikimedia.org>
>>>>> wrote:
>>>>>
>>>>>> Awesome, I look forward to it!  Thank you!
>>>>>>
>>>>>> On Mon, Jun 10, 2024 at 2:35 AM Jingsong Li <jingsongl...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> We are developing a prototype internally.
>>>>>>>
>>>>>>> It will take about 2 to 3 months.
>>>>>>>
>>>>>>> On Wed, May 29, 2024 at 21:46, Andrew Otto <o...@wikimedia.org> wrote:
>>>>>>>
>>>>>>>> > Another approach is to create a snapshot compatible way for
>>>>>>>> Paimon to generate Iceberg, which is what we are working on.
>>>>>>>>
>>>>>>>> Oh!  Very interesting.  Can you say more? And/or do you have links
>>>>>>>> to Jira or anything?
>>>>>>>>
>>>>>>>> Thanks for your response! :)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, May 29, 2024 at 7:41 AM Jingsong Li <jingsongl...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Andrew,
>>>>>>>>>
>>>>>>>>> It is difficult to move this mechanism to the Iceberg sink. The table
>>>>>>>>> structure change in Iceberg's design requires generating a new
>>>>>>>>> snapshot, which poses significant challenges to schema evolution.
>>>>>>>>>
>>>>>>>>> Another approach is to create a snapshot compatible way for Paimon to
>>>>>>>>> generate Iceberg, which is what we are working on.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Jingsong
>>>>>>>>>
>>>>>>>>> On Fri, May 24, 2024 at 8:11 PM Andrew Otto <o...@wikimedia.org>
>>>>>>>>> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi!
>>>>>>>>> >
>>>>>>>>> > How coupled to Paimon catalogs and tables is the cdc part of
>>>>>>>>> Paimon?  RichCdcMultiplexRecord and related code seem incredibly useful
>>>>>>>>> even outside of the context of the Paimon table format.
>>>>>>>>> >
>>>>>>>>> > I'm asking because the database sync action feature is amazing.
>>>>>>>>> At the Wikimedia Foundation, we are on an all-in journey with Iceberg.  I'm
>>>>>>>>> wondering how hard it would be to extract the CDC logic from Paimon and
>>>>>>>>> abstract the Sink bits.
>>>>>>>>> >
>>>>>>>>> > Could the table/database sync with schema evolution (without
>>>>>>>>> Flink job restarts!) potentially work with the Iceberg sink?
>>>>>>>>> >
>>>>>>>>> > Thanks!
>>>>>>>>> > -Andrew Otto
>>>>>>>>> >  Wikimedia Foundation
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>
