Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-28 Thread Vinoth Chandar
Hi Danny, Thanks, I will review this asap. It's already in the "review in progress" column. Thanks, Vinoth. On Thu, Apr 22, 2021 at 12:49 AM Danny Chan wrote: > Should we throw together a PoC/test code for an example Flink pipeline that will use hudi cdc flags + stateful operators? > I

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-22 Thread Danny Chan
> Should we throw together a PoC/test code for an example Flink pipeline that will use hudi cdc flags + stateful operators? I have updated the PR https://github.com/apache/hudi/pull/2854, see the test case HoodieDataSourceITCase#testStreamReadWithDeletes. A data source: change_flag | uuid |
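A minimal sketch of what such a change-flagged source looks like on the Flink side, where each record carries its flag as a RowKind (the ids and values are made up for illustration; only Row/RowKind are Flink's actual API):

import org.apache.flink.types.Row;
import org.apache.flink.types.RowKind;

public class ChangeFlagRowsSketch {
  public static void main(String[] args) {
    // Each record of the streaming read carries a change flag; Flink models
    // it as the RowKind attached to the row.
    Row[] changelog = {
        Row.ofKind(RowKind.INSERT, "id1", "Danny", 23),
        Row.ofKind(RowKind.UPDATE_BEFORE, "id1", "Danny", 23),
        Row.ofKind(RowKind.UPDATE_AFTER, "id1", "Danny", 24),
        Row.ofKind(RowKind.DELETE, "id1", "Danny", 24)
    };
    for (Row row : changelog) {
      // shortString() prints the familiar +I / -U / +U / -D markers.
      System.out.println(row.getKind().shortString() + " " + row);
    }
  }
}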

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Vinoth Chandar
Keeping compatibility is a must. i.e. users should be able to upgrade to the new release with the _hoodie_cdc_flag meta column, and be able to query new data (with this new meta col) alongside old data (without this new meta col). In fact, they should be able to downgrade back to previous versions
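A sketch of how the new meta column could stay backward compatible, assuming it is declared as a nullable field with a null default (the field name is from this thread; the exact schema shape is an assumption, not the PR's actual code):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class CdcFlagSchemaSketch {
  public static void main(String[] args) {
    // An optional (nullable, null-default) field lets old files that lack the
    // column still resolve against the new schema, and lets a downgrade
    // simply ignore the extra column.
    Schema schema = SchemaBuilder.record("HoodieRecord").fields()
        .requiredString("_hoodie_commit_time")
        .optionalString("_hoodie_cdc_flag") // new meta column; null for old data
        .requiredString("uuid")
        .endRecord();
    System.out.println(schema.toString(true));
  }
}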

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Danny Chan
> Is it providing the ability to author continuous queries on Hudi source tables end-end, given Flink can use the flags to generate retract/upsert streams Yes, that's the key point: with these flags plus Flink stateful operators, we can have a real-time incremental ETL pipeline. For example, a
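A sketch of such a pipeline in Flink SQL, reading a Hudi MOR table as a changelog and feeding a stateful aggregation (the connector option names follow the Hudi Flink connector of that era and may differ across releases):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class IncrementalEtlSketch {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

    // Streaming read of a Hudi MERGE_ON_READ table as a changelog source.
    tEnv.executeSql(
        "CREATE TABLE hudi_source ("
            + "  uuid STRING, name STRING, ts TIMESTAMP(3),"
            + "  PRIMARY KEY (uuid) NOT ENFORCED"
            + ") WITH ("
            + "  'connector' = 'hudi',"
            + "  'path' = 'file:///tmp/hudi_source',"
            + "  'table.type' = 'MERGE_ON_READ',"
            + "  'read.streaming.enabled' = 'true'"
            + ")");

    // A stateful COUNT: with change flags, a delete retracts the old count
    // instead of being dropped, so the result stays correct incrementally.
    tEnv.executeSql(
        "CREATE TABLE sink (name STRING, cnt BIGINT) WITH ('connector' = 'print')");
    tEnv.executeSql(
        "INSERT INTO sink SELECT name, COUNT(*) AS cnt FROM hudi_source GROUP BY name");
  }
}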

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Vinoth Chandar
Hi Danny, Read up on the Flink docs as well. If we don't actually publish data to the meta column, I think the overhead is pretty low w.r.t. Avro/Parquet. Both are very good at encoding nulls. But I feel it's worth adding a HoodieWriteConfig to control this, and since addition of meta columns
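A sketch of what such a toggle could look like; "hoodie.write.cdc.flag.enabled" is an invented key for illustration, not an existing Hudi option:

import java.util.Properties;
import org.apache.hudi.config.HoodieWriteConfig;

public class CdcFlagToggleSketch {
  public static void main(String[] args) {
    // Hypothetical key for the proposed toggle; default off would keep the
    // column absent (or all-null, which Avro/Parquet encode very cheaply).
    Properties props = new Properties();
    props.setProperty("hoodie.write.cdc.flag.enabled", "true");

    HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
        .withPath("file:///tmp/hudi_table")
        .withProps(props)
        .build();
    System.out.println(config.getProps());
  }
}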

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-20 Thread Danny Chan
Hi, I have created a PR here: https://github.com/apache/hudi/pull/2854/files In the PR I make these changes: 1. Add a metadata column: "_hoodie_cdc_operation". I did not add a config option because I cannot find a good way to make the code clean; a metadata column is very primitive and a config
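If the column lands as in the PR, a SQL reader could project it like any other meta field. A sketch (whether the connector exposes the column in the table schema exactly like this is an assumption):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CdcOperationQuerySketch {
  public static void main(String[] args) {
    TableEnvironment tEnv =
        TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

    // Declare the PR's metadata column alongside the user columns so it can
    // be projected in plain SQL.
    tEnv.executeSql(
        "CREATE TABLE hudi_source ("
            + "  uuid STRING, name STRING, _hoodie_cdc_operation STRING"
            + ") WITH ("
            + "  'connector' = 'hudi',"
            + "  'path' = 'file:///tmp/hudi_source'"
            + ")");
    tEnv.executeSql("SELECT uuid, _hoodie_cdc_operation FROM hudi_source").print();
  }
}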

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-19 Thread Danny Chan
Thanks @Sivabalan ~ I agree that parquet and log files should stay in sync on metadata columns, in case there are confusions and special handling in some use cases like compaction. I also agree that adding a metadata column is easier to use for SQL connectors. We can add a metadata column named

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-16 Thread Sivabalan
W.r.t. changes: if we plan to add this only to log files, at a minimum compaction needs to be fixed to omit this column. On Fri, Apr 16, 2021 at 9:07 PM Sivabalan wrote: > Just got a chance to read about dynamic tables. Sounds interesting. > Some thoughts on your questions: > - yes, just MOR

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-16 Thread Sivabalan
Just got a chance to read about dynamic tables. Sounds interesting. Some thoughts on your questions: - yes, just MOR makes sense. - But adding this new meta column only to avro logs might incur some non-trivial changes, since as of today the schemas of the avro and base files are in sync. If this new col

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-15 Thread Danny Chan
Thanks Vinoth ~ Here is a document about the notion of "Flink Dynamic Tables" [1]: every operator that accumulates state can handle retractions (UPDATE_BEFORE or DELETE) and then apply new changes (INSERT or UPDATE_AFTER), so that each operator can consume CDC-format messages in a streaming way.
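A toy model of the accumulate/retract mechanics described above (not Flink code, just the bookkeeping a stateful counting operator would do; the flag strings are hypothetical):

import java.util.HashMap;
import java.util.Map;

public class RetractingCountSketch {
  private final Map<String, Long> counts = new HashMap<>();

  // Apply one change-flagged record to the accumulated state: additions on
  // INSERT/UPDATE_AFTER, retractions on DELETE/UPDATE_BEFORE.
  public void apply(String key, String flag) {
    switch (flag) {
      case "I":  // insert
      case "+U": // update, new value
        counts.merge(key, 1L, Long::sum);
        break;
      case "D":  // delete
      case "-U": // update, old value
        counts.merge(key, -1L, Long::sum);
        break;
      default:
        throw new IllegalArgumentException("unknown flag: " + flag);
    }
  }

  public static void main(String[] args) {
    RetractingCountSketch op = new RetractingCountSketch();
    op.apply("Danny", "I");   // count -> 1
    op.apply("Danny", "-U");  // retract -> 0
    op.apply("Danny", "+U");  // re-add -> 1
    op.apply("Danny", "D");   // delete -> 0
    System.out.println(op.counts);
  }
}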

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-15 Thread Vinoth Chandar
Hi, Is the intent of the flag to convey if an insert, delete, or update changed the record? If so, I would imagine that we do this even for CoW tables, since that also supports a logical notion of a change stream using the commit_time meta field. You may be right, but I am trying to understand the
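For reference, the existing commit-time-based change stream looks like this from Spark (option keys as in Hudi's Spark datasource; path and instant time are placeholders). It tells you what changed since an instant, but not whether each record was an insert, update, or delete:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalReadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-incremental-read")
        .master("local[*]")
        .getOrCreate();

    // Incremental query: pull records committed after the given instant,
    // using the _hoodie_commit_time meta field as the logical change stream.
    Dataset<Row> changes = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20210401000000")
        .load("file:///tmp/hudi_table");
    changes.show();
  }
}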

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-08 Thread Danny Chan
I tried to do a POC for Flink locally and it works well. In the PR I add a new metadata column named "_hoodie_change_flag", but actually I found that only the log format needs this flag, and Spark may have no ability to handle the flag for incremental processing yet. So should I add the

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-04-01 Thread Danny Chan
Thanks, cool. Then the remaining questions are: - where we record these changes: should we add a builtin meta field such as _change_flag_, like the other system columns, e.g. _hoodie_commit_time - what kind of table should keep these flags: in my opinion, we should only add these flags for

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-03-31 Thread vino yang
>> Oops, the image is broken. For "change flags", I mean: insert, update (before and after) and delete. Yes, the image I attached is also about these flags. [image: image (3).png] +1 for the idea. Best, Vino. Danny Chan wrote on Thu, Apr 1, 2021 at 10:03 AM: > Oops, the image is broken. For "change flags", I

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-03-31 Thread Danny Chan
Oops, the image is broken. For "change flags", I mean: insert, update (before and after) and delete. The Flink engine can propagate the change flags internally between its operators; if HUDI can send the change flags to Flink, the incremental calculation of CDC would be very natural (almost
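A sketch of that handoff, assuming Hudi persists the four flags as short strings (the string encoding is an assumption; RowKind is Flink's own enum for exactly these four changes):

import org.apache.flink.types.RowKind;

public class ChangeFlagMappingSketch {
  // Maps a change flag persisted by Hudi onto the RowKind that Flink
  // propagates between its operators. The flag strings are hypothetical.
  static RowKind toRowKind(String flag) {
    switch (flag) {
      case "I":  return RowKind.INSERT;        // insert
      case "-U": return RowKind.UPDATE_BEFORE; // update, old value
      case "+U": return RowKind.UPDATE_AFTER;  // update, new value
      case "D":  return RowKind.DELETE;        // delete
      default: throw new IllegalArgumentException("unknown flag: " + flag);
    }
  }

  public static void main(String[] args) {
    System.out.println(toRowKind("-U")); // UPDATE_BEFORE
  }
}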

Re: [DISCUSS] Incremental computation pipeline for HUDI

2021-03-31 Thread vino yang
Hi Danny, Thanks for kicking off this discussion thread. Yes, incremental query (or "incremental processing") has always been an important feature of the Hudi framework. If we can make this feature better, it will be even more exciting. In the data warehouse, in some complex calculations,

[DISCUSS] Incremental computation pipeline for HUDI

2021-03-31 Thread Danny Chan
Hi dear HUDI community ~ Here I want to start a discussion about using HUDI as the unified storage/format for data warehouse/lake incremental computation. Usually people divide data warehouse production into several levels, such as the ODS (operational data store), DWD (data warehouse details), DWS (data