Re: Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-07 Thread Vinoth Chandar
Pulled in another reviewer as well. Left a comment. We can move the discussion to the PR? Thanks for the useful contribution! On Thu, Apr 6, 2023 at 12:34 AM 孔维 <18701146...@163.com> wrote: > Hi, vinoth, > > I created a PR(https://github.com/apache/hudi/pull/8376) for this > feature, could you

Re:Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-06 Thread 孔维
Hi Vinoth, I created a PR (https://github.com/apache/hudi/pull/8376) for this feature. Could you help review it? BR, Kong At 2023-04-05 00:19:20, "Vinoth Chandar" wrote: >Look forward to this! could really help backfill/rebootstrap scenarios. > >On Tue, Apr 4, 2023 at 9:18 AM

Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-04 Thread Vinoth Chandar
Look forward to this! could really help backfill/rebootstrap scenarios. On Tue, Apr 4, 2023 at 9:18 AM Vinoth Chandar wrote: > Thinking out loud. > > 1. For insert operations, it should not matter anyway. > 2. For upsert etc, the preCombine would handle the ordering problems. > > Is that what

Re: Re: [DISCUSS] split source of kafka partition by count

2023-04-04 Thread Vinoth Chandar
Thinking out loud. 1. For insert operations, it should not matter anyway. 2. For upsert etc., the preCombine would handle the ordering problems. Is that what you are saying? I feel we don't want to leak any Kafka-specific logic or force the use of special payloads etc. Thoughts? I assigned the jira

Re:Re: [DISCUSS] split source of kafka partition by count

2023-04-03 Thread 孔维
Hi, Yes, we can create multiple Spark input partitions per Kafka partition. I think the write operations can handle the potentially out-of-order events, because before writing we need to preCombine the incoming events using the source-ordering-field, and we also need to combineAndGetUpdateValue
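
The ordering argument in this message can be illustrated with a minimal sketch. It is not Hudi's actual preCombine code: the `Record` tuple and the `pre_combine` function are hypothetical names, and the sketch only assumes that each record key keeps the event with the largest ordering value. Under that assumption, the order in which splits of one Kafka partition arrive does not change the result, as long as ordering values for a key are distinct.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (record_key, ordering_value, payload).
Record = Tuple[str, int, str]


def pre_combine(records: Iterable[Record]) -> List[Record]:
    """Keep, for each record key, the record with the largest ordering value."""
    latest: Dict[str, Record] = {}
    for rec in records:
        key, ordering, _payload = rec
        current = latest.get(key)
        # Replace the stored record only when the new one is at least as recent.
        if current is None or ordering >= current[1]:
            latest[key] = rec
    return list(latest.values())


# Two splits of the same Kafka partition, processed in either order.
split_a = [("k1", 1, "v1"), ("k2", 2, "v2")]
split_b = [("k1", 3, "v3"), ("k2", 4, "v4")]
in_order = sorted(pre_combine(split_a + split_b))
reversed_order = sorted(pre_combine(split_b + split_a))
```

Here `in_order` and `reversed_order` contain the same records, which is the property the message relies on. Records with equal ordering values for the same key would need a tie-breaking rule that this sketch does not define.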

Re: [DISCUSS] split source of kafka partition by count

2023-04-03 Thread Vinoth Chandar
Hi, Does your implementation read out offset ranges from Kafka partitions? Which means we can create multiple Spark input partitions per Kafka partition? If so, +1 for the overall goals here. How does this affect ordering? Can you think about how/if Hudi write operations can handle potentially

[DISCUSS] split source of kafka partition by count

2023-03-30 Thread 孔维
Hi team, for the Kafka source, the default parallelism when pulling data from Kafka is the number of Kafka partitions. There are cases where a large amount of data is pulled from Kafka (e.g. maxEvents=1) but the number of Kafka partitions is too small, so the pull will cost too much
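
The idea of splitting one Kafka partition by event count can be sketched as follows. This is an illustration under stated assumptions, not the implementation in the PR: `OffsetRange`, `split_by_count`, and `max_events_per_split` are hypothetical names, and offsets are assumed to be half-open ranges (start inclusive, end exclusive).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OffsetRange:
    """Hypothetical offset range for one Kafka topic partition."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def split_by_count(rng: OffsetRange, max_events_per_split: int) -> List[OffsetRange]:
    """Cut one offset range into consecutive ranges of at most max_events_per_split events."""
    if max_events_per_split <= 0:
        raise ValueError("max_events_per_split must be positive")
    splits: List[OffsetRange] = []
    start = rng.from_offset
    while start < rng.until_offset:
        # The last split may be shorter than the others.
        end = min(start + max_events_per_split, rng.until_offset)
        splits.append(OffsetRange(rng.topic, rng.partition, start, end))
        start = end
    return splits


# One Kafka partition holding 10 events, read as splits of at most 4 events.
example_splits = split_by_count(OffsetRange("events", 0, 0, 10), 4)
```

Each resulting range could then back its own Spark input partition, so the read parallelism is no longer bounded by the number of Kafka partitions. The splits cover the original range without gaps or overlap, but events from one Kafka partition may now be processed out of order across splits.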