Pulled in another reviewer as well. Left a comment. Shall we move the
discussion to the PR?
Thanks for the useful contribution!
On Thu, Apr 6, 2023 at 12:34 AM 孔维 <18701146...@163.com> wrote:
Hi Vinoth,
I created a PR (https://github.com/apache/hudi/pull/8376) for this feature;
could you help review it?
BR,
Kong
At 2023-04-05 00:19:20, "Vinoth Chandar" wrote:
Looking forward to this! It could really help backfill/rebootstrap scenarios.
On Tue, Apr 4, 2023 at 9:18 AM Vinoth Chandar wrote:
Thinking out loud.
1. For insert operations, it should not matter anyway.
2. For upsert etc, the preCombine would handle the ordering problems.
Is that what you are saying? I feel we don't want to leak any Kafka-specific
logic or force the use of special payloads, etc. Thoughts?
I assigned the Jira.
Hi,
Yea, we can create multiple Spark input partitions per Kafka partition.
I think the write operations can handle the potentially out-of-order events,
because before writing we need to preCombine the incoming events using the
source-ordering-field, and we also need to combineAndGetUpdateValue against
the records already on storage.
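To illustrate the idea (this is a minimal sketch, not Hudi's actual payload classes; the `Event` record and `preCombine` helper here are made up), preCombine-style deduplication keeps, for each key, the record with the highest ordering value, so out-of-order arrival within a batch doesn't matter:

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of preCombine-style dedup: among incoming records sharing a key,
// keep the one with the highest ordering value (e.g. an event timestamp).
// Names are illustrative, not Hudi's real API.
public class PreCombineSketch {
    record Event(String key, long orderingValue, String payload) {}

    static Collection<Event> preCombine(List<Event> incoming) {
        // TreeMap only so the demo output below is deterministic.
        Map<String, Event> latest = new TreeMap<>();
        for (Event e : incoming) {
            latest.merge(e.key(), e,
                (cur, nxt) -> nxt.orderingValue() >= cur.orderingValue() ? nxt : cur);
        }
        return latest.values();
    }

    public static void main(String[] args) {
        // Out-of-order arrival: the newer update (ts=2) arrives first.
        List<Event> events = List.of(
            new Event("k1", 2, "v2"),
            new Event("k1", 1, "v1"),
            new Event("k2", 5, "w5"));
        for (Event e : preCombine(events)) {
            System.out.println(e.key() + "=" + e.payload());
        }
        // prints: k1=v2 then k2=w5 (the stale k1/v1 record is dropped)
    }
}
```

combineAndGetUpdateValue would then play the same role one step later, merging the surviving incoming record against what is already on storage.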
Hi,
Does your implementation read out offset ranges from the Kafka partitions,
which means we can create multiple Spark input partitions per Kafka
partition?
If so, +1 for the overall goals here.
How does this affect ordering? Can you think about how/if Hudi write
operations can handle potentially out-of-order events?
Hi team, for the Kafka source, when pulling data from Kafka, the default
parallelism is the number of Kafka partitions.
There are cases where we pull a large amount of data from Kafka (eg.
maxEvents=1), but the number of Kafka partitions is not enough, so the pull
costs too much time.
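The proposal boils down to plain offset arithmetic: split each Kafka partition's `[from, until)` offset range into several sub-ranges, each backing its own Spark input partition. The sketch below is illustrative only (the class, record, and method names are made up, not the PR's actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split one Kafka partition's offset range
// [from, until) into up to maxSplits roughly equal sub-ranges, each of
// which could back its own Spark input partition.
public class OffsetRangeSplitter {
    record OffsetRange(long from, long until) {}

    static List<OffsetRange> split(long from, long until, int maxSplits) {
        List<OffsetRange> ranges = new ArrayList<>();
        long total = until - from;
        if (total <= 0 || maxSplits <= 0) {
            return ranges; // nothing to read, or invalid split count
        }
        // Never create more splits than there are offsets to read.
        int splits = (int) Math.min(maxSplits, total);
        long base = total / splits;
        long rem = total % splits;
        long start = from;
        for (int i = 0; i < splits; i++) {
            long size = base + (i < rem ? 1 : 0); // spread the remainder
            ranges.add(new OffsetRange(start, start + size));
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 10 offsets split across 3 Spark partitions -> sizes 4, 3, 3.
        System.out.println(split(100, 110, 3));
    }
}
```

With splits like these, one hot Kafka partition no longer caps the read parallelism at the partition count, which is exactly the large-backfill case described above.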