Hi, I have updated the CWIKI with a new RFC [1], let's move the
discussion there ~

[1]
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+24%3A+Hoodie+Flink+Writer+Proposal

Best,
Danny Chan

On Fri, Jan 8, 2021 at 10:34 AM Danny Chan <danny0...@apache.org> wrote:

> > We can maintain one or two, although we are both trying to find a good
> > way to maintain one app (entry point).
>
> Yes, after some offline discussion, we can have only one app there, for
> both Flink 1.11+ and lower versions.
>
> > > Hudi has a rich set of table services already built out (cleaning,
> > > compaction, clustering,
>
> That's true. Personally, I want to integrate these features into the Flink
> writer as soon as possible, but before that we should make the basic
> functionality robust enough, e.g. the write and read paths.
>
> On Thu, Jan 7, 2021 at 8:12 PM vino yang <yanghua1...@gmail.com> wrote:
>
>> +1 on Gary's opinion,
>>
>> Yes, the public APIs that come from AbstractHoodieWriteClient should be
>> reusable.
>>
>> We could try to make the HoodieFlinkWriteClient a common implementation.
>>
>> IIUC, there is a mapping like this:
>>
>> SparkRDDWriteClient -> HoodieFlinkWriteClient
>> HoodieDeltaStreamer -> HoodieFlinkStreamer (could there be more than one?)
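>>
>> As a rough illustration of that reuse, here is a minimal, hypothetical
>> sketch (the class and method names below are simplified stand-ins, not the
>> real Hudi signatures): the engine-specific clients share one abstract base.
>>
>> import java.util.List;
>>
>> // Hypothetical stand-in types so the sketch compiles on its own.
>> class HoodieRecord {}
>> class WriteStatus {}
>>
>> // Engine-agnostic base carrying the public write APIs, in the spirit of
>> // AbstractHoodieWriteClient (signature heavily simplified here).
>> abstract class AbstractWriteClientSketch<I, O> {
>>   // Shared logic (instant handling, commit, rollback) would live here.
>>   abstract O upsert(I records, String instantTime);
>> }
>>
>> // Flink flavour, analogous to HoodieFlinkWriteClient: it works on plain
>> // Java lists handed over by the Flink operators. A Spark flavour
>> // (analogous to SparkRDDWriteClient) would plug RDD types into the same slots.
>> class FlinkWriteClientSketch
>>     extends AbstractWriteClientSketch<List<HoodieRecord>, List<WriteStatus>> {
>>   @Override
>>   List<WriteStatus> upsert(List<HoodieRecord> records, String instantTime) {
>>     // A real implementation would write the file groups and return their statuses.
>>     return List.of(new WriteStatus());
>>   }
>> }
>>
>> Whether there is one HoodieFlinkStreamer on top or two, they could both
>> drive this same shared client.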
>>
>> Actually, the divergence between Danny and me is whether we need one
>> HoodieFlinkStreamer or two HoodieFlinkStreamers.
>>
>> We can maintain one or two, although we are both trying to find a good
>> way to maintain one app (entry point).
>>
>> Correct me if I am wrong.
>>
>> Best,
>> Vino
>>
>>
>> On Thu, Jan 7, 2021 at 4:31 PM Gary Li <garyli1...@outlook.com> wrote:
>>
>> > Hi all,
>> >
>> > IIUC the current Flink writer is like an app, just like the delta
>> > streamer. If we want to build another Flink writer, we can still share
>> > the same Flink client, right? Does the Flink client also have to use the
>> > new feature only available on Flink 1.12?
>> >
>> > Thanks,
>> > Gary Li
>> > ________________________________
>> > From: Danny Chan <danny0...@apache.org>
>> > Sent: Thursday, January 7, 2021 10:19 AM
>> > To: dev@hudi.apache.org <dev@hudi.apache.org>
>> > Subject: Re: Re: [DISCUSS] New Flink Writer Proposal
>> >
>> > Thanks vino yang ~
>> >
>> > IMO, we should not put too much energy into the current Flink writer;
>> > it is not production-ready in the long run. There are so many features
>> > that need to be added/supported for the Flink write/read path (MOR write,
>> > COW read, MOR read, the new index) that we should focus on one version
>> > first and make it robust.
>> >
>> > I really hope that we can work together to make the writer
>> > production-ready as soon as possible. The space is competitive, with
>> > competitors like Apache Iceberg and Delta Lake, so from this perspective
>> > there is no benefit in staying compatible with the current writer.
>> >
>> > My idea is that I propose the new infrastructure first as quickly as
>> > possible (the basic pipeline, the test framework), and then we can work
>> > together on the new version (MOR write, COW read, MOR read, the new
>> > index); we had better not get distracted by promoting the old writer.
>> >
>> > What do you think?
>> >
>> > On Wed, Jan 6, 2021 at 2:14 PM vino yang <yanghua1...@gmail.com> wrote:
>> >
>> > > Hi Danny,
>> > >
>> > > As we discussed in the doc, we should agree on whether we should be
>> > > compatible with versions lower than Flink 1.11/1.12.
>> > >
>> > > We all know that there are some bottlenecks in the current plan. You
>> > > proposed some improvements, which is great, but they rely heavily on
>> > > newer features provided by Flink. It is a pity that users of older
>> > > versions of Flink would have no way to benefit from them.
>> > >
>> > > The information I can provide is that some users have already used the
>> > > current Flink write client, or an improved version of it, in a
>> > > production environment. For example, SF Technology, where the Flink
>> > > versions in use are 1.8.x and 1.10.x.
>> > >
>> > > Therefore, I personally suggest two options:
>> > >
>> > > 1) The new design takes users of lower versions into account as much
>> > > as possible, and we maintain a single client version;
>> > > 2) The new design is based on the features of the new Flink version and
>> > > evolves separately from the old version (we also have a plan to optimize
>> > > the current implementation), but the public abstraction can be reused.
>> > > I think it is not impossible to maintain multiple versions. Flink used
>> > > to support 4+ versions (0.8.2, 0.9, 0.10, 0.11, universal connector) of
>> > > the Kafka connector, but they share the same code base.
>> > >
>> > > Any thoughts and opinions are welcome and appreciated.
>> > >
>> > > Best,
>> > > Vino
>> > >
>> > > On Wed, Jan 6, 2021 at 1:37 PM vino yang <yanghua1...@gmail.com> wrote:
>> > >
>> > > > Hi Danny,
>> > > >
>> > > > You should have cwiki edit permission now.
>> > > > Any problems let me know.
>> > > >
>> > > > Best,
>> > > > Vino
>> > > >
>> > > > On Wed, Jan 6, 2021 at 12:05 PM Danny Chan <danny0...@apache.org> wrote:
>> > > >
>> > > >> Sorry ~
>> > > >>
>> > > >> Forgot to say that my Confluence ID is danny0405.
>> > > >>
>> > > >> It would be nice if any of you could help with this.
>> > > >>
>> > > >> Best,
>> > > >> Danny Chan
>> > > >>
>> > > >> On Wed, Jan 6, 2021 at 12:00 PM Danny Chan <danny0...@apache.org> wrote:
>> > > >>
>> > > >> > Hi, can someone give me the CWIKI permission so that I can
>> > > >> > update the design details there (maybe as a new RFC though ~)?
>> > > >> >
>> > > >> > On Tue, Jan 5, 2021 at 2:43 PM wangxianghu <wxhj...@126.com> wrote:
>> > > >> >
>> > > >> >> +1, thanks Danny!
>> > > >> >> I believe this new OperatorCoordinator feature in Flink 1.11 will
>> > > >> >> help improve the current implementation.
>> > > >> >>
>> > > >> >> Best,
>> > > >> >>
>> > > >> >> XianghuWang
>> > > >> >>
>> > > >> >> At 2021-01-05 14:17:37, "vino yang" <yanghua1...@gmail.com>
>> > > >> >> wrote:
>> > > >> >> >Hi,
>> > > >> >> >
>> > > >> >> >Sharing more details: the OperatorCoordinator is part of the
>> > > >> >> >new Data Source API (Beta) described in the Flink 1.11 release
>> > > >> >> >notes [1].
>> > > >> >> >
>> > > >> >> >Flink 1.11 was released only about half a year ago. The design of
>> > > >> >> >RFC-13 began at the end of 2019, and most of the implementation
>> > > >> >> >was completed when Flink 1.11 was released.
>> > > >> >> >
>> > > >> >> >I believe that the production environments of many large
>> > > >> >> >companies have not been upgraded that quickly (as far as our
>> > > >> >> >company is concerned, we still have some jobs running on Flink
>> > > >> >> >release packages below 1.9).
>> > > >> >> >
>> > > >> >> >So, maybe we need to find a mechanism that benefits both new
>> > > >> >> >and old users.
>> > > >> >> >
>> > > >> >> >[1]:
>> > > >> >> >https://flink.apache.org/news/2020/07/06/release-1.11.0.html#new-data-source-api-beta
>> > > >> >> >
>> > > >> >> >Best,
>> > > >> >> >Vino
>> > > >> >> >
>> > > >> >> >On Tue, Jan 5, 2021 at 12:30 PM vino yang <yanghua1...@gmail.com> wrote:
>> > > >> >> >
>> > > >> >> >> Hi,
>> > > >> >> >>
>> > > >> >> >> +1, thank you Danny for introducing this new Flink feature
>> > > >> >> >> (OperatorCoordinator) [1] from the latest release.
>> > > >> >> >> This feature is very helpful for improving the implementation
>> > > >> >> >> mechanism of the Flink write client.
>> > > >> >> >>
>> > > >> >> >> But this feature is only available since Flink 1.11. Before
>> > > >> >> >> that, there was no good way to realize upstream/downstream task
>> > > >> >> >> coordination through the public API provided by Flink.
>> > > >> >> >> I just have one concern: whether we need to take the users of
>> > > >> >> >> earlier versions (lower than Flink 1.11) into account.
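>> > > >> >> >>
>> > > >> >> >> To make that concrete, here is a purely conceptual sketch in
>> > > >> >> >> plain Java (a hypothetical illustration only, not the Flink
>> > > >> >> >> OperatorCoordinator API and not necessarily Hudi's actual
>> > > >> >> >> design): the parallel write tasks report when they have flushed
>> > > >> >> >> their data for an instant, and a single coordinator commits the
>> > > >> >> >> instant only after every task has reported.
>> > > >> >> >>
>> > > >> >> >> import java.util.concurrent.CountDownLatch;
>> > > >> >> >>
>> > > >> >> >> // Conceptual only: OperatorCoordinator lets Flink host this kind of
>> > > >> >> >> // role on the JobManager; before 1.11 there was no public hook for it.
>> > > >> >> >> class InstantCoordinatorSketch {
>> > > >> >> >>   private final CountDownLatch pendingTasks;
>> > > >> >> >>   private final String instantTime;
>> > > >> >> >>
>> > > >> >> >>   InstantCoordinatorSketch(String instantTime, int writeTaskCount) {
>> > > >> >> >>     this.instantTime = instantTime;
>> > > >> >> >>     this.pendingTasks = new CountDownLatch(writeTaskCount);
>> > > >> >> >>   }
>> > > >> >> >>
>> > > >> >> >>   // Called by each write task once it has flushed its data for this instant.
>> > > >> >> >>   void onWriteTaskFinished(int taskId) {
>> > > >> >> >>     pendingTasks.countDown();
>> > > >> >> >>   }
>> > > >> >> >>
>> > > >> >> >>   // Blocks until every task has reported, then "commits" the instant.
>> > > >> >> >>   void awaitAndCommit() throws InterruptedException {
>> > > >> >> >>     pendingTasks.await();
>> > > >> >> >>     System.out.println("commit instant " + instantTime);
>> > > >> >> >>   }
>> > > >> >> >> }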
>> > > >> >> >>
>> > > >> >> >> [1]:
>> > > >> >> >> https://issues.apache.org/jira/browse/FLINK-15099
>> > > >> >> >>
>> > > >> >> >> Best,
>> > > >> >> >> Vino
>> > > >> >> >>
>> > > >> >> >> On Tue, Jan 5, 2021 at 10:40 AM Gary Li <garyli1...@outlook.com> wrote:
>> > > >> >> >>
>> > > >> >> >>> Hi Danny,
>> > > >> >> >>>
>> > > >> >> >>> Thanks for the proposal. I'd recommend starting a new RFC.
>> > > >> >> >>> RFC-13 is done, and it included some refactoring work, so we
>> > > >> >> >>> should mark it as completed. Looking forward to having further
>> > > >> >> >>> discussion on the RFC.
>> > > >> >> >>>
>> > > >> >> >>> Best,
>> > > >> >> >>> Gary Li
>> > > >> >> >>> ________________________________
>> > > >> >> >>> From: Danny Chan <danny0...@apache.org>
>> > > >> >> >>> Sent: Tuesday, January 5, 2021 10:22 AM
>> > > >> >> >>> To: dev@hudi.apache.org <dev@hudi.apache.org>
>> > > >> >> >>> Subject: Re: [DISCUSS] New Flink Writer Proposal
>> > > >> >> >>>
>> > > >> >> >>> Sure, I can update the RFC-13 cwiki if you agree with that.
>> > > >> >> >>>
>> > > >> >> >>> On Tue, Jan 5, 2021 at 2:58 AM Vinoth Chandar <vin...@apache.org> wrote:
>> > > >> >> >>>
>> > > >> >> >>> > Overall +1 on the idea.
>> > > >> >> >>> >
>> > > >> >> >>> > Danny, could we move this to the apache cwiki if you don't
>> > > mind?
>> > > >> >> >>> > That's what we have been using for other RFC discussions.
>> > > >> >> >>> >
>> > > >> >> >>> > On Mon, Jan 4, 2021 at 1:22 AM Danny Chan <
>> > > danny0...@apache.org>
>> > > >> >> wrote:
>> > > >> >> >>> >
>> > > >> >> >>> > > The RFC-13 Flink writer has some bottlenecks that make it
>> > > >> >> >>> > > hard to adapt to production:
>> > > >> >> >>> > >
>> > > >> >> >>> > > - The InstantGeneratorOperator has parallelism 1, which is
>> > > >> >> >>> > > a limit for high-throughput consumption; because all the
>> > > >> >> >>> > > split inputs drain into a single thread, the network IO
>> > > >> >> >>> > > comes under pressure too.
>> > > >> >> >>> > > - The WriteProcessOperator handles inputs by partition;
>> > > >> >> >>> > > that means, within each partition write process, the
>> > > >> >> >>> > > buckets are written one by one, so the file IO is too
>> > > >> >> >>> > > limited to adapt to high-throughput inputs.
>> > > >> >> >>> > > - It buffers the data per checkpoint, which is too hard to
>> > > >> >> >>> > > make robust for production; the checkpoint function is
>> > > >> >> >>> > > blocking and should not perform IO operations.
>> > > >> >> >>> > > - The FlinkHoodieIndex is only valid within a per-job
>> > > >> >> >>> > > scope; it does not work for existing bootstrapped data or
>> > > >> >> >>> > > across different Flink jobs.
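>> > > >> >> >>> > >
>> > > >> >> >>> > > For the first two bullets above, a minimal, hypothetical
>> > > >> >> >>> > > Flink sketch of that pipeline shape (string stand-ins only,
>> > > >> >> >>> > > not the real Hudi operators) may help: every record funnels
>> > > >> >> >>> > > through a single parallelism-1 task before the
>> > > >> >> >>> > > per-partition write step.
>> > > >> >> >>> > >
>> > > >> >> >>> > > import org.apache.flink.api.common.functions.MapFunction;
>> > > >> >> >>> > > import org.apache.flink.api.java.functions.KeySelector;
>> > > >> >> >>> > > import org.apache.flink.streaming.api.datastream.DataStream;
>> > > >> >> >>> > > import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
>> > > >> >> >>> > >
>> > > >> >> >>> > > public class PipelineShapeSketch {
>> > > >> >> >>> > >   public static void main(String[] args) throws Exception {
>> > > >> >> >>> > >     StreamExecutionEnvironment env =
>> > > >> >> >>> > >         StreamExecutionEnvironment.getExecutionEnvironment();
>> > > >> >> >>> > >     // Stand-in records of the form "partitionPath,recordKey".
>> > > >> >> >>> > >     DataStream<String> records = env.fromElements("p0,k1", "p1,k2", "p0,k3");
>> > > >> >> >>> > >
>> > > >> >> >>> > >     records
>> > > >> >> >>> > >         // Stand-in for the InstantGeneratorOperator step: parallelism 1,
>> > > >> >> >>> > >         // so all split inputs drain into one task and its network input.
>> > > >> >> >>> > >         .map(new MapFunction<String, String>() {
>> > > >> >> >>> > >           @Override
>> > > >> >> >>> > >           public String map(String value) {
>> > > >> >> >>> > >             return value; // the real operator coordinates the write instant
>> > > >> >> >>> > >           }
>> > > >> >> >>> > >         })
>> > > >> >> >>> > >         .setParallelism(1)
>> > > >> >> >>> > >         // Stand-in for the WriteProcessOperator step: records grouped by
>> > > >> >> >>> > >         // partition path, buckets of a partition written one by one.
>> > > >> >> >>> > >         .keyBy(new KeySelector<String, String>() {
>> > > >> >> >>> > >           @Override
>> > > >> >> >>> > >           public String getKey(String value) {
>> > > >> >> >>> > >             return value.split(",")[0];
>> > > >> >> >>> > >           }
>> > > >> >> >>> > >         })
>> > > >> >> >>> > >         .map(new MapFunction<String, String>() {
>> > > >> >> >>> > >           @Override
>> > > >> >> >>> > >           public String map(String value) {
>> > > >> >> >>> > >             return "buffered until checkpoint: " + value;
>> > > >> >> >>> > >           }
>> > > >> >> >>> > >         })
>> > > >> >> >>> > >         .print();
>> > > >> >> >>> > >
>> > > >> >> >>> > >     env.execute("pipeline-shape-sketch");
>> > > >> >> >>> > >   }
>> > > >> >> >>> > > }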
>> > > >> >> >>> > >
>> > > >> >> >>> > > Thus, here I propose a new design for the Flink writer to
>> > > >> >> >>> > > solve these problems [1]. Overall, the new design tries to
>> > > >> >> >>> > > remove the single-parallelism operators and make the index
>> > > >> >> >>> > > more powerful and scalable.
>> > > >> >> >>> > >
>> > > >> >> >>> > > I plan to solve these bottlenecks incrementally (4 steps);
>> > > >> >> >>> > > there are already some local POCs for these proposals.
>> > > >> >> >>> > >
>> > > >> >> >>> > > I'm looking forward to your feedback. Any suggestions are
>> > > >> >> >>> > > appreciated ~
>> > > >> >> >>> > >
>> > > >> >> >>> > > [1]
>> > > >> >> >>> > > https://docs.google.com/document/d/1oOcU0VNwtEtZfTRt3v9z4xNQWY-Hy5beu7a1t5B-75I/edit?usp=sharing
>> > > >> >> >>> > >
>> > > >> >> >>> >
>> > > >> >> >>>
>> > > >> >> >>
>> > > >> >>
>> > > >> >
>> > > >>
>> > > >
>> > >
>> >
>>
>
