Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-10-06 Thread Vinoth Chandar
Hi Gary, We can pass the constructed timeline and filesystem view into the IOHandle. I think it makes sense for how Flink does things. Thanks Vinoth On Fri, Sep 24, 2021 at 2:04 AM Gary Li wrote: > Hi Vinoth, > > Currently, each executor of Flink has a timeline server I believe. Do you >

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-24 Thread Gary Li
Hi Vinoth, Currently, each executor of Flink has a timeline server I believe. Do you think we can avoid passing the timeline and filesystem view into the IOHandle? I mean one IOHandle is handling the IO of one filegroup, and it doesn't need to know the timeline and filesystem view of the table,

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-23 Thread Vinoth Chandar
Thanks for the explanation. I get the streaming aspect better now, especially in Flink land. The timeline server and remote file system view are the defaults. Assuming it's an RPC call that takes 10-100 ms to the timeline server, not sure how much room there is for optimization for loading of the file

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-23 Thread Gary Li
Hi Vinoth, IMO the IOHandle should be as lightweight as possible, especially when we want to do streaming and near-real-time updates (possibly real-time in the future?). Constructing the timeline and filesystem view inside the handle is time-consuming. In some cases, some handles only write a few

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-23 Thread Vinoth Chandar
Hi Gary, So in effect you want to pull all the timeline filtering out of the handles and pass a plan, i.e. which file slice to work on, to the handle? That does sound cleaner, but we need to introduce this additional layer. The timeline and filesystem view do live within the table, I believe, today.
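
The idea discussed here, resolving the file slice at the table level and handing only that plan to the handle, can be pictured with a minimal sketch. Everything below is hypothetical and written in Python purely for illustration: `FileSlice`, `FileSystemView`, `IOHandle`, and `plan_handles` are stand-in names, not Hudi's actual classes or signatures.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class FileSlice:
    """Hypothetical description of the slice a handle should work on."""
    file_group_id: str
    base_instant: str


class FileSystemView:
    """Hypothetical table-level view: maps each file group to its latest slice."""

    def __init__(self, slices: Dict[str, FileSlice]) -> None:
        self._slices = slices

    def latest_file_slice(self, file_group_id: str) -> FileSlice:
        return self._slices[file_group_id]


@dataclass
class IOHandle:
    """Hypothetical handle: it holds only its file slice, no timeline or view."""
    file_slice: FileSlice
    written: List[str] = field(default_factory=list)

    def write(self, record: str) -> None:
        self.written.append(record)


def plan_handles(view: FileSystemView, file_group_ids: Iterable[str]) -> List[IOHandle]:
    """Resolve slices once at the table level, then pass each plan to a handle."""
    return [IOHandle(view.latest_file_slice(fg)) for fg in file_group_ids]
```

In this sketch the view lookup happens once per file group in `plan_handles`, so creating a handle costs no more than constructing a small object.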

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-22 Thread Gary Li
Hi Vinoth, Thanks for your response. For HoodieIOHandle, IMO we could define the scope of the Handle during the initialization, so we don't need to care about the timeline and table view when actually writing the data. Is that possible? A HoodieTable could have many Handles writing data at the

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-17 Thread Vinoth Chandar
Hi Gary, Thanks for the detailed response. Let me add my take on it. >>HoodieFlinkMergeOnReadTable.upsert(List) to use the AppendHandle.write(HoodieRecord) directly, I have the same issue on JavaClient, for the Kafka Connect implementation. I have an idea of how we can implement this. Will

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-16 Thread Gary Li
Huge +1. Recently I have been working on making the Flink writer operate in a streaming fashion and found that the List interface limits the streaming power of Flink. By switching from HoodieFlinkMergeOnReadTable.upsert(List) to use the AppendHandle.write(HoodieRecord) directly, the throughput was almost doubled
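
One way to picture the difference between a list-based upsert and writing each record straight to a handle is the sketch below. It is an illustration only, in Python rather than Hudi's Java, and all names (`AppendHandle`, `upsert_batch`, `upsert_streaming`) are hypothetical stand-ins; it shows how many records the caller must hold at once and makes no claim about the throughput figure reported above.

```python
from typing import Iterable, List


class AppendHandle:
    """Hypothetical stand-in for a handle that appends records to one file group."""

    def __init__(self) -> None:
        self.written: List[str] = []

    def write(self, record: str) -> None:
        self.written.append(record)


def upsert_batch(handle: AppendHandle, source: Iterable[str]) -> int:
    """List-based path: materialize the whole batch before the first write.

    Returns the number of records the caller had to buffer.
    """
    buffer = list(source)
    for record in buffer:
        handle.write(record)
    return len(buffer)


def upsert_streaming(handle: AppendHandle, source: Iterable[str]) -> int:
    """Streaming path: forward each record to the handle as it arrives.

    Returns the peak number of records held by the caller (at most one).
    """
    peak = 0
    for record in source:
        handle.write(record)
        peak = max(peak, 1)
    return peak
```

Both paths end up writing the same records; the difference is that the streaming path never needs the full batch in memory before the first write.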

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-15 Thread Raymond Xu
+1, that's a great improvement. On Wed, Sep 15, 2021 at 10:40 AM Sivabalan wrote: > ++1. This definitely helps Hudi scale and makes it more maintainable. Thanks > for driving this effort. Mostly devs show interest in major features and > don't like to spend time in such foundational work. But as the

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-15 Thread Sivabalan
++1. This definitely helps Hudi scale and makes it more maintainable. Thanks for driving this effort. Mostly devs show interest in major features and don't like to spend time in such foundational work. But as the project scales, this foundational work will have higher returns in the long run. On

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-15 Thread Vinoth Chandar
Another +1, the HoodieData abstraction will go a long way in reducing LoC. Happy to work with you to see this through! I really encourage top contributors to the Flink and Java clients as well to actively review all PRs, given there are subtle differences everywhere. This will help us smoothly

Re: [DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-15 Thread vino yang
Hi Ethan, Big +1 for the proposal. Actually, we have discussed this topic before [1]. Will review your refactor PR later. Best, Vino [1]: https://lists.apache.org/thread.html/r71d96d285c735b1611920fb3e7224c9ce6fd53d09bf0e8f144f4fcbd%40%3Cdev.hudi.apache.org%3E Y Ethan Guo, on Wed, Sep 15, 2021

[DISCUSS] Refactor hudi-client module for better support of multiple engines

2021-09-15 Thread Y Ethan Guo
Hi all, The hudi-client module contains core Hudi abstractions and client logic for different engines like Spark, Flink, and Java. While a previous effort (HUDI-538 [1]) has decoupled the integration with Spark, there is considerable code duplication across different engines for almost the same logic due to