[
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018664#comment-17018664
]
Vinoth Chandar edited comment on HUDI-538 at 1/18/20 6:05 PM:
--
+1 [~yanghua] , I added a second task for moving classes around based on your
changes..
The core issue we need a solution for, IMO, is the following (if we solve this,
the rest is more or less easy). I will illustrate using Spark, since my
understanding of Flink is somewhat limited atm.
Even for Spark, I would like the writing to be done via either the _RDD_ or
_DataFrame_ route, but the current code converts the DataFrame into RDDs to
perform writes. This has some performance side-effects (surprisingly, :P)
1) If you take a single class like _HoodieWriteClient_, it currently does
something like `hoodieRecordRDD.map().sort()` internally. If we want to
support a Flink DataStream or Spark DataFrame as the underlying object, we need
to define an abstraction like `HoodieExecutionContext` with a common set of
methods: map(T) -> T, sortBy(T) -> T, filter(T) -> T, repartition(int) -> T.
Subclasses like _HoodieSparkRDDExecutionContext_,
_HoodieSparkDataFrameExecutionContext_, and
_HoodieFlinkDataStreamExecutionContext_ would implement them
in engine-specific ways and hand back the transformed T object.
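To make the idea concrete, here is a minimal sketch of what such an abstraction could look like. None of these types exist in Hudi today; the interface name follows the `HoodieExecutionContext` proposal above, and `InMemoryExecutionContext` is a hypothetical stand-in for an engine-specific subclass like _HoodieSparkRDDExecutionContext_:

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Engine-agnostic handle over a distributed collection of records.
// Each transformation returns a new context wrapping the transformed data.
interface HoodieExecutionContext<T> {
    HoodieExecutionContext<T> map(Function<T, T> fn);
    HoodieExecutionContext<T> filter(Predicate<T> fn);
    HoodieExecutionContext<T> sortBy(Comparator<T> cmp);
    HoodieExecutionContext<T> repartition(int numPartitions);
    List<T> collect(); // terminal operation, for illustration only
}

// In-memory stand-in for an engine-specific implementation such as
// HoodieSparkRDDExecutionContext or HoodieFlinkDataStreamExecutionContext,
// which would delegate to RDD/DataStream operators instead of java.util.stream.
final class InMemoryExecutionContext<T> implements HoodieExecutionContext<T> {
    private final List<T> data;

    InMemoryExecutionContext(List<T> data) {
        this.data = data;
    }

    public HoodieExecutionContext<T> map(Function<T, T> fn) {
        return new InMemoryExecutionContext<>(
            data.stream().map(fn).collect(Collectors.toList()));
    }

    public HoodieExecutionContext<T> filter(Predicate<T> fn) {
        return new InMemoryExecutionContext<>(
            data.stream().filter(fn).collect(Collectors.toList()));
    }

    public HoodieExecutionContext<T> sortBy(Comparator<T> cmp) {
        return new InMemoryExecutionContext<>(
            data.stream().sorted(cmp).collect(Collectors.toList()));
    }

    public HoodieExecutionContext<T> repartition(int numPartitions) {
        return this; // partitioning is meaningless in memory; engines override
    }

    public List<T> collect() {
        return data;
    }
}
```

With this in place, `HoodieWriteClient` could express its pipeline once against the interface (e.g. `ctx.map(fn).sortBy(cmp)`), and the engine module chooses the concrete context.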
2) Right now, we work with _HoodieRecord_ as the record-level abstraction,
i.e. we eagerly parse the input into a HoodieKey (String recordKey, String
partitionPath) and a HoodieRecordPayload. The key is needed during indexing, and
the payload is needed to precombine duplicates within a batch (maybe Spark-
specific) and to combine an incoming record with what's stored in the table
during writing. We need a way to do these lazily, by pushing the key-extraction
function through the entire write path.
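A rough sketch of the lazy approach, under the assumptions above: `HoodieKey` mirrors the shape of the real class, but `LazyHoodieRecord` and the key-extractor plumbing are purely illustrative, not existing Hudi APIs:

```java
import java.util.function.Function;

// Mirrors the (recordKey, partitionPath) pair described above.
final class HoodieKey {
    final String recordKey;
    final String partitionPath;

    HoodieKey(String recordKey, String partitionPath) {
        this.recordKey = recordKey;
        this.partitionPath = partitionPath;
    }
}

// Instead of eagerly parsing input into (HoodieKey, payload) up front, carry
// the raw input plus a key-extraction function, and only materialize the key
// when the indexing or write path actually asks for it.
final class LazyHoodieRecord<I> {
    private final I input;
    private final Function<I, HoodieKey> keyExtractor;
    private HoodieKey key; // memoized; computed on first access

    LazyHoodieRecord(I input, Function<I, HoodieKey> keyExtractor) {
        this.input = input;
        this.keyExtractor = keyExtractor;
    }

    HoodieKey getKey() {
        if (key == null) {
            key = keyExtractor.apply(input); // deferred until actually needed
        }
        return key;
    }

    I getInput() {
        return input;
    }
}
```

The same extractor function would be handed to the engine-specific write path, so a DataFrame route could push it down as a column expression rather than forcing a per-record conversion.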
I think we should think deeply about these issues and have concrete approaches
before we embark further; we will hit them either way.
> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
> Issue Type: Wish
> Components: Code Cleanup
> Reporter: vinoyang
> Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. This makes
> integration with other computing engines more difficult. We plan to decouple
> it from Spark. This umbrella issue is used to track that work.
> Some thoughts wrote here:
> https://docs.google.com/document/d/1Q9w_4K6xzGbUrtTS0gAlzNYOmRXjzNUdbbe0q59PX9w/edit?usp=sharing
> The feature branch is {{restructure-hudi-client}}.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)