vinoyang created HUDI-613:
-----------------------------

             Summary: Refactor and enhance the Transformer component
                 Key: HUDI-613
                 URL: https://issues.apache.org/jira/browse/HUDI-613
             Project: Apache Hudi (incubating)
          Issue Type: Bug
            Reporter: vinoyang


Hudi currently has a component that is not widely used: Transformer. Before 
raw data lands in the data lake, it very commonly goes through preprocessing 
and ETL. This is also one of the most common use cases for computing engines 
such as Flink and Spark. Since Hudi already builds on a computing engine, it 
can naturally reuse that engine's data preprocessing capabilities. We can 
refactor the Transformer to make it more flexible. In summary, the refactoring 
covers the following aspects:

* Decouple Transformer from Spark
* Enrich the Transformer and provide built-in Transformers
* Support Transformer-chain

For the first point, the Transformer interface is tightly coupled with Spark in 
design, and it contains a Spark-specific context. This makes it impossible for 
us to take advantage of the transform capabilities provided by other engines 
(such as Flink) after supporting multiple engines. Therefore, we need to 
decouple it from Spark in design.
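
As a rough illustration of the first point, the sketch below shows one way an 
engine-neutral contract could look. It is not the existing Hudi interface: the 
names EngineTransformer, LocalContext and UpperCaseTransformer are hypothetical, 
and a plain Java list stands in for an engine dataset so the example runs 
without Spark or Flink.

```java
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

// Hypothetical engine-neutral contract (not the current Hudi API).
// C is the engine's context type, D is the engine's dataset type.
interface EngineTransformer<C, D> {
    D apply(C context, D input, Properties props);
}

// Stand-in "engine" context for this sketch; a Spark or Flink binding
// would supply its own context type here instead.
final class LocalContext {
    final String appName;

    LocalContext(String appName) {
        this.appName = appName;
    }
}

// Example transformer bound to the stand-in engine: upper-cases every record.
final class UpperCaseTransformer implements EngineTransformer<LocalContext, List<String>> {
    @Override
    public List<String> apply(LocalContext context, List<String> input, Properties props) {
        return input.stream().map(String::toUpperCase).collect(Collectors.toList());
    }
}
```

With the context and dataset types as type parameters, the interface itself no 
longer refers to any Spark class; each engine module would only fix C and D.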

For the second point, we can enhance the Transformer and provide some 
out-of-the-box Transformers, such as FilterTransformer, FlatMapTransformer, and 
so on.
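
To make the second point more concrete, here is a minimal sketch of what two 
such built-in Transformers might look like. Only the names FilterTransformer 
and FlatMapTransformer come from this issue; the RecordTransformer interface, 
the constructors and the use of java.util.List as the dataset type are 
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical minimal contract, with a plain list standing in for a dataset.
interface RecordTransformer<I, O> {
    List<O> apply(List<I> input);
}

// Keeps only the records that satisfy the predicate.
final class FilterTransformer<T> implements RecordTransformer<T, T> {
    private final Predicate<T> predicate;

    FilterTransformer(Predicate<T> predicate) {
        this.predicate = predicate;
    }

    @Override
    public List<T> apply(List<T> input) {
        List<T> out = new ArrayList<>();
        for (T record : input) {
            if (predicate.test(record)) {
                out.add(record);
            }
        }
        return out;
    }
}

// Expands each record into zero or more output records.
final class FlatMapTransformer<I, O> implements RecordTransformer<I, O> {
    private final Function<I, List<O>> fn;

    FlatMapTransformer(Function<I, List<O>> fn) {
        this.fn = fn;
    }

    @Override
    public List<O> apply(List<I> input) {
        List<O> out = new ArrayList<>();
        for (I record : input) {
            out.addAll(fn.apply(record));
        }
        return out;
    }
}
```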

For the third point, the most common pattern for data processing is the 
pipeline model, which is commonly implemented with the chain-of-responsibility 
pattern (compare Apache Commons Chain[1]). Combining multiple Transformers 
makes data processing more flexible and extensible.
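
A Transformer-chain could be sketched roughly as follows. TransformerChain is a 
hypothetical name, and each stage is modeled as a java.util.function.UnaryOperator 
over some dataset type D purely for illustration. Unlike a full 
chain-of-responsibility implementation, this sketch always runs every stage in 
order and does not model early termination.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical chain: applies each stage to the output of the previous one.
final class TransformerChain<D> implements UnaryOperator<D> {
    private final List<UnaryOperator<D>> stages;

    @SafeVarargs
    TransformerChain(UnaryOperator<D>... stages) {
        // Copy the stages so later changes to the array do not affect the chain.
        this.stages = new ArrayList<>(Arrays.asList(stages));
    }

    @Override
    public D apply(D input) {
        D current = input;
        for (UnaryOperator<D> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }
}
```

Because the chain implements the same functional interface as its stages, a 
chain can itself be used as a stage of another chain.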

If we enhance the capabilities of Transformer components, Hudi will provide 
richer data processing capabilities based on the computing engine.

The relevant discussion thread is here: 
https://lists.apache.org/thread.html/rfad2e71fc432922ca567432b7b6e1dd9c3bb102822177b73dbff2d90%40%3Cdev.hudi.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
