Hemant Sakharkar created SPARK-47742:
----------------------------------------

             Summary: Spark Transformation with Multi Case filter can improve 
efficiency
                 Key: SPARK-47742
                 URL: https://issues.apache.org/jira/browse/SPARK-47742
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 4.0.0
            Reporter: Hemant Sakharkar


In feature engineering, we need to process the input data to create the 
features and feature vectors required to train a model. This involves 
multiple Spark transformations (e.g. map, filter). Spark already optimizes 
chains of transformations well thanks to its lazy execution: it combines 
multiple transformations into fewer ones, which reduces the overall 
execution time.
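
As a rough, Spark-free illustration of why lazy execution helps, the 
following plain-Python sketch chains a map and a filter as generators. 
Each record flows through both steps in a single pass, and nothing runs 
until a result is requested. All function and variable names here are 
hypothetical, and this only mimics the idea; it is not how Spark is 
implemented.

```python
# Counters that let us observe when work actually happens.
calls = {"map": 0, "filter": 0}

def lazy_map(fn, records):
    """Yield fn(r) for each record, only when the consumer asks for it."""
    for r in records:
        calls["map"] += 1
        yield fn(r)

def lazy_filter(pred, records):
    """Yield only the records that satisfy pred, one at a time."""
    for r in records:
        calls["filter"] += 1
        if pred(r):
            yield r

def build_pipeline(records):
    """Describe the work (map then filter) without running it."""
    doubled = lazy_map(lambda x: x * 2, records)
    return lazy_filter(lambda x: x % 3 == 0, doubled)

pipeline = build_pipeline(range(10))
work_before_action = dict(calls)  # snapshot: nothing has been computed yet
result = list(pipeline)           # the "action": one pass runs both steps
```

Building the pipeline does no work; only the final `list(...)` call pulls 
records through the map and the filter together.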

I found that we can still improve the execution time in the case of filters. 

*Sample Run Results:*

Records: 50,000,000

5 filters, execution time (t2-t1): 24854 ms

5 filters with Map, execution time (t3-t2): 5212 ms

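In this run, the Map-based version was roughly 4.8 times faster (24854 ms 
vs. 5212 ms). For a complex DAG of Spark transformations, this approach 
could improve execution time several-fold and significantly reduce the 
memory footprint.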
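
One possible reading of the comparison above, sketched in plain Python 
rather than Spark so that it runs without a cluster: applying five separate 
filters scans the input five times, while a single map-style pass 
classifies each record once and routes it to its case. The predicates, the 
`classify` helper and the small data set are assumptions made for 
illustration; they are not the reporter's code.

```python
# Hypothetical cases: case `upper` matches records with upper-10 <= x < upper.
BOUNDS = [10, 20, 30, 40, 50]

def case_predicate(upper):
    """Build the predicate for one case."""
    return lambda x: upper - 10 <= x < upper

def split_with_filters(records):
    """Multi-filter approach: one full scan of the input per case."""
    scans = 0
    out = {}
    for upper in BOUNDS:
        pred = case_predicate(upper)
        out[upper] = [x for x in records if pred(x)]
        scans += 1
    return out, scans

def classify(x):
    """Return the case a record belongs to, or None if no case matches."""
    for upper in BOUNDS:
        if upper - 10 <= x < upper:
            return upper
    return None

def split_with_map(records):
    """Single-pass approach: tag each record once, then group by tag."""
    out = {upper: [] for upper in BOUNDS}
    for x in records:
        case = classify(x)
        if case is not None:
            out[case].append(x)
    return out, 1  # the input is scanned exactly once

records = list(range(60))
by_filters, filter_scans = split_with_filters(records)
by_map, map_scans = split_with_map(records)
```

In Spark terms, the first function loosely corresponds to calling `filter` 
five times on the same data set, and the second to a single `map` that 
emits a (case, record) pair. Whether this accounts for the measured 
difference depends on caching and on the actual predicates used.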

A sample illustration can be found here:

[https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing]

Support for this kind of transformation is needed in Spark Core so that 
more complex transformations can be supported. Some illustrations are 
provided in the document above.



