[jira] [Updated] (SPARK-47742) Spark Transformation with Multi Case filter can improve efficiency
[ https://issues.apache.org/jira/browse/SPARK-47742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemant Sakharkar updated SPARK-47742:
-------------------------------------
Description:

In feature engineering we need to process input data to create the features and feature vectors required to train a model. This involves multiple Spark transformations (e.g. map, filter). Spark already optimizes chains of transformations well thanks to lazy execution: it combines multiple transformations into fewer stages, which helps reduce overall execution time. I found that execution time can still be improved in the case of several filters over the same RDD:

{code:java}
val rddfilter0 = personRdd.filter(t => t.age > 5 && t.age <= 12)
val rddfilter1 = personRdd.filter(t => t.age > 12 && t.age <= 18)
val rddfilter2 = personRdd.filter(t => t.age > 18 && t.age <= 25)
val rddfilter3 = personRdd.filter(t => t.age > 25 && t.age <= 35)
val rddfilter4 = personRdd.filter(t => t.age > 35 && t.age <= 65)
{code}

*Sample Run Results:*
Records: 50,000,000
5 filters, execution time: 24854 ms
5 filters with a single map, execution time: 5212 ms

This can give a multi-fold speedup and significantly reduce the memory footprint of a complex DAG of Spark transformations. A sample illustration can be found at [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing]. Support for such a transformation is needed in Spark Core so that more complex transformations can be expressed; further illustration is provided in the document above.

> Spark Transformation with Multi Case filter can improve efficiency
> ------------------------------------------------------------------
>
>                 Key: SPARK-47742
>                 URL: https://issues.apache.org/jira/browse/SPARK-47742
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: Hemant Sakharkar
>            Priority: Major
>              Labels: performance
>         Attachments: spark_chain_transformation.png

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
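The ticket benchmarks a "5 filter with Map" variant but does not include its code. The idea can be sketched with plain Scala collections standing in for an RDD; `Person` and `bucketOf` below are hypothetical names not taken from the ticket, and the age ranges mirror the five filters above. Five separate `filter` calls traverse the data once each, while a single `flatMap` pass tags every record with its matching bucket:

```scala
// Hypothetical sketch of the "multi case filter" idea, using plain Scala
// collections in place of an RDD. Person and bucketOf are illustrative
// names; the age ranges mirror the five filters in the ticket.
case class Person(name: String, age: Int)

// Map an age to the index of the matching range, if any: one comparison
// chain per record instead of five independent filter predicates.
def bucketOf(age: Int): Option[Int] = age match {
  case a if a > 5  && a <= 12 => Some(0)
  case a if a > 12 && a <= 18 => Some(1)
  case a if a > 18 && a <= 25 => Some(2)
  case a if a > 25 && a <= 35 => Some(3)
  case a if a > 35 && a <= 65 => Some(4)
  case _                      => None
}

val people = Seq(Person("a", 10), Person("b", 16), Person("c", 40), Person("d", 70))

// Five separate filters: the data is traversed once per filter.
val ranges = Seq((5, 12), (12, 18), (18, 25), (25, 35), (35, 65))
val byFilter: Seq[Seq[Person]] =
  ranges.map { case (lo, hi) => people.filter(p => p.age > lo && p.age <= hi) }

// Single pass: tag each record with its bucket index, then group.
val byOnePass: Map[Int, Seq[Person]] =
  people
    .flatMap(p => bucketOf(p.age).map(b => b -> p))
    .groupBy(_._1)
    .map { case (b, tagged) => b -> tagged.map(_._2) }
```

In RDD terms this corresponds to replacing the five `filter` calls with one map-side pass that emits (bucket, record) pairs, so the 50M records are scanned once instead of five times, which is presumably where the reported 24854 ms vs 5212 ms difference comes from.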
[jira] [Updated] (SPARK-47742) Spark Transformation with Multi Case filter can improve efficiency
[ https://issues.apache.org/jira/browse/SPARK-47742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemant Sakharkar updated SPARK-47742:
-------------------------------------
[jira] [Updated] (SPARK-47742) Spark Transformation with Multi Case filter can improve efficiency
[ https://issues.apache.org/jira/browse/SPARK-47742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hemant Sakharkar updated SPARK-47742:
-------------------------------------
Attachment: spark_chain_transformation.png