[jira] [Resolved] (SPARK-47743) Use milliseconds as the time unit in loggings
[ https://issues.apache.org/jira/browse/SPARK-47743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-47743. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45903 [https://github.com/apache/spark/pull/45903] > Use milliseconds as the time unit in loggings > - > > Key: SPARK-47743 > URL: https://issues.apache.org/jira/browse/SPARK-47743 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47094) SPJ : Dynamically rebalance number of buckets when they are not equal
[ https://issues.apache.org/jira/browse/SPARK-47094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-47094. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45267 [https://github.com/apache/spark/pull/45267] > SPJ : Dynamically rebalance number of buckets when they are not equal > - > > Key: SPARK-47094 > URL: https://issues.apache.org/jira/browse/SPARK-47094 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0, 3.4.0 >Reporter: Himadri Pal >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > SPJ: Storage Partition Join works with Iceberg tables when both tables > have the same number of buckets. As part of this feature request, we would like > Spark to gather the number-of-buckets information from both tables and > dynamically rebalance the number of buckets by coalescing or repartitioning, so > that SPJ will still work. In this case we would still have to shuffle, but it > would be better than no SPJ at all. > Use Case: > Many times we do not have control over the input tables, hence it's not > possible to change the partitioning scheme on those tables. As consumers, we > would still like them to be usable with SPJ alongside other tables and > output tables which have a different number of buckets. > In this scenario, we would need to read those tables and rewrite them with a > matching number of buckets for SPJ to work; this extra step could > outweigh the benefits of less shuffle via SPJ. Also, when multiple > different tables are joined, each table needs to be rewritten with a matching > number of buckets. > If this feature is implemented, SPJ functionality will be more powerful.
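As a rough illustration of why bucket counts can sometimes be reconciled without a full shuffle: with modulo-style hash bucketing, when one side's bucket count divides the other's, the larger side's buckets can simply be regrouped. This is a plain-Python sketch of that arithmetic under an assumed `key % n` bucketing function, not Spark or Iceberg code, and not necessarily how the linked PR implements it:

```python
# Why coalescing bucket counts works when m divides n:
# with bucket(key) = hash(key) % n, and m | n, we have
#   (hash(key) % n) % m == hash(key) % m,
# so the n-bucket side can be regrouped into m buckets
# without rehashing any row.

def bucket(key: int, n: int) -> int:
    return key % n  # stand-in for hash(key) % n

n, m = 8, 4  # assumed bucket counts of the two tables; m divides n
for k in range(1000):
    # regrouping the n-bucket side agrees with bucketing directly into m
    assert bucket(k, n) % m == bucket(k, m)
```

When the counts are not divisible like this, one side must be repartitioned, which is the shuffle the issue accepts as the price of still getting a partitioned join.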
[jira] [Updated] (SPARK-47744) Add support for negative byte types in RocksDB range scan encoder
[ https://issues.apache.org/jira/browse/SPARK-47744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47744: --- Labels: pull-request-available (was: ) > Add support for negative byte types in RocksDB range scan encoder > - > > Key: SPARK-47744 > URL: https://issues.apache.org/jira/browse/SPARK-47744 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 4.0.0 >Reporter: Neil Ramaswamy >Priority: Major > Labels: pull-request-available > > SPARK-47372 introduced the ability for the RocksDB state provider to > big-endian encode values so that they could be range scanned. However, signed > support for Byte types [was > missed|https://github.com/apache/spark/blob/1efbf43160aa4e36710a4668f05fe61534f49648/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala#L326], > even though Scala Bytes are > [signed|https://www.scala-lang.org/api/2.13.5/scala/Byte.html]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
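The usual fix for range-scanning signed values in a bytewise-ordered store is to flip the sign bit before writing the big-endian byte, so that unsigned lexicographic comparison of the encoded bytes matches signed numeric order. The following is a minimal Python sketch of that encoding trick, not the actual Spark patch:

```python
# Sign-bit flip: map signed byte range [-128, 127] onto unsigned [0, 255]
# so that unsigned lexicographic order of the encoded bytes equals
# signed numeric order of the original values.

def encode_signed_byte(v: int) -> bytes:
    assert -128 <= v <= 127
    # (v & 0xFF) gives the two's-complement byte; XOR 0x80 flips the sign bit
    return bytes([(v & 0xFF) ^ 0x80])

vals = [-128, -42, -1, 0, 1, 42, 127]
encoded = [encode_signed_byte(v) for v in vals]
# encoded byte order now matches numeric order
assert encoded == sorted(encoded)
```

Without the flip, two's-complement bytes sort negative values *after* positive ones (0xFF for -1 compares greater than 0x01 for 1), which is exactly the class of bug the issue describes.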
[jira] [Created] (SPARK-47746) Use column ordinals instead of prefix ordering columns in the range scan encoder
Neil Ramaswamy created SPARK-47746: -- Summary: Use column ordinals instead of prefix ordering columns in the range scan encoder Key: SPARK-47746 URL: https://issues.apache.org/jira/browse/SPARK-47746 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Neil Ramaswamy Currently, the State V2 implementations do projections in their state managers, and then provide some prefix (ordering) columns to the RocksDBStateEncoder. However, we can avoid doing extra projection by just reading the ordinals we need, in the order we need, in the state encoder.
[jira] [Updated] (SPARK-47581) SQL catalyst: Migrate logWarn with variables to structured logging framework
[ https://issues.apache.org/jira/browse/SPARK-47581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47581: --- Labels: pull-request-available (was: ) > SQL catalyst: Migrate logWarn with variables to structured logging framework > > > Key: SPARK-47581 > URL: https://issues.apache.org/jira/browse/SPARK-47581 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47745) Add License to Spark Operator
Zhou JIANG created SPARK-47745: -- Summary: Add License to Spark Operator Key: SPARK-47745 URL: https://issues.apache.org/jira/browse/SPARK-47745 Project: Spark Issue Type: Sub-task Components: Kubernetes Affects Versions: 4.0.0 Reporter: Zhou JIANG Add license to the recently established operator repository.
[jira] [Resolved] (SPARK-47582) SQL catalyst: Migrate logInfo with variables to structured logging framework
[ https://issues.apache.org/jira/browse/SPARK-47582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-47582. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45866 [https://github.com/apache/spark/pull/45866] > SQL catalyst: Migrate logInfo with variables to structured logging framework > > > Key: SPARK-47582 > URL: https://issues.apache.org/jira/browse/SPARK-47582 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Assignee: Daniel >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47744) Add support for negative byte types in RocksDB range scan encoder
Neil Ramaswamy created SPARK-47744: -- Summary: Add support for negative byte types in RocksDB range scan encoder Key: SPARK-47744 URL: https://issues.apache.org/jira/browse/SPARK-47744 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 4.0.0 Reporter: Neil Ramaswamy SPARK-47372 introduced the ability for the RocksDB state provider to big-endian encode values so that they could be range scanned. However, signed support for Byte types [was missed|https://github.com/apache/spark/blob/1efbf43160aa4e36710a4668f05fe61534f49648/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala#L326], even though Scala Bytes are [signed|https://www.scala-lang.org/api/2.13.5/scala/Byte.html].
[jira] [Updated] (SPARK-45445) Upgrade snappy to 1.1.10.5
[ https://issues.apache.org/jira/browse/SPARK-45445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45445: -- Fix Version/s: 3.4.3 > Upgrade snappy to 1.1.10.5 > -- > > Key: SPARK-45445 > URL: https://issues.apache.org/jira/browse/SPARK-45445 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47743) Use milliseconds as the time unit in loggings
[ https://issues.apache.org/jira/browse/SPARK-47743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47743: --- Labels: pull-request-available (was: ) > Use milliseconds as the time unit in loggings > - > > Key: SPARK-47743 > URL: https://issues.apache.org/jira/browse/SPARK-47743 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47743) Use milliseconds as the time unit in loggings
Gengliang Wang created SPARK-47743: -- Summary: Use milliseconds as the time unit in loggings Key: SPARK-47743 URL: https://issues.apache.org/jira/browse/SPARK-47743 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Gengliang Wang Assignee: Gengliang Wang
[jira] [Updated] (SPARK-45445) Upgrade snappy to 1.1.10.5
[ https://issues.apache.org/jira/browse/SPARK-45445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45445: -- Fix Version/s: 3.5.2 > Upgrade snappy to 1.1.10.5 > -- > > Key: SPARK-45445 > URL: https://issues.apache.org/jira/browse/SPARK-45445 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46411) Change to use bcprov/bcpkix-jdk18on for test
[ https://issues.apache.org/jira/browse/SPARK-46411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46411: -- Fix Version/s: 3.4.3 > Change to use bcprov/bcpkix-jdk18on for test > > > Key: SPARK-46411 > URL: https://issues.apache.org/jira/browse/SPARK-46411 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.3 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-47719) Change default of spark.sql.legacy.timeParserPolicy from EXCEPTION to CORRECTED
[ https://issues.apache.org/jira/browse/SPARK-47719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-47719. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45859 [https://github.com/apache/spark/pull/45859] > Change default of spark.sql.legacy.timeParserPolicy from EXCEPTION to > CORRECTED > --- > > Key: SPARK-47719 > URL: https://issues.apache.org/jira/browse/SPARK-47719 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Assignee: Serge Rielau >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > spark.sql.legacy.timeParserPolicy was introduced in Spark 3.0 and has been > set to EXCEPTION. > Changing it from EXCEPTION to CORRECTED for Spark 4.0 will reduce errors and > reflects a prudent timeframe.
[jira] [Assigned] (SPARK-47719) Change default of spark.sql.legacy.timeParserPolicy from EXCEPTION to CORRECTED
[ https://issues.apache.org/jira/browse/SPARK-47719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-47719: -- Assignee: Serge Rielau > Change default of spark.sql.legacy.timeParserPolicy from EXCEPTION to > CORRECTED > --- > > Key: SPARK-47719 > URL: https://issues.apache.org/jira/browse/SPARK-47719 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Serge Rielau >Assignee: Serge Rielau >Priority: Major > Labels: pull-request-available > > spark.sql.legacy.timeParserPolicy was introduced in Spark 3.0 and has been > set to EXCEPTION. > Changing it from EXCEPTION to CORRECTED for Spark 4.0 will reduce errors and > reflects a prudent timeframe.
[jira] [Updated] (SPARK-47742) Spark Transformation with Multi Case filter can improve efficiency
[ https://issues.apache.org/jira/browse/SPARK-47742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hemant Sakharkar updated SPARK-47742: - Description: In Feature Engineering we need to process the input data to create feature and feature vectors which are required to train the model. For which we need to do multiple spark transformations (etc:map, filter etc) the spark has very good optimization for multiple transformations due to its lazy execution. It combines multiple transformations into fewer transformations which helps to optimize the overall execution time. I found that we can still improve the execution time in the case of filters. {code:java} val rddfilter0 = personRdd.filter(t => t.age>5 && t.age<=12) val rddfilter1 = personRdd.filter(t => t.age>12 && t.age<=18) val rddfilter2 = personRdd.filter(t => t.age>18 && t.age<=25) val rddfilter3 = personRdd.filter(t => t.age>25 && t.age<=35) val rddfilter4 = personRdd.filter(t => t.age>35 && t.age<=65) {code} *Sample Run Results:* Records :50,000,000 5 filter Execution Time: 24854 milli sec 5 filter with Map Execution Time: 5212 milli sec We can very well improve multiple X times and reduce significant memory footprint for a complex DAG of Spark Transformation. Sample illustration can be found here [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing] Need support of such transformation in Spark Core so that more complex transformation can be supported. Some illustration is provided in above document. was: In Feature Engineering we need to process the input data to create feature and feature vectors which are required to train the model. For which we need to do multiple spark transformations (etc:map, filter etc) the spark has very good optimization for multiple transformations due to its lazy execution. It combines multiple transformations into fewer transformations which helps to optimize the overall execution time. 
I found that we can still improve the execution time in the case of filters. *Sample Run Results:* Records :50,000,000 5 filter Execution Time: 24854 milli sec 5 filter with Map Execution Time: 5212 milli sec We can very well improve multiple X times and reduce significant memory footprint for a complex DAG of Spark Transformation. Sample illustration can be found here [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing] Need support of such transformation in Spark Core so that more complex transformation can be supported. Some illustration is provided in above document. > Spark Transformation with Multi Case filter can improve efficiency > -- > > Key: SPARK-47742 > URL: https://issues.apache.org/jira/browse/SPARK-47742 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Hemant Sakharkar >Priority: Major > Labels: performance > Attachments: spark_chain_transformation.png > > > In Feature Engineering we need to process the input data to create feature > and feature vectors which are required to train the model. For which we need > to do multiple spark transformations (etc:map, filter etc) the spark has very > good optimization for multiple transformations due to its lazy execution. It > combines multiple transformations into fewer transformations which helps to > optimize the overall execution time. > I found that we can still improve the execution time in the case of filters. 
> > {code:java} > val rddfilter0 = personRdd.filter(t => t.age>5 && t.age<=12) > val rddfilter1 = personRdd.filter(t => t.age>12 && t.age<=18) > val rddfilter2 = personRdd.filter(t => t.age>18 && t.age<=25) > val rddfilter3 = personRdd.filter(t => t.age>25 && t.age<=35) > val rddfilter4 = personRdd.filter(t => t.age>35 && t.age<=65) {code} > *Sample Run Results:* > Records :50,000,000 > 5 filter Execution Time: 24854 milli sec > 5 filter with Map Execution Time: 5212 milli sec > We can very well improve multiple X times and reduce significant memory > footprint for a complex DAG of Spark Transformation. > Sample illustration can be found here > [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing] > Need support of such transformation in Spark Core so that more complex > transformation can be supported. Some illustration is provided in above > document. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
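The single-pass idea behind this proposal can be sketched in plain Python (a hypothetical, single-machine stand-in for the RDD code above, not an actual Spark API): classify each record once with one function instead of filtering the same data five times.

```python
# Five separate filters traverse the data five times; a single classifying
# "map" traverses it once and routes each record to its bucket.

BOUNDS = [(5, 12), (12, 18), (18, 25), (25, 35), (35, 65)]

def age_bucket(age):
    # return the index of the (lo, hi] range containing age, or None
    for i, (lo, hi) in enumerate(BOUNDS):
        if lo < age <= hi:
            return i
    return None

ages = [3, 8, 15, 22, 30, 50, 70]

# baseline: five independent "filters" (five passes over the data)
filtered = [[a for a in ages if lo < a <= hi] for lo, hi in BOUNDS]

# proposal: one pass, one classification per record
buckets = [[] for _ in BOUNDS]
for a in ages:
    b = age_bucket(a)
    if b is not None:
        buckets[b].append(a)

assert buckets == filtered  # same partitioning, one traversal
```

On an RDD the equivalent single pass would be a map (or `groupBy`-style key extraction) producing `(bucket, record)` pairs, which is presumably where the reported "5 filter with Map" speedup comes from.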
[jira] [Updated] (SPARK-47741) Handle stack overflow when parsing query as part of Execute immediate
[ https://issues.apache.org/jira/browse/SPARK-47741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47741: --- Labels: pull-request-available (was: ) > Handle stack overflow when parsing query as part of Execute immediate > - > > Key: SPARK-47741 > URL: https://issues.apache.org/jira/browse/SPARK-47741 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.1 >Reporter: Milan Stefanovic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > EXECUTE IMMEDIATE can generate complex queries which can lead to stack overflow. > We need to catch this exception and convert it to a proper AnalysisException with an > error class.
[jira] [Updated] (SPARK-47742) Spark Transformation with Multi Case filter can improve efficiency
[ https://issues.apache.org/jira/browse/SPARK-47742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hemant Sakharkar updated SPARK-47742: - Description: In Feature Engineering we need to process the input data to create feature and feature vectors which are required to train the model. For which we need to do multiple spark transformations (etc:map, filter etc) the spark has very good optimization for multiple transformations due to its lazy execution. It combines multiple transformations into fewer transformations which helps to optimize the overall execution time. I found that we can still improve the execution time in the case of filters. *Sample Run Results:* Records :50,000,000 5 filter Execution Time: 24854 milli sec 5 filter with Map Execution Time: 5212 milli sec We can very well improve multiple X times and reduce significant memory footprint for a complex DAG of Spark Transformation. Sample illustration can be found here [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing] Need support of such transformation in Spark Core so that more complex transformation can be supported. Some illustration is provided in above document. was: In Feature Engineering we need to process the input data to create feature and feature vectors which are required to train the model. For which we need to do multiple spark transformations (etc:map, filter etc) the spark has very good optimization for multiple transformations due to its lazy execution. It combines multiple transformations into fewer transformations which helps to optimize the overall execution time. I found that we can still improve the execution time in the case of filters. *Sample Run Results:* Records :50,000,000 5 filter Execution Time: (t2-t1) 24854 millisec 5 filter with Map Execution Time: (t3-t2) 5212 millisec We can very well improve multiple X times and reduce significant memory footprint for a complex DAG of Spark Transformation. 
Sample illustration can be found here [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing] Need support of such transformation in Spark Core so that more complex transformation can be supported. Some illustration is provided in above document. > Spark Transformation with Multi Case filter can improve efficiency > -- > > Key: SPARK-47742 > URL: https://issues.apache.org/jira/browse/SPARK-47742 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Hemant Sakharkar >Priority: Major > Labels: performance > Attachments: spark_chain_transformation.png > > > In Feature Engineering we need to process the input data to create feature > and feature vectors which are required to train the model. For which we need > to do multiple spark transformations (etc:map, filter etc) the spark has very > good optimization for multiple transformations due to its lazy execution. It > combines multiple transformations into fewer transformations which helps to > optimize the overall execution time. > I found that we can still improve the execution time in the case of filters. > *Sample Run Results:* > Records :50,000,000 > 5 filter Execution Time: 24854 milli sec > 5 filter with Map Execution Time: 5212 milli sec > We can very well improve multiple X times and reduce significant memory > footprint for a complex DAG of Spark Transformation. > Sample illustration can be found here > [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing] > Need support of such transformation in Spark Core so that more complex > transformation can be supported. Some illustration is provided in above > document. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47742) Spark Transformation with Multi Case filter can improve efficiency
[ https://issues.apache.org/jira/browse/SPARK-47742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hemant Sakharkar updated SPARK-47742: - Attachment: spark_chain_transformation.png > Spark Transformation with Multi Case filter can improve efficiency > -- > > Key: SPARK-47742 > URL: https://issues.apache.org/jira/browse/SPARK-47742 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Hemant Sakharkar >Priority: Major > Labels: performance > Attachments: spark_chain_transformation.png > > > In Feature Engineering we need to process the input data to create feature > and feature vectors which are required to train the model. For which we need > to do multiple spark transformations (etc:map, filter etc) the spark has very > good optimization for multiple transformations due to its lazy execution. It > combines multiple transformations into fewer transformations which helps to > optimize the overall execution time. > I found that we can still improve the execution time in the case of filters. > *Sample Run Results:* > Records :50,000,000 > 5 filter Execution Time: (t2-t1) 24854 millisec > 5 filter with Map Execution Time: (t3-t2) 5212 millisec > We can very well improve multiple X times and reduce significant memory > footprint for a complex DAG of Spark Transformation. > Sample illustration can be found here > [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing] > Need support of such transformation in Spark Core so that more complex > transformation can be supported. Some illustration is provided in above > document. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47742) Spark Transformation with Multi Case filter can improve efficiency
Hemant Sakharkar created SPARK-47742: Summary: Spark Transformation with Multi Case filter can improve efficiency Key: SPARK-47742 URL: https://issues.apache.org/jira/browse/SPARK-47742 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 4.0.0 Reporter: Hemant Sakharkar In feature engineering we need to process the input data to create features and feature vectors, which are required to train the model. For this we need to do multiple Spark transformations (e.g. map, filter). Spark has very good optimization for multiple transformations due to its lazy execution: it combines multiple transformations into fewer transformations, which helps to optimize the overall execution time. I found that we can still improve the execution time in the case of filters. *Sample Run Results:* Records: 50,000,000 5 filter Execution Time: (t2-t1) 24854 millisec 5 filter with Map Execution Time: (t3-t2) 5212 millisec We can very well improve multiple times over and reduce the memory footprint significantly for a complex DAG of Spark transformations. A sample illustration can be found here: [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing] We need support for such a transformation in Spark Core so that more complex transformations can be supported. Some illustration is provided in the above document.
[jira] [Updated] (SPARK-47740) Stop JVM by calling SparkSession.stop from PySpark
[ https://issues.apache.org/jira/browse/SPARK-47740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Martynov updated SPARK-47740: --- Description: {code:python} from pyspark.sql import SparkSession spark = SparkSession.builder.config("spark.driver.memory", "2g").config("spark.another.property", "hi").getOrCreate() spark.conf.get("spark.driver.memory") # prints '2g' spark.conf.get("spark.another.property") # prints 'hi' # check for JVM process before and after stop spark.stop() # try to recreate Spark session with different driver memory, and no "spark.another.property" at all spark = SparkSession.builder.config("spark.driver.memory", "3g").getOrCreate() spark.conf.get("spark.driver.memory") # prints '3g' spark.conf.get("spark.another.property") # prints 'hi' # check for JVM process {code} After the first call of {{.getOrCreate()}}, a JVM process with the following options has been started: {code} maxim 76625 10.9 2.7 7975292 447068 pts/8 Sl+ 16:59 0:11 /home/maxim/.sdkman/candidates/java/current/bin/java -cp /home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/conf:/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/jars/* -Xmx2g -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED
-Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=2g --conf spark.another.property=hi pyspark-shell {code} But it has not been stopped by the {{.stop()}} method call. The new Spark session, which reports 3g of driver memory, actually uses the same 2g, as well as config options from the old session, which should have been closed at this point. Summarizing: * {{spark.stop()}} stops only the SparkContext, but not the JVM * {{spark.stop()}} does not clean up the session config This behavior has been observed for a long time, at least since 2.2.0, and is still present in 3.5.1. This could be solved by stopping the JVM after the Spark session is stopped. But just calling {{spark._jvm.System.exit(0)}} will fail, because the py4j socket will be closed and it will be impossible to start a new Spark session: {code:python} spark._jvm.System.exit(0) {code} {code} ERROR:root:Exception while sending command. Traceback (most recent call last): File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py", line 516, in send_command raise Py4JNetworkError("Answer from Java side is empty") py4j.protocol.Py4JNetworkError: Answer from Java side is empty During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py", line 1038, in send_command response = connection.send_command(command) File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py", line 539, in send_command raise Py4JNetworkError( py4j.protocol.Py4JNetworkError: Error while sending or receiving Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py", line 1322, in __call__ return_value = get_return_value( ^ File
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py", line 179, in deco return f(*a, **kw) ^^^ File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/protocol.py", line 334, in get_return_value raise Py4JError( py4j.protocol.Py4JError: An error occurred while calling z:java.lang.System.exit {code} Here is an article describing the same issue, and also providing a safe way to stop JVM without breaking py4j (in Russian): https://habr.com/ru/companies/sberbank/articles/805285/ {code:python} from pyspark.sql import SparkSession from pyspark import SparkContext from threading import RLock from py4j.protocol import CALL_COMMAND_NAME, END_COMMAND_PART from pyspark.sql import SparkSession from pyspark
[jira] [Updated] (SPARK-47740) Stop JVM by calling SparkSession.stop from PySpark
[ https://issues.apache.org/jira/browse/SPARK-47740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Martynov updated SPARK-47740: --- Description:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", "2g").config("spark.another.property", "hi").getOrCreate()
spark.conf.get("spark.driver.memory")
# prints '2g'
spark.conf.get("spark.another.property")
# prints 'hi'

# check for JVM process before and after stop
spark.stop()

# try to recreate Spark session with different driver memory, and no "spark.another.property" at all
spark = SparkSession.builder.config("spark.driver.memory", "3g").getOrCreate()
spark.conf.get("spark.driver.memory")
# prints '3g'
spark.conf.get("spark.another.property")
# prints 'hi'

# check for JVM process
{code}
After the first call of {{.getOrCreate()}}, a JVM process with the following options is started:
{code}
maxim 76625 10.9 2.7 7975292 447068 pts/8 Sl+ 16:59 0:11 /home/maxim/.sdkman/candidates/java/current/bin/java -cp /home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/conf:/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/jars/* -Xmx2g -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=2g --conf spark.another.property=hi pyspark-shell
{code}
But it is not stopped by the {{.stop()}} method call. The new Spark session reports 3 GB of driver memory but actually uses the same 2 GB, and it inherits config options from the old session, which should have been closed by this point.

Summarizing:
* {{spark.stop()}} stops only the SparkContext, but not the JVM
* {{spark.stop()}} does not clean up the session config

This behavior has been observed for a long time, at least since 2.2.0, and is still present in 3.5.1.

This could be solved by stopping the JVM after the Spark session is stopped. But simply calling {{spark._jvm.System.exit(0)}} will fail, because the py4j socket gets closed and it becomes impossible to start a new Spark session:
{code:python}
spark._jvm.System.exit(0)
{code}
{code}
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py", line 516, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py", line 539, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving

Traceback (most recent call last):
  File "", line 1, in
  File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
  File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py", line 179, in deco
    return f(*a, **kw)
  File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/protocol.py", line 334, in get_return_value
    raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling z:java.lang.System.exit
{code}
Here is an article describing the same issue and also providing a safe way to stop the JVM without breaking py4j (in Russian): https://habr.com/ru/companies/sberbank/articles/805285/
{code:python}
from pyspark.sql import SparkSession
from pyspark import SparkContext
from threading import RLock
from py4j.protocol import CALL_COMMAND_NAME, END_COMMAND_PART
from pyspark.sql import SparkSession
from pyspark
{code}
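The snippet quoted from the article is cut off above. The core idea it relies on can be sketched as follows; this is an editor's illustrative reconstruction, not the article's exact code, and the two protocol constants are inlined here as assumptions about py4j's wire format (in a real script they would be imported from {{py4j.protocol}}):

```python
# Sketch of the "fire-and-forget System.exit" trick: build the raw py4j
# command for java.lang.System.exit(status). The constant values below
# mirror what py4j.protocol.CALL_COMMAND_NAME and END_COMMAND_PART are
# assumed to be; they are inlined so this sketch is self-contained.
CALL_COMMAND_NAME = "c\n"
END_COMMAND_PART = "e\n"


def build_exit_command(status: int = 0) -> str:
    """Build the raw py4j command string calling java.lang.System.exit(status).

    Writing this string to the gateway socket *without* reading the answer
    avoids the Py4JNetworkError shown in the traceback above: the JVM dies
    before it can reply, so a normal request/response round-trip can never
    complete.
    """
    return (
        CALL_COMMAND_NAME
        + "z:java.lang.System\n"    # target id: static members of java.lang.System
        + "exit\n"                  # method name
        + "i" + str(status) + "\n"  # single integer argument
        + END_COMMAND_PART
    )
```

In practice one would call {{spark.stop()}} first, send this command over the gateway client's socket, and then shut the gateway down without waiting for a reply (e.g. {{gateway.shutdown(raise_exception=False)}}); the exact attribute names involved are PySpark/py4j internals and may differ between versions.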
[jira] [Updated] (SPARK-46411) Change to use bcprov/bcpkix-jdk18on for test
[ https://issues.apache.org/jira/browse/SPARK-46411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-46411: -- Fix Version/s: 3.5.2 > Change to use bcprov/bcpkix-jdk18on for test > > > Key: SPARK-46411 > URL: https://issues.apache.org/jira/browse/SPARK-46411 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0, 3.5.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45445) Upgrade snappy to 1.1.10.5
[ https://issues.apache.org/jira/browse/SPARK-45445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45445: -- Parent: SPARK-47046 Issue Type: Sub-task (was: Improvement) > Upgrade snappy to 1.1.10.5 > -- > > Key: SPARK-45445 > URL: https://issues.apache.org/jira/browse/SPARK-45445 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45323) Upgrade snappy to 1.1.10.4
[ https://issues.apache.org/jira/browse/SPARK-45323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45323: -- Parent: SPARK-47046 Issue Type: Sub-task (was: Dependency upgrade) > Upgrade snappy to 1.1.10.4 > -- > > Key: SPARK-45323 > URL: https://issues.apache.org/jira/browse/SPARK-45323 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0, 3.5.1 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Security Fix > Fixed SnappyInputStream so as not to allocate too large memory when > decompressing data with an extremely large chunk size by @tunnelshade (code > change) > This does not affect users only using Snappy.compress/uncompress methods -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47740) Stop JVM by calling SparkSession.stop from PySpark
[ https://issues.apache.org/jira/browse/SPARK-47740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834379#comment-17834379 ] Maxim Martynov commented on SPARK-47740: Here is a PR for adding helper function [jvm_shutdown(gateway)|https://github.com/py4j/py4j/pull/541] to Py4J. > Stop JVM by calling SparkSession.stop from PySpark > -- > > Key: SPARK-47740 > URL: https://issues.apache.org/jira/browse/SPARK-47740 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.5.1 >Reporter: Maxim Martynov >Priority: Major >
[jira] [Updated] (SPARK-45172) Upgrade commons-compress.version to 1.24.0
[ https://issues.apache.org/jira/browse/SPARK-45172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45172: -- Parent: SPARK-47046 Issue Type: Sub-task (was: Improvement) > Upgrade commons-compress.version to 1.24.0 > -- > > Key: SPARK-45172 > URL: https://issues.apache.org/jira/browse/SPARK-45172 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44441) Upgrade bcprov-jdk15on and bcpkix-jdk15on to 1.70
[ https://issues.apache.org/jira/browse/SPARK-44441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44441: --- Labels: pull-request-available (was: ) > Upgrade bcprov-jdk15on and bcpkix-jdk15on to 1.70 > - > > Key: SPARK-44441 > URL: https://issues.apache.org/jira/browse/SPARK-44441 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Trivial > Labels: pull-request-available > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44393) Upgrade H2 from 2.1.214 to 2.2.220
[ https://issues.apache.org/jira/browse/SPARK-44393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44393: -- Fix Version/s: 3.4.3 > Upgrade H2 from 2.1.214 to 2.2.220 > -- > > Key: SPARK-44393 > URL: https://issues.apache.org/jira/browse/SPARK-44393 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0, 3.4.3 > > > [CVE-2022-45868|https://nvd.nist.gov/vuln/detail/CVE-2022-45868] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44393) Upgrade H2 from 2.1.214 to 2.2.220
[ https://issues.apache.org/jira/browse/SPARK-44393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44393: --- Labels: pull-request-available (was: ) > Upgrade H2 from 2.1.214 to 2.2.220 > -- > > Key: SPARK-44393 > URL: https://issues.apache.org/jira/browse/SPARK-44393 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Labels: pull-request-available > Fix For: 3.5.0 > > > [CVE-2022-45868|https://nvd.nist.gov/vuln/detail/CVE-2022-45868] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44279) Upgrade optionator to ^0.9.3
[ https://issues.apache.org/jira/browse/SPARK-44279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44279: -- Fix Version/s: 3.5.0 > Upgrade optionator to ^0.9.3 > > > Key: SPARK-44279 > URL: https://issues.apache.org/jira/browse/SPARK-44279 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.1, 3.5.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.0 > > > [Regular Expression Denial of Service (ReDoS) - > CVE-2023-26115|https://github.com/jonschlinkert/word-wrap/issues/32] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47741) Handle stack overflow when parsing query as part of Execute immediate
Milan Stefanovic created SPARK-47741: Summary: Handle stack overflow when parsing query as part of Execute immediate Key: SPARK-47741 URL: https://issues.apache.org/jira/browse/SPARK-47741 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.1 Reporter: Milan Stefanovic Fix For: 4.0.0 EXECUTE IMMEDIATE can generate complex queries, which can lead to a stack overflow during parsing. We need to catch this error and convert it to a proper AnalysisException with an error class. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
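The pattern the ticket proposes (turning an uncontrolled stack overflow inside a recursive parser into a typed analysis error) can be sketched in Python, where the analogous failure is {{RecursionError}}; the class name and error-class string below are illustrative stand-ins, not Spark's actual parser or exception classes:

```python
class AnalysisException(Exception):
    """Hypothetical stand-in for Spark's AnalysisException with an error class."""
    def __init__(self, error_class: str, message: str):
        super().__init__(f"[{error_class}] {message}")
        self.error_class = error_class


def parse_nested(expr: str) -> str:
    """Toy recursive-descent step: peel one pair of parentheses per call.

    On a deeply nested input this recurses past the interpreter's limit,
    which is the Python analogue of the parser stack overflow in the ticket.
    """
    if expr.startswith("(") and expr.endswith(")"):
        return parse_nested(expr[1:-1])
    return expr


def safe_parse(expr: str) -> str:
    try:
        return parse_nested(expr)
    except RecursionError as e:
        # Convert the raw stack overflow into a proper, user-facing
        # analysis error carrying an error class, as the ticket proposes.
        raise AnalysisException(
            "QUERY_TOO_COMPLEX", "query is too deeply nested to parse"
        ) from e
```

A caller then sees one well-defined error type with a stable error class instead of a bare stack-overflow crash.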
[jira] [Updated] (SPARK-47740) Stop JVM by calling SparkSession.stop from PySpark
[ https://issues.apache.org/jira/browse/SPARK-47740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Martynov updated SPARK-47740: --- Description: {code:python} from pyspark.sql import SparkSession spark = SparkSession.builder.config("spark.driver.memory", "2g").config("spark.another.property", "hi").getOrCreate() spark.conf.get("spark.driver.memory") # prints '2g' spark.conf.get("spark.another.property") # prints 'hi' # check for JVM process before and after stop spark.stop() # try to recreate Spark session with different driver memory, and no "spark.another.property" at all spark = SparkSession.builder.config("spark.driver.memory", "3g").getOrCreate() spark.conf.get("spark.driver.memory") # prints '3g' spark.conf.get("spark.another.property") # prints 'hi' # check for JVM process {code} After first call of {{.getOrCreate()}} JVM process with following options have been started: {code} maxim 76625 10.9 2.7 7975292 447068 pts/8 Sl+ 16:59 0:11 /home/maxim/.sdkman/candidates/java/current/bin/java -cp /home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/conf:/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/jars/* -Xmx2g -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED 
-Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=2g --conf spark.another.property=hi pyspark-shell {code} But it have not been stopped by {{.stop()}} method call. New Spark session which reports 3Gb of driver memory actually uses the same 2Gb, and config options from old session which should be closed at this moment. Summarizing: * {{spark.stop()}} stops only SparkContext, but not JVM * {{spark.stop()}} does not clean up session config This behavior has been observed for a long time, at least since 2.2.0, and still present in 3.5.1. This could be solved by stopping JVM after Spark session is stopped. But just calling {{spark._jvm.System.exit(0)}} fill fail because py4j socket will be closed, and it will be impossible to start new Spark session: {code:python} spark._jvm.System.exit(0) {code} {code} ERROR:root:Exception while sending command. Traceback (most recent call last): File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py", line 516, in send_command raise Py4JNetworkError("Answer from Java side is empty") py4j.protocol.Py4JNetworkError: Answer from Java side is empty During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py", line 1038, in send_command response = connection.send_command(command) File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py", line 539, in send_command raise Py4JNetworkError( py4j.protocol.Py4JNetworkError: Error while sending or receiving Traceback (most recent call last): File "", line 1, in File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py", line 1322, in __call__ return_value = get_return_value( ^ File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py", line 179, in deco return f(*a, **kw) ^^^ File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/protocol.py", line 334, in get_return_value raise Py4JError( py4j.protocol.Py4JError: An error occurred while calling z:java.lang.System.exit {code} Here is an article describing the same issue, and also providing a safe way to stop JVM without breaking py4j (in Russian): https://habr.com/ru/companies/sberbank/articles/805285/ {code:python} from pyspark.sql import SparkSession from pyspark import SparkContext from threading import RLock from py4j.protocol import CALL_COMMAND_NAME, END_COMMAND_PART from pyspark.sql import SparkSession from pyspark
[jira] [Updated] (SPARK-47740) Stop JVM by calling SparkSession.stop from PySpark
[ https://issues.apache.org/jira/browse/SPARK-47740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Martynov updated SPARK-47740: --- Description: {code:python} from pyspark.sql import SparkSession spark = SparkSession.builder.config("spark.driver.memory", "2g").config("spark.another.property", "hi").getOrCreate() spark.conf.get("spark.driver.memory") # prints '2g' spark.conf.get("spark.another.property") # prints 'hi' # check for JVM process before and after stop spark.stop() # try to recreate Spark session with different driver memory, and no "spark.another.property" at all spark = SparkSession.builder.config("spark.driver.memory", "3g").getOrCreate() spark.conf.get("spark.driver.memory") # prints '3g' spark.conf.get("spark.another.property") # prints 'hi' # check for JVM process {code} After first call of {{.getOrCreate()}} JVM process with following options have been started: {code} maxim 76625 10.9 2.7 7975292 447068 pts/8 Sl+ 16:59 0:11 /home/maxim/.sdkman/candidates/java/current/bin/java -cp /home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/conf:/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/jars/* -Xmx2g -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED 
-Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=2g --conf spark.another.property=hi pyspark-shell {code} But it have not been stopped by {{.stop()}} method call. New Spark session which reports 3Gb of driver memory actually uses the same 2Gb, and config options from old session which should be closed at this moment. Summarizing: * {{spark.stop()}} stops only SparkContext, but not JVM * {{spark.stop()}} does not clean up session config This behavior has been observed for a long time, at least since 2.2.0, and still present in 3.5.1. This could be solved by stopping JVM after Spark session is stopped. But just calling {{spark._jvm.System.exit(0)}} fill fail because py4j socket will be closed, and it will be impossible to start new Spark session: {code:python} spark._jvm.System.exit(0) {code} {code} ERROR:root:Exception while sending command. Traceback (most recent call last): File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py", line 516, in send_command raise Py4JNetworkError("Answer from Java side is empty") py4j.protocol.Py4JNetworkError: Answer from Java side is empty During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py", line 1038, in send_command response = connection.send_command(command) File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py", line 539, in send_command raise Py4JNetworkError( py4j.protocol.Py4JNetworkError: Error while sending or receiving Traceback (most recent call last): File "", line 1, in File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py", line 1322, in __call__ return_value = get_return_value( ^ File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py", line 179, in deco return f(*a, **kw) ^^^ File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/protocol.py", line 334, in get_return_value raise Py4JError( py4j.protocol.Py4JError: An error occurred while calling z:java.lang.System.exit {code} Here is an article describing the same issue, and also providing a safe way to stop JVM without breaking py4j (in Russian): https://habr.com/ru/companies/sberbank/articles/805285/ {code:python} from pyspark.sql import SparkSession from pyspark import SparkContext from threading import RLock from py4j.protocol import CALL_COMMAND_NAME, END_COMMAND_PART from pyspark.sql import SparkSession from pyspark
[jira] [Updated] (SPARK-47740) Stop JVM by calling SparkSession.stop from PySpark
[ https://issues.apache.org/jira/browse/SPARK-47740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Martynov updated SPARK-47740: --- Description: {code:python} from pyspark.sql import SparkSession spark = SparkSession.builder.config("spark.driver.memory", "2g").config("spark.another.property", "hi").getOrCreate() spark.conf.get("spark.driver.memory") # prints '2g' spark.conf.get("spark.another.property") # prints 'hi' # check for JVM process before and after stop spark.stop() # try to recreate Spark session with different driver memory, and no "spark.another.property" at all spark = SparkSession.builder.config("spark.driver.memory", "3g").getOrCreate() spark.conf.get("spark.driver.memory") # prints '3g' spark.conf.get("spark.another.property") # prints 'hi' # check for JVM process {code} After first call of {{.getOrCreate()}} JVM process with following options have been started: {code} maxim 76625 10.9 2.7 7975292 447068 pts/8 Sl+ 16:59 0:11 /home/maxim/.sdkman/candidates/java/current/bin/java -cp /home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/conf:/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/jars/* -Xmx2g -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED 
-Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=2g --conf spark.another.property=hi pyspark-shell {code} But it has not been stopped by the {{.stop()}} method call. The new Spark session, which reports 3 GB of driver memory, actually uses the same 2 GB, along with config options from the old session, which should have been closed at this point. Summarizing: * {{spark.stop()}} stops only the SparkContext, but not the JVM * {{spark.stop()}} does not clean up the session config This behavior has been observed for a long time, at least since 2.2.0, and is still present in 3.5.1. It could be solved by stopping the JVM after the Spark session is stopped. But just calling {{spark._jvm.System.exit(0)}} will fail because the py4j socket will be closed, and it will be impossible to start a new Spark session: {code:python} spark._jvm.System.exit(0) {code} {code} ERROR:root:Exception while sending command. Traceback (most recent call last): File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py", line 516, in send_command raise Py4JNetworkError("Answer from Java side is empty") py4j.protocol.Py4JNetworkError: Answer from Java side is empty During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py", line 1038, in send_command response = connection.send_command(command) File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py", line 539, in send_command raise Py4JNetworkError( py4j.protocol.Py4JNetworkError: Error while sending or receiving Traceback (most recent call last): File "", line 1, in File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py", line 1322, in __call__ return_value = get_return_value( ^ File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py", line 179, in deco return f(*a, **kw) ^^^ File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/protocol.py", line 334, in get_return_value raise Py4JError( py4j.protocol.Py4JError: An error occurred while calling z:java.lang.System.exit {code} Here is an article describing the same issue, and also providing a safe way to stop JVM without breaking py4j (in Russian): https://habr.com/ru/companies/sberbank/articles/805285/ {code:python} from pyspark.sql import SparkSession from pyspark import SparkContext from threading import RLock from py4j.protocol import CALL_COMMAND_NAME, END_COMMAND_PART from pyspark.sql import SparkSession from pyspark
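The failure mode above (the JVM dies before py4j can send back a reply, so the Python side reads an empty answer) can be illustrated with a toy request/reply bridge built on plain stdlib sockets. This is an illustration only, not py4j code; all names here are made up for the sketch:

```python
# Toy illustration (plain stdlib sockets, NOT py4j): an "exit" command
# kills the server side of the bridge before a reply is sent, so the
# client reads empty bytes -- analogous to py4j's
# "Answer from Java side is empty" when System.exit(0) is invoked
# through the gateway.
import socket
import threading

def serve_one(listener: socket.socket) -> None:
    conn, _ = listener.accept()
    with conn:
        cmd = conn.recv(1024).decode()
        if cmd == "exit":
            return  # simulate the JVM dying: close without replying
        conn.sendall(b"ok")

def call(port: int, cmd: str) -> bytes:
    with socket.create_connection(("127.0.0.1", port)) as c:
        c.sendall(cmd.encode())
        return c.recv(1024)  # b"" means the peer vanished mid-call

def run(cmd: str) -> bytes:
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    t = threading.Thread(target=serve_one, args=(listener,))
    t.start()
    reply = call(port, cmd)
    t.join()
    listener.close()
    return reply
```

`run("ping")` returns `b"ok"`, while `run("exit")` returns empty bytes: the same kind of dropped-mid-call answer that py4j surfaces as Py4JNetworkError in the traceback above.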
[jira] [Created] (SPARK-47740) Stop JVM by calling SparkSession.stop from PySpark
Maxim Martynov created SPARK-47740: -- Summary: Stop JVM by calling SparkSession.stop from PySpark Key: SPARK-47740 URL: https://issues.apache.org/jira/browse/SPARK-47740 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.5.1 Reporter: Maxim Martynov {code:python} from pyspark.sql import SparkSession spark = SparkSession.builder.config("spark.driver.memory", "2g").config("spark.another.property", "hi").getOrCreate() spark.conf.get("spark.driver.memory") # prints '2g' spark.conf.get("spark.another.property") # prints 'hi' # check for JVM process before and after stop spark.stop() # try to recreate Spark session with different driver memory, and no "spark.another.property" at all spark = SparkSession.builder.config("spark.driver.memory", "3g").getOrCreate() spark.conf.get("spark.driver.memory") # prints '3g' spark.conf.get("spark.another.property") # prints 'hi' # check for JVM process {code} After the first call of {{.getOrCreate()}}, a JVM process with the following options has been started: {code} maxim 76625 10.9 2.7 7975292 447068 pts/8 Sl+ 16:59 0:11 /home/maxim/.sdkman/candidates/java/current/bin/java -cp /home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/conf:/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/jars/* -Xmx2g -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED 
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit --conf spark.driver.memory=2g --conf spark.another.property=hi pyspark-shell {code} But it has not been stopped by the {{.stop()}} method call. The new Spark session, which reports 3 GB of driver memory, actually uses the same 2 GB, along with config options from the old session, which should have been closed at this point. Summarizing: * {{spark.stop()}} stops only the SparkContext, but not the JVM * {{spark.stop()}} does not clean up the session config This behavior has been observed for a long time, at least since 2.2.0, and is still present in 3.5.1. It could be solved by stopping the JVM after the Spark session is stopped. But just calling {{spark._jvm.System.exit(0)}} will fail because the py4j socket will be closed, and it will be impossible to start a new Spark session: {code:python} spark._jvm.System.exit(0) {code} {code} ERROR:root:Exception while sending command. 
Traceback (most recent call last): File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py", line 516, in send_command raise Py4JNetworkError("Answer from Java side is empty") py4j.protocol.Py4JNetworkError: Answer from Java side is empty During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py", line 1038, in send_command response = connection.send_command(command) File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py", line 539, in send_command raise Py4JNetworkError( py4j.protocol.Py4JNetworkError: Error while sending or receiving Traceback (most recent call last): File "", line 1, in File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py", line 1322, in __call__ return_value = get_return_value( ^ File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py", line 179, in deco return f(*a, **kw) ^^^ File "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/protocol.py", line 334, in get_return_value raise Py4JError( py4j.protocol.Py4JError: An error occurred while calling z:java.lang.System.exit {code} Here is a description of a safe way to stop JVM without breaking py4j (in Russian): https://habr.com/ru/companies/sberbank/articles/805285/ {code:python} from pyspark.sql import SparkSession from pyspark import
[jira] [Resolved] (SPARK-47735) Make pyspark.testing.connectutils compatible with pyspark-connect
[ https://issues.apache.org/jira/browse/SPARK-47735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-47735. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 45887 [https://github.com/apache/spark/pull/45887] > Make pyspark.testing.connectutils compatible with pyspark-connect > - > > Key: SPARK-47735 > URL: https://issues.apache.org/jira/browse/SPARK-47735 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-47735) Make pyspark.testing.connectutils compatible with pyspark-connect
[ https://issues.apache.org/jira/browse/SPARK-47735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-47735: - Assignee: Hyukjin Kwon > Make pyspark.testing.connectutils compatible with pyspark-connect > - > > Key: SPARK-47735 > URL: https://issues.apache.org/jira/browse/SPARK-47735 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Tests >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-46632) EquivalentExpressions throw IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-46632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-46632: --- Labels: pull-request-available (was: ) > EquivalentExpressions throw IllegalStateException > - > > Key: SPARK-46632 > URL: https://issues.apache.org/jira/browse/SPARK-46632 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core, SQL >Affects Versions: 3.3.0, 3.4.0, 3.5.0 >Reporter: zhangzhenhao >Priority: Major > Labels: pull-request-available > > EquivalentExpressions throws an IllegalStateException for some IF expressions: > ```scala > import org.apache.spark.sql.catalyst.dsl.expressions.DslExpression > import org.apache.spark.sql.catalyst.expressions.\{EquivalentExpressions, If, > Literal} > import org.apache.spark.sql.functions.col > val one = Literal(1.0) > val y = col("y").expr > val e1 = If( > Literal(true), > y * one * one + one * one * y, > y * one * one + one * one * y > ) > (new EquivalentExpressions).addExprTree(e1) > ``` > > The result is > ``` > java.lang.IllegalStateException: Cannot update expression: (1.0 * 1.0) in > map: Map(ExpressionEquals(('y * 1.0)) -> ExpressionStats(('y * 1.0))) with > use count: -1 > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprInMap(EquivalentExpressions.scala:85) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:198) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1(EquivalentExpressions.scala:200) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1$adapted(EquivalentExpressions.scala:200) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at 
scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:200) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1(EquivalentExpressions.scala:200) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1$adapted(EquivalentExpressions.scala:200) > at scala.collection.Iterator.foreach(Iterator.scala:943) > at scala.collection.Iterator.foreach$(Iterator.scala:943) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:200) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateCommonExprs(EquivalentExpressions.scala:128) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$3(EquivalentExpressions.scala:201) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$3$adapted(EquivalentExpressions.scala:201) > at scala.collection.immutable.List.foreach(List.scala:431) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:201) > at > org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:188) > ... 49 elided > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
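The negative use count in the error above can be modeled with a small sketch. This is a toy bookkeeping class, not Spark's actual EquivalentExpressions; it only mirrors the invariant the exception guards: decrementing the count of an expression that was never registered would drive the count below zero, which is an inconsistent state.

```python
# Toy model of use-count bookkeeping for common-subexpression tracking.
# NOT Spark's implementation; it only mirrors the invariant that
# EquivalentExpressions.updateExprInMap enforces: a use count must
# never go negative.
class UseCounts:
    def __init__(self):
        self.counts: dict = {}

    def update(self, expr: str, delta: int) -> int:
        new = self.counts.get(expr, 0) + delta
        if new < 0:
            # Mirrors "Cannot update expression: ... with use count: -1"
            raise RuntimeError(
                f"Cannot update expression: {expr} in map: {self.counts} "
                f"with use count: {new}")
        if new == 0:
            self.counts.pop(expr, None)  # drop fully released entries
        else:
            self.counts[expr] = new
        return new
```

Updating a key like `(1.0 * 1.0)` with delta -1 before it was ever added raises immediately, which is the state the reported stack trace ends up in when add and remove visit different subtrees of the IF expression.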
[jira] [Comment Edited] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834315#comment-17834315 ] Christos Karras edited comment on SPARK-46143 at 4/5/24 1:47 PM: - I have the same issue too. The problem is that the squeeze parameter of read_excel has been deprecated since pandas version 1.4, and has been completely removed in pandas 2.0. But PySpark's implementation of read_excel keeps passing the squeeze parameter, even though the parameter has also been deprecated since PySpark 3.4. Since there's no version constraint in PySpark that indicates pandas 2.0 is not supported, a fix to avoid the need to stay with pandas 1.x could be to detect the pandas version and decide if the squeeze parameter should be passed depending on the pandas version. Also, if the squeeze parameter is passed to the PySpark function, raise an error if a version of pandas that no longer supports this parameter is installed. This would be a transitional solution until the squeeze parameter is also removed completely from PySpark. Modify pyspark\pandas\namespace.py: * Change the squeeze parameter of the read_excel function to be "squeeze: Optional[bool] = None" * Modify the nested pd_read_excel function to check for the pandas version and build a dict of arguments it will pass to pd.read_excel based on that version. Also consider if the squeeze parameter was passed by the caller or not. 
And if the caller specified that argument but a newer version of pandas that doesn't support it is installed, raise an exception. Sample code that could implement this solution: def pd_read_excel( io_or_bin: Any, sn: Union[str, int, List[Union[str, int]], None], sq: bool ) -> pd.DataFrame: read_excel_args: dict = { "io":BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin, "sheet_name":sn, "header":header, # TODO other args..., **kwds } if squeeze is not None: pandas_version_major = int(pd.__version__.split(".")[0]) if pandas_version_major >= 2: raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x") read_excel_args["squeeze"] = squeeze return pd.read_excel(**read_excel_args) was (Author: JIRAUSER304886): I have the same issue too. The problem is because the squeeze parameter of read_excel has been deprecated since pandas version 1.4, and has been completely removed in pandas 2.0. But Pyspark's implementation of read_excel keeps passing the squeeze parameter, even though the parameter is also deprecated since pyspark 3.4. Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not supported, a fix to avoid the need to stay with pandas 1.x could be to detect the pandas version and decide if the squeeze parameter should be passed depending on the pandas version. Also, if the squeeze parameter is passed in the Pyspark function, raise an error if a versio of pandas that no longer supports this parameter is installed. This would be a transition solution until the squeeze parameter is also removed completely from Pyspark. Modify pyspark\pandas\namespace.py: * Change the squeeze parameter of the read_excel function to be "squeeze: Optional[bool] = None" * Modify the nested pd_read_excel function to check for the pandas version and build a dict of arguments it will pass to pd.read_excel based on that version. Also consider if the squeeze parameter was passed by the caller or not. 
And if the caller specified that argument but a newer version of pandas that doesn't support it, raise an exception Sample code that could implement this solution: def pd_read_excel( io_or_bin: Any, sn: Union[str, int, List[Union[str, int]], None], sq: bool ) -> pd.DataFrame: read_excel_args: dict = { "io":BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin, "sheet_name":sn, "header":header, # TODO other args..., **kwds } if squeeze is not None: if pandas_version >= 2: raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x") read_excel_args["squeeze"] = squeeze return pd.read_excel(**read_excel_args) > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: pyspark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace at runtime 1.2. > Tested the same scenario on a spark 3.4.1 standalone deployment on docker > documented at
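The version-gated argument passing proposed in the comment can be sketched as a standalone helper. `build_read_excel_args` is a hypothetical name for illustration; the real change would live inside the nested pd_read_excel in pyspark/pandas/namespace.py:

```python
# Hedged sketch of the proposed fix: only forward "squeeze" to
# pandas.read_excel when the installed pandas major version still
# supports it, and fail loudly when the caller asks for it on
# pandas >= 2. `build_read_excel_args` is a hypothetical helper,
# not PySpark API.
from typing import Any, Optional

def build_read_excel_args(pandas_version: str,
                          squeeze: Optional[bool] = None,
                          **kwds: Any) -> dict:
    args = dict(kwds)
    if squeeze is not None:
        # Same check as the comment's pd.__version__.split(".")[0]
        major = int(pandas_version.split(".")[0])
        if major >= 2:
            raise ValueError(
                "The squeeze parameter for read_excel is no longer "
                "available in pandas 2.x")
        args["squeeze"] = squeeze
    return args
```

A call like `pd.read_excel(**build_read_excel_args(pd.__version__, squeeze=sq, sheet_name=sn, header=header))` would then work on both pandas 1.x and 2.x as long as squeeze is left at its None default.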
[jira] [Comment Edited] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834315#comment-17834315 ] Christos Karras edited comment on SPARK-46143 at 4/5/24 1:42 PM: - I have the same issue too. The problem is because the squeeze parameter of read_excel has been deprecated since pandas version 1.4, and has been completely removed in pandas 2.0. But Pyspark's implementation of read_excel keeps passing the squeeze parameter, even though the parameter is also deprecated since pyspark 3.4. Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not supported, a fix to avoid the need to stay with pandas 1.x could be to detect the pandas version and decide if the squeeze parameter should be passed depending on the pandas version. Also, if the squeeze parameter is passed in the Pyspark function, raise an error if a versio of pandas that no longer supports this parameter is installed. This would be a transition solution until the squeeze parameter is also removed completely from Pyspark. Modify pyspark\pandas\namespace.py: * Change the squeeze parameter of the read_excel function to be "squeeze: Optional[bool] = None" * Modify the nested pd_read_excel function to check for the pandas version and build a dict of arguments it will pass to pd.read_excel based on that version. Also consider if the squeeze parameter was passed by the caller or not. 
And if the caller specified that argument but a newer version of pandas that doesn't support it, raise an exception Sample code that could implement this solution: def pd_read_excel( io_or_bin: Any, sn: Union[str, int, List[Union[str, int]], None], sq: bool ) -> pd.DataFrame: read_excel_args: dict = { "io":BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin, "sheet_name":sn, "header":header, # TODO other args..., **kwds } if squeeze is not None: if pandas_version >= 2: raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x") read_excel_args["squeeze"] = squeeze return pd.read_excel(**read_excel_args) was (Author: JIRAUSER304886): I have the same issue too. The problem is because the squeeze parameter of read_excel has been deprecated since pandas version 1.4, and has been completely removed in pandas 2.0. But Pyspark's implementation of read_excel keeps passing the squeeze parameter, even though the parameter is also deprecated since pyspark 3.4. Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not supported, a fix to avoid the need to stay with pandas 1.x could be to detect the pandas version and decide if the squeeze parameter should be passed depending on the pandas version. Also, if the squeeze parameter is passed in the Pyspark function, raise an error if a versio of pandas that no longer supports this parameter is installed. This would be a transition solution until the squeeze parameter is also removed completely from Pyspark. Modify pyspark\pandas\namespace.py: * Change the squeeze parameter of the read_excel function to be "squeeze: Optional[bool] = None" * Modify the nested pd_read_excel function to check for the pandas version and build a dict of arguments it will pass to pd.read_excel based on that version. Also consider if the squeeze parameter was passed by the caller or not. 
And if the caller specified that argument but a newer version of pandas that doesn't support it, raise an exception Sample code that could implement this solution: def pd_read_excel( io_or_bin: Any, sn: Union[str, int, List[Union[str, int]], None], sq: bool ) -> pd.DataFrame: read_excel_args: dict = { "io":BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin, "sheet_name":sn, "header":header, # TODO other args..., **kwds } if squeeze is not None: if pandas_version >= 2: raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x") read_excel_args["squeeze"] = squeeze return pd.read_excel(**read_excel_args) > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: pyspark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace at runtime 1.2. > Tested the same scenario on a spark 3.4.1 standalone deployment on docker > documented at https://github.com/mpavanetti/sparkenv > > >Reporter: Matheus
[jira] [Comment Edited] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834315#comment-17834315 ] Christos Karras edited comment on SPARK-46143 at 4/5/24 1:41 PM: - I have the same issue too. The problem is because the squeeze parameter of read_excel has been deprecated since pandas version 1.4, and has been completely removed in pandas 2.0. But Pyspark's implementation of read_excel keeps passing the squeeze parameter, even though the parameter is also deprecated since pyspark 3.4. Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not supported, a fix to avoid the need to stay with pandas 1.x could be to detect the pandas version and decide if the squeeze parameter should be passed depending on the pandas version. Also, if the squeeze parameter is passed in the Pyspark function, raise an error if a versio of pandas that no longer supports this parameter is installed. This would be a transition solution until the squeeze parameter is also removed completely from Pyspark. Modify pyspark\pandas\namespace.py: * Change the squeeze parameter of the read_excel function to be "squeeze: Optional[bool] = None" * Modify the nested pd_read_excel function to check for the pandas version and build a dict of arguments it will pass to pd.read_excel based on that version. Also consider if the squeeze parameter was passed by the caller or not. 
And if the caller specified that argument but a newer version of pandas that doesn't support it, raise an exception Sample code that could implement this solution: def pd_read_excel( io_or_bin: Any, sn: Union[str, int, List[Union[str, int]], None], sq: bool ) -> pd.DataFrame: read_excel_args: dict = { "io":BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin, "sheet_name":sn, "header":header, # TODO other args..., **kwds } if squeeze is not None: if pandas_version >= 2: raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x") read_excel_args["squeeze"] = squeeze return pd.read_excel(**read_excel_args) was (Author: JIRAUSER304886): I have the same issue too. The problem is because the squeeze parameter of read_excel has been deprecated since pandas version 1.4, and has been completely removed in pandas 2.0. But Pyspark's implementation of read_excel keeps passing the squeeze parameter, even though the parameter is also deprecated since pyspark 3.4. Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not supported, a fix to avoid the need to stay with pandas 1.x could be to detect the pandas version and decide if the squeeze parameter should be passed depending on the pandas version. Also, if the squeeze parameter is passed in the Pyspark function, raise an error if a versio of pandas that no longer supports this parameter is installed. This would be a transition solution until the squeeze parameter is also removed completely from Pyspark. Modify pyspark\pandas\namespace.py: * Change the squeeze parameter of the read_excel function to be "squeeze: Optional[bool] = None" * Modify the nested pd_read_excel function to check for the pandas version and build a dict of arguments it will pass to pd.read_excel based on that version. Also consider if the squeeze parameter was passed by the caller or not. 
And if the caller specified that argument but a newer version of pandas that doesn't support it, raise an exception: {{ }} {{def pd_read_excel(}}{{ io_or_bin: Any, sn: Union[str, int, List[Union[str, int]], None], sq: bool}}{{ ) -> pd.DataFrame:}}{{read_excel_args: dict = { {{ {{"io":BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin,}}{{ }} {{"sheet_name":sn,}} {{"header":header,}} {{...,}} {{**kwds}} {{}}} {{if squeeze is not None:}} {{ if pandas_version >= 2:}} {{ raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x")}} {{read_excel_args["squeeze"] = squeeze}} {{return pd.read_excel(**read_excel_args)}} {{ }} > pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: pyspark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace at runtime 1.2. > Tested the same scenario on a spark 3.4.1 standalone deployment on docker > documented at https://github.com/mpavanetti/sparkenv > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments:
[jira] [Commented] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1
[ https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834315#comment-17834315 ] Christos Karras commented on SPARK-46143: - I have the same issue too. The problem is that the squeeze parameter of read_excel has been deprecated since pandas version 1.4, and has been completely removed in pandas 2.0. But PySpark's implementation of read_excel keeps passing the squeeze parameter, even though the parameter has also been deprecated since PySpark 3.4. Since there's no version constraint in PySpark that indicates pandas 2.0 is not supported, a fix to avoid the need to stay with pandas 1.x could be to detect the pandas version and decide if the squeeze parameter should be passed depending on the pandas version. Also, if the squeeze parameter is passed to the PySpark function, raise an error if a version of pandas that no longer supports this parameter is installed. This would be a transitional solution until the squeeze parameter is also removed completely from PySpark. Modify pyspark\pandas\namespace.py: * Change the squeeze parameter of the read_excel function to be "squeeze: Optional[bool] = None" * Modify the nested pd_read_excel function to check for the pandas version and build a dict of arguments it will pass to pd.read_excel based on that version. Also consider if the squeeze parameter was passed by the caller or not. 
And if the caller specified that argument but a newer version of pandas doesn't support it, raise an exception:

def pd_read_excel(
    io_or_bin: Any,
    sn: Union[str, int, List[Union[str, int]], None],
    squeeze: Optional[bool],
) -> pd.DataFrame:
    read_excel_args: dict = {
        "io": BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin,
        "sheet_name": sn,
        "header": header,
        ...,
        **kwds,
    }
    if squeeze is not None:
        if pandas_version >= 2:
            raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x")
        read_excel_args["squeeze"] = squeeze
    return pd.read_excel(**read_excel_args)

> pyspark.pandas read_excel implementation at version 3.4.1 > - > > Key: SPARK-46143 > URL: https://issues.apache.org/jira/browse/SPARK-46143 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.1 > Environment: pyspark 3.4.1.5.3 build 20230713. > Running on Microsoft Fabric workspace at runtime 1.2. > Tested the same scenario on a spark 3.4.1 standalone deployment on docker > documented at https://github.com/mpavanetti/sparkenv > > >Reporter: Matheus Pavanetti >Priority: Major > Attachments: MicrosoftTeams-image.png, > image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png > > > Hello, > I would like to report an issue with the pyspark.pandas implementation of the > read_excel function. > The Microsoft Fabric spark environment 1.2 (runtime) uses pyspark 3.4.1, which > potentially uses an older version of pandas in its implementation of > pyspark.pandas. > The read_excel function from pandas does not expect a parameter called > "squeeze"; however, it is implemented as part of pyspark.pandas, and the > parameter "squeeze" is being passed to the pandas function. > > !image-2023-11-28-13-20-40-275.png!
> > I've been digging into it for further investigation in the pyspark 3.4.1 > documentation > https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel > > This is the point where I found that the "squeeze" parameter is being passed to the pandas > read_excel function, which is not expected. > It seems it was deprecated as of pyspark 3.4.0 but is still being used > in the implementation. > > !image-2023-11-28-13-20-51-291.png! > > I believe this is an issue with the pyspark 3.4.1 implementation, not necessarily > with Fabric. However, Fabric uses this version in its 1.2 build. > > I am able to work around it for now by downloading the Excel file from OneLake > to the spark driver, loading it into memory with pandas, and then > converting it to a spark dataframe. I also made it work by downgrading the build: > I downloaded the pyspark build 20230713 locally, made the changes, and > re-compiled it, and it worked locally. So this is related to the > implementation and would have to be fixed, or I would have to downgrade to an older > version like 3.3.3 or try the latest 3.5.0, which is not an option on Fabric. > > -- This message was sent
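The version-gating approach described in the comment above can be sketched roughly as follows. This is a hedged illustration only, not the actual pyspark.pandas code: the helper names `pandas_major_version` and `build_read_excel_args` are hypothetical, and a real fix would live inside the nested pd_read_excel in pyspark/pandas/namespace.py.

```python
from typing import Any, Dict, List, Optional, Union


def pandas_major_version(version_string: str) -> int:
    # "2.1.4" -> 2, "1.4.0" -> 1; the installed version string is
    # available at runtime as pandas.__version__.
    return int(version_string.split(".")[0])


def build_read_excel_args(
    io: Any,
    sheet_name: Union[str, int, List[Union[str, int]], None],
    squeeze: Optional[bool],
    pd_version: str,
) -> Dict[str, Any]:
    # Start with arguments that are valid on every pandas version.
    args: Dict[str, Any] = {"io": io, "sheet_name": sheet_name}
    if squeeze is not None:
        # squeeze was deprecated in pandas 1.4 and removed in 2.0:
        # fail explicitly rather than let pd.read_excel raise an
        # opaque TypeError about an unexpected keyword argument.
        if pandas_major_version(pd_version) >= 2:
            raise ValueError(
                "The squeeze parameter for read_excel is no longer "
                "available in pandas 2.x"
            )
        args["squeeze"] = squeeze
    return args
```

pd.read_excel(**build_read_excel_args(...)) would then only ever see the squeeze keyword on pandas 1.x, and callers who explicitly pass squeeze on pandas 2.x get a clear error, matching the transitional behavior proposed in the comment.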
[jira] [Updated] (SPARK-47737) Bump PyArrow to 15.0.2
[ https://issues.apache.org/jira/browse/SPARK-47737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-47737: --- Labels: pull-request-available (was: ) > Bump PyArrow to 15.0.2 > -- > > Key: SPARK-47737 > URL: https://issues.apache.org/jira/browse/SPARK-47737 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Haejoon Lee >Priority: Major > Labels: pull-request-available > > Use the latest version for stability. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-47736) Add support for AbstractArrayType(StringTypeCollated)
[ https://issues.apache.org/jira/browse/SPARK-47736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihailo Milosevic updated SPARK-47736: -- Summary: Add support for AbstractArrayType(StringTypeCollated) (was: Add support for ArrayType(StringTypeAnyCollation)) > Add support for AbstractArrayType(StringTypeCollated) > - > > Key: SPARK-47736 > URL: https://issues.apache.org/jira/browse/SPARK-47736 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-47736) Add support for ArrayType(StringTypeAnyCollation)
Mihailo Milosevic created SPARK-47736: - Summary: Add support for ArrayType(StringTypeAnyCollation) Key: SPARK-47736 URL: https://issues.apache.org/jira/browse/SPARK-47736 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Mihailo Milosevic -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org