[jira] [Resolved] (SPARK-47743) Use milliseconds as the time unit in loggings

2024-04-05 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-47743.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45903
[https://github.com/apache/spark/pull/45903]

> Use milliseconds as the time unit in loggings
> -
>
> Key: SPARK-47743
> URL: https://issues.apache.org/jira/browse/SPARK-47743
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
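
A minimal illustrative sketch of the convention in the title: elapsed time computed and reported in milliseconds in a log message. The work being timed and the message text are hypothetical, not taken from the patch.

{code:scala}
// Illustrative only: time a unit of work and report the duration in milliseconds.
def doWork(): Unit = Thread.sleep(42)   // hypothetical work being timed

val startMs = System.currentTimeMillis()
doWork()
val elapsedMs = System.currentTimeMillis() - startMs
println(s"Finished work in $elapsedMs ms")   // in Spark this would go through logInfo
{code}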




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47094) SPJ : Dynamically rebalance number of buckets when they are not equal

2024-04-05 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-47094.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45267
[https://github.com/apache/spark/pull/45267]

> SPJ : Dynamically rebalance number of buckets when they are not equal
> -
>
> Key: SPARK-47094
> URL: https://issues.apache.org/jira/browse/SPARK-47094
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Himadri Pal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> SPJ: Storage Partition Join works with Iceberg tables when both tables have 
> the same number of buckets. As part of this feature request, we would like 
> Spark to gather the bucket counts from both tables and dynamically rebalance 
> them via coalesce or repartition so that SPJ can still be used. In this case 
> we would still have to shuffle, but that is better than no SPJ at all.
> Use case:
> We often do not have control over the input tables, so it is not possible to 
> change the partitioning scheme on those tables. As consumers, we would still 
> like them to benefit from SPJ when joined with other tables and output tables 
> that have a different number of buckets.
> In this scenario we would need to read those tables and rewrite them with a 
> matching number of buckets for SPJ to work; this extra step could outweigh 
> the benefit of the reduced shuffle from SPJ. Also, when multiple different 
> tables are joined, each table needs to be rewritten with a matching number of 
> buckets.
> If this feature is implemented, SPJ functionality will be more powerful.
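
For illustration, a rough Scala sketch of the manual workaround described above: materializing a copy of one side with a matching bucket layout before the join. The table names, the bucket column, and the Iceberg DDL are assumptions for the example; this is exactly the extra read/write pass that dynamic rebalancing would make unnecessary.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spj-bucket-workaround").getOrCreate()

// Hypothetical Iceberg tables: db.orders is bucketed into 16 buckets on
// customer_id, db.customers into 8, so SPJ does not apply today.
val customers = spark.table("db.customers")

// Manual workaround: rewrite one side with a matching bucket count, paying a
// full extra read/write of the table (Iceberg bucket-transform DDL assumed).
spark.sql("""
  CREATE OR REPLACE TABLE db.orders_rebucketed
  USING iceberg
  PARTITIONED BY (bucket(8, customer_id))
  AS SELECT * FROM db.orders
""")

// The join then runs against the rebucketed copy, where both sides have 8 buckets.
val joined = spark.table("db.orders_rebucketed").join(customers, "customer_id")
joined.explain()
{code}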



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47744) Add support for negative byte types in RocksDB range scan encoder

2024-04-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47744:
---
Labels: pull-request-available  (was: )

> Add support for negative byte types in RocksDB range scan encoder
> -
>
> Key: SPARK-47744
> URL: https://issues.apache.org/jira/browse/SPARK-47744
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Neil Ramaswamy
>Priority: Major
>  Labels: pull-request-available
>
> SPARK-47372 introduced the ability for the RocksDB state provider to 
> big-endian encode values so that they could be range scanned. However, signed 
> support for Byte types [was 
> missed|https://github.com/apache/spark/blob/1efbf43160aa4e36710a4668f05fe61534f49648/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala#L326],
>  even though Scala Bytes are 
> [signed|https://www.scala-lang.org/api/2.13.5/scala/Byte.html].
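
For illustration, a small self-contained Scala sketch (not the encoder's actual code) of why raw signed bytes break an unsigned, byte-wise range scan, and how the common trick of flipping the sign bit restores numeric order; whether the linked fix uses exactly this encoding is an assumption.

{code:scala}
// Byte values including negatives.
val values: Seq[Byte] = Seq(-3, -1, 0, 2, 5).map(_.toByte)

// Compared as unsigned bytes (as a byte-wise range scan does), two's-complement
// negatives such as -1 (0xFF) incorrectly sort after every positive value.
val rawOrder = values.sortBy(b => b & 0xFF)
println(rawOrder.mkString(", "))     // 0, 2, 5, -3, -1

// Flipping the sign bit (XOR 0x80) before encoding makes the unsigned byte
// order match the signed numeric order.
val fixedOrder = values.sortBy(b => (b ^ 0x80) & 0xFF)
println(fixedOrder.mkString(", "))   // -3, -1, 0, 2, 5
{code}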



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47746) Use column ordinals instead of prefix ordering columns in the range scan encoder

2024-04-05 Thread Neil Ramaswamy (Jira)
Neil Ramaswamy created SPARK-47746:
--

 Summary: Use column ordinals instead of prefix ordering columns in 
the range scan encoder
 Key: SPARK-47746
 URL: https://issues.apache.org/jira/browse/SPARK-47746
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Neil Ramaswamy


Currently, the State V2 implementations do projections in their state managers 
and then provide a set of prefix (ordering) columns to the RocksDBStateEncoder. 
However, we can avoid the extra projection by having the state encoder read just 
the ordinals it needs, in the order it needs them.
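
A rough sketch of the idea, using a hypothetical key schema and the public Row API rather than the internal encoder types:

{code:scala}
import org.apache.spark.sql.Row

// Hypothetical key schema: (sessionId: String, count: Long, eventTimeMs: Long),
// where the range scan should order by (eventTimeMs, sessionId), i.e. ordinals 2 and 0.
val orderingOrdinals = Seq(2, 0)

// Instead of projecting the key row into (ordering columns ++ remaining columns)
// up front, read only the needed ordinals, in order, while building the encoded key.
def orderingValues(keyRow: Row): Seq[Any] = orderingOrdinals.map(keyRow.get)

println(orderingValues(Row("user-1", 42L, 1700000000000L)))
// List(1700000000000, user-1)
{code}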



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47581) SQL catalyst: Migrate logWarn with variables to structured logging framework

2024-04-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47581:
---
Labels: pull-request-available  (was: )

> SQL catalyst: Migrate logWarn with variables to structured logging framework
> 
>
> Key: SPARK-47581
> URL: https://issues.apache.org/jira/browse/SPARK-47581
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
>
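
For context, a hedged before/after sketch of the kind of change these migration sub-tasks make, assuming the MDC-based log interpolator of Spark's structured logging framework; the import paths, the surrounding class, and the BLOCK_ID log key are assumptions, not taken from the actual PR.

{code:scala}
import org.apache.spark.internal.{Logging, LogKeys, MDC}

// Sketch only: the class, method, and log key are illustrative.
class BlockFetcher extends Logging {
  def warnFetchFailure(blockId: String): Unit = {
    // Before: the variable is interpolated directly into the message string.
    logWarning(s"Failed to fetch block $blockId")

    // After: the variable is wrapped in MDC so it is emitted as a structured field.
    logWarning(log"Failed to fetch block ${MDC(LogKeys.BLOCK_ID, blockId)}")
  }
}
{code}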




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47745) Add License to Spark Operator

2024-04-05 Thread Zhou JIANG (Jira)
Zhou JIANG created SPARK-47745:
--

 Summary: Add License to Spark Operator
 Key: SPARK-47745
 URL: https://issues.apache.org/jira/browse/SPARK-47745
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: 4.0.0
Reporter: Zhou JIANG


Add license to the recently established operator repository.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47582) SQL catalyst: Migrate logInfo with variables to structured logging framework

2024-04-05 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-47582.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45866
[https://github.com/apache/spark/pull/45866]

> SQL catalyst: Migrate logInfo with variables to structured logging framework
> 
>
> Key: SPARK-47582
> URL: https://issues.apache.org/jira/browse/SPARK-47582
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Daniel
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47744) Add support for negative byte types in RocksDB range scan encoder

2024-04-05 Thread Neil Ramaswamy (Jira)
Neil Ramaswamy created SPARK-47744:
--

 Summary: Add support for negative byte types in RocksDB range scan 
encoder
 Key: SPARK-47744
 URL: https://issues.apache.org/jira/browse/SPARK-47744
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Neil Ramaswamy


SPARK-47372 introduced the ability for the RocksDB state provider to big-endian 
encode values so that they could be range scanned. However, signed support for 
Byte types [was 
missed|https://github.com/apache/spark/blob/1efbf43160aa4e36710a4668f05fe61534f49648/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBStateEncoder.scala#L326],
 even though Scala Bytes are 
[signed|https://www.scala-lang.org/api/2.13.5/scala/Byte.html].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45445) Upgrade snappy to 1.1.10.5

2024-04-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45445:
--
Fix Version/s: 3.4.3

> Upgrade snappy to 1.1.10.5
> --
>
> Key: SPARK-45445
> URL: https://issues.apache.org/jira/browse/SPARK-45445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47743) Use milliseconds as the time unit in loggings

2024-04-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47743:
---
Labels: pull-request-available  (was: )

> Use milliseconds as the time unit in loggings
> -
>
> Key: SPARK-47743
> URL: https://issues.apache.org/jira/browse/SPARK-47743
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47743) Use milliseconds as the time unit in loggings

2024-04-05 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-47743:
--

 Summary: Use milliseconds as the time unit in loggings
 Key: SPARK-47743
 URL: https://issues.apache.org/jira/browse/SPARK-47743
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45445) Upgrade snappy to 1.1.10.5

2024-04-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45445:
--
Fix Version/s: 3.5.2

> Upgrade snappy to 1.1.10.5
> --
>
> Key: SPARK-45445
> URL: https://issues.apache.org/jira/browse/SPARK-45445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46411) Change to use bcprov/bcpkix-jdk18on for test

2024-04-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46411:
--
Fix Version/s: 3.4.3

> Change to use bcprov/bcpkix-jdk18on for test
> 
>
> Key: SPARK-46411
> URL: https://issues.apache.org/jira/browse/SPARK-46411
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2, 3.4.3
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47719) Change default of spark.sql.legacy.timeParserPolicy from EXCEPTION to CORRECTED

2024-04-05 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-47719.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45859
[https://github.com/apache/spark/pull/45859]

> Change default of spark.sql.legacy.timeParserPolicy from EXCEPTION to 
> CORRECTED
> ---
>
> Key: SPARK-47719
> URL: https://issues.apache.org/jira/browse/SPARK-47719
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> spark.sql.legacy.timeParserPolicy was introduced in Spark 3.0 and has 
> defaulted to EXCEPTION.
> Changing the default from EXCEPTION to CORRECTED for Spark 4.0 will reduce 
> errors and reflects a prudent timeframe for the change.
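
For reference, a minimal sketch of how the policy can still be pinned explicitly by workloads that depend on the old behavior (config name and values as referenced in this issue; the session setup is boilerplate):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("time-parser-policy").getOrCreate()

// Opt back into the previous behavior if a workload depends on it
// (accepted values: EXCEPTION, CORRECTED, LEGACY).
spark.conf.set("spark.sql.legacy.timeParserPolicy", "EXCEPTION")

// Under CORRECTED (the proposed 4.0 default), parsing simply follows the
// non-legacy datetime patterns instead of raising the legacy-vs-new exception.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
spark.sql("SELECT to_timestamp('2024-04-05 12:34:56', 'yyyy-MM-dd HH:mm:ss')").show()
{code}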



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47719) Change default of spark.sql.legacy.timeParserPolicy from EXCEPTION to CORRECTED

2024-04-05 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-47719:
--

Assignee: Serge Rielau

> Change default of spark.sql.legacy.timeParserPolicy from EXCEPTION to 
> CORRECTED
> ---
>
> Key: SPARK-47719
> URL: https://issues.apache.org/jira/browse/SPARK-47719
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Serge Rielau
>Assignee: Serge Rielau
>Priority: Major
>  Labels: pull-request-available
>
> spark.sql.legacy.timeParserPolicy was introduced in Spark 3.0 and has 
> defaulted to EXCEPTION.
> Changing the default from EXCEPTION to CORRECTED for Spark 4.0 will reduce 
> errors and reflects a prudent timeframe for the change.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47742) Spark Transformation with Multi Case filter can improve efficiency

2024-04-05 Thread Hemant Sakharkar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hemant Sakharkar updated SPARK-47742:
-
Description: 
In feature engineering we need to process the input data to create features and 
feature vectors for model training, which requires multiple Spark transformations 
(e.g. map, filter). Spark already optimizes chains of transformations well thanks 
to its lazy execution: it combines multiple transformations into fewer stages, 
which reduces the overall execution time.

I found that we can still improve the execution time in the case of filters.

 
{code:java}
val rddfilter0 = personRdd.filter(t => t.age>5 && t.age<=12)
val rddfilter1 = personRdd.filter(t => t.age>12 && t.age<=18)
val rddfilter2 = personRdd.filter(t => t.age>18 && t.age<=25)
val rddfilter3 = personRdd.filter(t => t.age>25 && t.age<=35)
val rddfilter4 = personRdd.filter(t => t.age>35 && t.age<=65) {code}
*Sample Run Results:*

Records: 50,000,000

5 separate filters, execution time: 24854 ms

5 filters combined into a single map, execution time: 5212 ms

For a complex DAG of Spark transformations this can improve runtime several 
times over and significantly reduce the memory footprint.

Sample illustration can be found here

[https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing]

We need support for such a transformation in Spark Core so that more complex 
transformations can be supported. Some illustration is provided in the above 
document.

  was:
In Feature Engineering we need to process the input data to create feature and 
feature vectors which are required to train the model. For which we need to do 
multiple spark transformations (etc:map, filter etc) the spark has very good 
optimization for multiple transformations due to its lazy execution. It 
combines multiple transformations into fewer transformations which helps to 
optimize the overall execution time.

I found that we can still improve the execution time in the case of filters.

*Sample Run Results:*

Records :50,000,000

5 filter Execution Time: 24854 milli sec

5 filter with Map Execution Time: 5212 milli sec

We can very well improve multiple X times and reduce significant memory 
footprint for a complex DAG of Spark Transformation.

Sample illustration can be found here

[https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing]

Need support of such transformation in Spark Core so that more complex 
transformation can be supported. Some illustration is provided in above 
document.


> Spark Transformation with Multi Case filter can improve efficiency
> --
>
> Key: SPARK-47742
> URL: https://issues.apache.org/jira/browse/SPARK-47742
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hemant Sakharkar
>Priority: Major
>  Labels: performance
> Attachments: spark_chain_transformation.png
>
>
> In feature engineering we need to process the input data to create features 
> and feature vectors for model training, which requires multiple Spark 
> transformations (e.g. map, filter). Spark already optimizes chains of 
> transformations well thanks to its lazy execution: it combines multiple 
> transformations into fewer stages, which reduces the overall execution time.
> I found that we can still improve the execution time in the case of filters.
>  
> {code:java}
> val rddfilter0 = personRdd.filter(t => t.age>5 && t.age<=12)
> val rddfilter1 = personRdd.filter(t => t.age>12 && t.age<=18)
> val rddfilter2 = personRdd.filter(t => t.age>18 && t.age<=25)
> val rddfilter3 = personRdd.filter(t => t.age>25 && t.age<=35)
> val rddfilter4 = personRdd.filter(t => t.age>35 && t.age<=65) {code}
> *Sample Run Results:*
> Records: 50,000,000
> 5 separate filters, execution time: 24854 ms
> 5 filters combined into a single map, execution time: 5212 ms
> For a complex DAG of Spark transformations this can improve runtime several 
> times over and significantly reduce the memory footprint.
> Sample illustration can be found here
> [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing]
> We need support for such a transformation in Spark Core so that more complex 
> transformations can be supported. Some illustration is provided in the above 
> document.
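
For reference, a rough Scala sketch of the single-pass alternative that the timings above compare against: instead of five separate filter() passes over personRdd, one pass assigns each record to an age bucket. The Person case class, the sample data, and the bucket ids are hypothetical, mirroring the snippet in the description.

{code:scala}
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("multi-case-filter").getOrCreate()
val personRdd = spark.sparkContext.parallelize(Seq(Person("a", 10), Person("b", 30)))

// Map each age to a bucket id (None = outside all five ranges).
def ageBucket(age: Int): Option[Int] = age match {
  case a if a > 5  && a <= 12 => Some(0)
  case a if a > 12 && a <= 18 => Some(1)
  case a if a > 18 && a <= 25 => Some(2)
  case a if a > 25 && a <= 35 => Some(3)
  case a if a > 35 && a <= 65 => Some(4)
  case _                      => None
}

// One traversal instead of five filter passes; downstream logic can group
// or split the result by bucket id.
val bucketed = personRdd.flatMap(p => ageBucket(p.age).map(b => (b, p)))
bucketed.countByKey().foreach(println)
{code}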



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47741) Handle stack overflow when parsing query as part of Execute immediate

2024-04-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47741:
---
Labels: pull-request-available  (was: )

> Handle stack overflow when parsing query as part of Execute immediate
> -
>
> Key: SPARK-47741
> URL: https://issues.apache.org/jira/browse/SPARK-47741
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: Milan Stefanovic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> EXECUTE IMMEDIATE can generate complex queries which can lead to a stack 
> overflow while parsing.
> We need to catch this error and convert it into a proper AnalysisException 
> with an error class.
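
A hedged sketch of the approach described above (not the actual parser change): catch the StackOverflowError raised while parsing a deeply nested statement and resurface it as an analysis error with an error class. The error class name and the parser entry point shown are illustrative assumptions.

{code:scala}
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Illustrative only: wrap parsing of the EXECUTE IMMEDIATE query text so that a
// StackOverflowError becomes a classified analysis error instead of a crash.
def parseImmediateQuery(sqlText: String): LogicalPlan =
  try {
    CatalystSqlParser.parsePlan(sqlText)
  } catch {
    case _: StackOverflowError =>
      throw new AnalysisException(
        errorClass = "QUERY_TOO_COMPLEX",                    // hypothetical error class
        messageParameters = Map("query" -> sqlText.take(100)))
  }
{code}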



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47742) Spark Transformation with Multi Case filter can improve efficiency

2024-04-05 Thread Hemant Sakharkar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hemant Sakharkar updated SPARK-47742:
-
Description: 
In feature engineering we need to process the input data to create features and 
feature vectors for model training, which requires multiple Spark transformations 
(e.g. map, filter). Spark already optimizes chains of transformations well thanks 
to its lazy execution: it combines multiple transformations into fewer stages, 
which reduces the overall execution time.

I found that we can still improve the execution time in the case of filters.

*Sample Run Results:*

Records: 50,000,000

5 separate filters, execution time: 24854 ms

5 filters combined into a single map, execution time: 5212 ms

For a complex DAG of Spark transformations this can improve runtime several 
times over and significantly reduce the memory footprint.

Sample illustration can be found here

[https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing]

We need support for such a transformation in Spark Core so that more complex 
transformations can be supported. Some illustration is provided in the above 
document.

  was:
In Feature Engineering we need to process the input data to create feature and 
feature vectors which are required to train the model. For which we need to do 
multiple spark transformations (etc:map, filter etc) the spark has very good 
optimization for multiple transformations due to its lazy execution. It 
combines multiple transformations into fewer transformations which helps to 
optimize the overall execution time.

I found that we can still improve the execution time in the case of filters. 

*Sample Run Results:*

Records :50,000,000

5 filter Execution Time: (t2-t1) 24854 millisec

5 filter with Map Execution Time: (t3-t2) 5212 millisec

We can very well improve multiple X times and reduce significant memory 
footprint for a complex DAG of Spark Transformation.

Sample illustration can be found here

[https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing]

Need support of such transformation in Spark Core so that more complex 
transformation can be supported. Some illustration is provided in above 
document.


> Spark Transformation with Multi Case filter can improve efficiency
> --
>
> Key: SPARK-47742
> URL: https://issues.apache.org/jira/browse/SPARK-47742
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hemant Sakharkar
>Priority: Major
>  Labels: performance
> Attachments: spark_chain_transformation.png
>
>
> In feature engineering we need to process the input data to create features 
> and feature vectors for model training, which requires multiple Spark 
> transformations (e.g. map, filter). Spark already optimizes chains of 
> transformations well thanks to its lazy execution: it combines multiple 
> transformations into fewer stages, which reduces the overall execution time.
> I found that we can still improve the execution time in the case of filters.
> *Sample Run Results:*
> Records: 50,000,000
> 5 separate filters, execution time: 24854 ms
> 5 filters combined into a single map, execution time: 5212 ms
> For a complex DAG of Spark transformations this can improve runtime several 
> times over and significantly reduce the memory footprint.
> Sample illustration can be found here
> [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing]
> We need support for such a transformation in Spark Core so that more complex 
> transformations can be supported. Some illustration is provided in the above 
> document.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47742) Spark Transformation with Multi Case filter can improve efficiency

2024-04-05 Thread Hemant Sakharkar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hemant Sakharkar updated SPARK-47742:
-
Attachment: spark_chain_transformation.png

> Spark Transformation with Multi Case filter can improve efficiency
> --
>
> Key: SPARK-47742
> URL: https://issues.apache.org/jira/browse/SPARK-47742
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hemant Sakharkar
>Priority: Major
>  Labels: performance
> Attachments: spark_chain_transformation.png
>
>
> In feature engineering we need to process the input data to create features 
> and feature vectors for model training, which requires multiple Spark 
> transformations (e.g. map, filter). Spark already optimizes chains of 
> transformations well thanks to its lazy execution: it combines multiple 
> transformations into fewer stages, which reduces the overall execution time.
> I found that we can still improve the execution time in the case of filters. 
> *Sample Run Results:*
> Records: 50,000,000
> 5 separate filters, execution time (t2-t1): 24854 ms
> 5 filters combined into a single map, execution time (t3-t2): 5212 ms
> For a complex DAG of Spark transformations this can improve runtime several 
> times over and significantly reduce the memory footprint.
> Sample illustration can be found here
> [https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing]
> We need support for such a transformation in Spark Core so that more complex 
> transformations can be supported. Some illustration is provided in the above 
> document.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47742) Spark Transformation with Multi Case filter can improve efficiency

2024-04-05 Thread Hemant Sakharkar (Jira)
Hemant Sakharkar created SPARK-47742:


 Summary: Spark Transformation with Multi Case filter can improve 
efficiency
 Key: SPARK-47742
 URL: https://issues.apache.org/jira/browse/SPARK-47742
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Hemant Sakharkar


In feature engineering we need to process the input data to create features and 
feature vectors for model training, which requires multiple Spark transformations 
(e.g. map, filter). Spark already optimizes chains of transformations well thanks 
to its lazy execution: it combines multiple transformations into fewer stages, 
which reduces the overall execution time.

I found that we can still improve the execution time in the case of filters. 

*Sample Run Results:*

Records: 50,000,000

5 separate filters, execution time (t2-t1): 24854 ms

5 filters combined into a single map, execution time (t3-t2): 5212 ms

For a complex DAG of Spark transformations this can improve runtime several 
times over and significantly reduce the memory footprint.

Sample illustration can be found here

[https://docs.google.com/document/d/1gdWR2TwbCfiuRF51EHA1zRnD9ES_neIvIsgEvizrjuo/edit?usp=sharing]

We need support for such a transformation in Spark Core so that more complex 
transformations can be supported. Some illustration is provided in the above 
document.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47740) Stop JVM by calling SparkSession.stop from PySpark

2024-04-05 Thread Maxim Martynov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Martynov updated SPARK-47740:
---
Description: 
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", 
"2g").config("spark.another.property", "hi").getOrCreate()
spark.conf.get("spark.driver.memory")
# prints '2g'
spark.conf.get("spark.another.property")
# prints 'hi'

# check for JVM process before and after stop
spark.stop()

# try to recreate Spark session with different driver memory, and no 
"spark.another.property" at all
spark = SparkSession.builder.config("spark.driver.memory", "3g").getOrCreate()
spark.conf.get("spark.driver.memory")
# prints '3g'
spark.conf.get("spark.another.property")
# prints 'hi'

# check for JVM process
{code}

After the first call of {{.getOrCreate()}}, a JVM process with the following 
options has been started:
{code}
maxim  76625 10.9  2.7 7975292 447068 pts/8  Sl+  16:59   0:11 
/home/maxim/.sdkman/candidates/java/current/bin/java -cp 
/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/conf:/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/jars/*
 -Xmx2g -XX:+IgnoreUnrecognizedVMOptions 
--add-opens=java.base/java.lang=ALL-UNNAMED 
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED 
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED 
--add-opens=java.base/java.io=ALL-UNNAMED 
--add-opens=java.base/java.net=ALL-UNNAMED 
--add-opens=java.base/java.nio=ALL-UNNAMED 
--add-opens=java.base/java.util=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED 
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED 
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED 
--add-opens=java.base/sun.security.action=ALL-UNNAMED 
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED 
--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED 
-Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit 
--conf spark.driver.memory=2g --conf spark.another.property=hi pyspark-shell
{code}

But it has not been stopped by the {{.stop()}} method call. The new Spark 
session reports 3 GB of driver memory but actually still uses the same 2 GB, 
and it still sees config options from the old session, which should have been 
closed at this point.

Summarizing:
* {{spark.stop()}} stops only the SparkContext, but not the JVM
* {{spark.stop()}} does not clean up the session config

This behavior has been observed for a long time, at least since 2.2.0, and is 
still present in 3.5.1.

This could be solved by stopping the JVM after the Spark session is stopped.
But just calling {{spark._jvm.System.exit(0)}} will fail because the py4j 
socket will be closed, and it will be impossible to start a new Spark session:
{code:python}
spark._jvm.System.exit(0)
{code}

{code}
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py",
 line 516, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py",
 line 1038, in send_command
response = connection.send_command(command)
   
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py",
 line 539, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py",
 line 1322, in __call__
return_value = get_return_value(
   ^
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py",
 line 179, in deco
return f(*a, **kw)
   ^^^
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/protocol.py",
 line 334, in get_return_value
raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling z:java.lang.System.exit
{code}

Here is an article describing the same issue, and also providing a safe way to 
stop JVM without breaking py4j (in Russian):
https://habr.com/ru/companies/sberbank/articles/805285/

{code:python}
from pyspark.sql import SparkSession
from pyspark import SparkContext
from threading import RLock
from py4j.protocol import CALL_COMMAND_NAME, END_COMMAND_PART


from pyspark.sql import SparkSession
from pyspark 

[jira] [Updated] (SPARK-47740) Stop JVM by calling SparkSession.stop from PySpark

2024-04-05 Thread Maxim Martynov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Martynov updated SPARK-47740:
---
Description: 
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", 
"2g").config("spark.another.property", "hi").getOrCreate()
spark.conf.get("spark.driver.memory")
# prints '2g'
spark.conf.get("spark.another.property")
# prints 'hi'

# check for JVM process before and after stop
spark.stop()

# try to recreate Spark session with different driver memory, and no 
"spark.another.property" at all
spark = SparkSession.builder.config("spark.driver.memory", "3g").getOrCreate()
spark.conf.get("spark.driver.memory")
# prints '3g'
spark.conf.get("spark.another.property")
# prints 'hi'

# check for JVM process
{code}

After the first call of {{.getOrCreate()}}, a JVM process with the following 
options has been started:
{code}
maxim  76625 10.9  2.7 7975292 447068 pts/8  Sl+  16:59   0:11 
/home/maxim/.sdkman/candidates/java/current/bin/java -cp 
/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/conf:/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/jars/*
 -Xmx2g -XX:+IgnoreUnrecognizedVMOptions 
--add-opens=java.base/java.lang=ALL-UNNAMED 
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED 
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED 
--add-opens=java.base/java.io=ALL-UNNAMED 
--add-opens=java.base/java.net=ALL-UNNAMED 
--add-opens=java.base/java.nio=ALL-UNNAMED 
--add-opens=java.base/java.util=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED 
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED 
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED 
--add-opens=java.base/sun.security.action=ALL-UNNAMED 
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED 
--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED 
-Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit 
--conf spark.driver.memory=2g --conf spark.another.property=hi pyspark-shell
{code}

But it has not been stopped by the {{.stop()}} method call. The new Spark 
session reports 3 GB of driver memory but actually still uses the same 2 GB, 
and it still sees config options from the old session, which should have been 
closed at this point.

Summarizing:
* {{spark.stop()}} stops only the SparkContext, but not the JVM
* {{spark.stop()}} does not clean up the session config

This behavior has been observed for a long time, at least since 2.2.0, and is 
still present in 3.5.1.

This could be solved by stopping the JVM after the Spark session is stopped.
But just calling {{spark._jvm.System.exit(0)}} will fail because the py4j 
socket will be closed, and it will be impossible to start a new Spark session:
{code:python}
spark._jvm.System.exit(0)
{code}

{code}
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py",
 line 516, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py",
 line 1038, in send_command
response = connection.send_command(command)
   
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py",
 line 539, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py",
 line 1322, in __call__
return_value = get_return_value(
   ^
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py",
 line 179, in deco
return f(*a, **kw)
   ^^^
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/protocol.py",
 line 334, in get_return_value
raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling z:java.lang.System.exit
{code}

Here is an article describing the same issue, and also providing a safe way to 
stop JVM without breaking py4j (in Russian):
https://habr.com/ru/companies/sberbank/articles/805285/

{code:python}
from pyspark.sql import SparkSession
from pyspark import SparkContext
from threading import RLock
from py4j.protocol import CALL_COMMAND_NAME, END_COMMAND_PART


from pyspark.sql import SparkSession
from pyspark 

[jira] [Updated] (SPARK-46411) Change to use bcprov/bcpkix-jdk18on for test

2024-04-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-46411:
--
Fix Version/s: 3.5.2

> Change to use bcprov/bcpkix-jdk18on for test
> 
>
> Key: SPARK-46411
> URL: https://issues.apache.org/jira/browse/SPARK-46411
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45445) Upgrade snappy to 1.1.10.5

2024-04-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45445:
--
Parent: SPARK-47046
Issue Type: Sub-task  (was: Improvement)

> Upgrade snappy to 1.1.10.5
> --
>
> Key: SPARK-45445
> URL: https://issues.apache.org/jira/browse/SPARK-45445
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45323) Upgrade snappy to 1.1.10.4

2024-04-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45323:
--
Parent: SPARK-47046
Issue Type: Sub-task  (was: Dependency upgrade)

> Upgrade snappy to 1.1.10.4
> --
>
> Key: SPARK-45323
> URL: https://issues.apache.org/jira/browse/SPARK-45323
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0, 3.5.1
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Security fix:
> Fixed SnappyInputStream so as not to allocate too much memory when 
> decompressing data with an extremely large chunk size, by @tunnelshade (code 
> change).
> This does not affect users who only use the Snappy.compress/uncompress 
> methods.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47740) Stop JVM by calling SparkSession.stop from PySpark

2024-04-05 Thread Maxim Martynov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Martynov updated SPARK-47740:
---
Description: 
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", 
"2g").config("spark.another.property", "hi").getOrCreate()
spark.conf.get("spark.driver.memory")
# prints '2g'
spark.conf.get("spark.another.property")
# prints 'hi'

# check for JVM process before and after stop
spark.stop()

# try to recreate Spark session with different driver memory, and no 
"spark.another.property" at all
spark = SparkSession.builder.config("spark.driver.memory", "3g").getOrCreate()
spark.conf.get("spark.driver.memory")
# prints '3g'
spark.conf.get("spark.another.property")
# prints 'hi'

# check for JVM process
{code}

After the first call of {{.getOrCreate()}}, a JVM process with the following 
options has been started:
{code}
maxim  76625 10.9  2.7 7975292 447068 pts/8  Sl+  16:59   0:11 
/home/maxim/.sdkman/candidates/java/current/bin/java -cp 
/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/conf:/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/jars/*
 -Xmx2g -XX:+IgnoreUnrecognizedVMOptions 
--add-opens=java.base/java.lang=ALL-UNNAMED 
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED 
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED 
--add-opens=java.base/java.io=ALL-UNNAMED 
--add-opens=java.base/java.net=ALL-UNNAMED 
--add-opens=java.base/java.nio=ALL-UNNAMED 
--add-opens=java.base/java.util=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED 
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED 
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED 
--add-opens=java.base/sun.security.action=ALL-UNNAMED 
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED 
--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED 
-Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit 
--conf spark.driver.memory=2g --conf spark.another.property=hi pyspark-shell
{code}

But it has not been stopped by the {{.stop()}} method call. The new Spark 
session reports 3 GB of driver memory but actually still uses the same 2 GB, 
and it still sees config options from the old session, which should have been 
closed at this point.

Summarizing:
* {{spark.stop()}} stops only the SparkContext, but not the JVM
* {{spark.stop()}} does not clean up the session config

This behavior has been observed for a long time, at least since 2.2.0, and is 
still present in 3.5.1.

This could be solved by stopping the JVM after the Spark session is stopped.
But just calling {{spark._jvm.System.exit(0)}} will fail because the py4j 
socket will be closed, and it will be impossible to start a new Spark session:
{code:python}
spark._jvm.System.exit(0)
{code}

{code}
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py",
 line 516, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py",
 line 1038, in send_command
response = connection.send_command(command)
   
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py",
 line 539, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py",
 line 1322, in __call__
return_value = get_return_value(
   ^
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py",
 line 179, in deco
return f(*a, **kw)
   ^^^
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/protocol.py",
 line 334, in get_return_value
raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling z:java.lang.System.exit
{code}

Here is an article describing the same issue, and also providing a safe way to 
stop JVM without breaking py4j (in Russian):
https://habr.com/ru/companies/sberbank/articles/805285/

{code:python}
from pyspark.sql import SparkSession
from pyspark import SparkContext
from threading import RLock
from py4j.protocol import CALL_COMMAND_NAME, END_COMMAND_PART


from pyspark.sql import SparkSession
from pyspark 

[jira] [Commented] (SPARK-47740) Stop JVM by calling SparkSession.stop from PySpark

2024-04-05 Thread Maxim Martynov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834379#comment-17834379
 ] 

Maxim Martynov commented on SPARK-47740:


Here is a PR for adding helper function 
[jvm_shutdown(gateway)|https://github.com/py4j/py4j/pull/541] to Py4J.

> Stop JVM by calling SparkSession.stop from PySpark
> --
>
> Key: SPARK-47740
> URL: https://issues.apache.org/jira/browse/SPARK-47740
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.5.1
>Reporter: Maxim Martynov
>Priority: Major
>
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.config("spark.driver.memory", 
> "2g").config("spark.another.property", "hi").getOrCreate()
> spark.conf.get("spark.driver.memory")
> # prints '2g'
> spark.conf.get("spark.another.property")
> # prints 'hi'
> # check for JVM process before and after stop
> spark.stop()
> # try to recreate Spark session with different driver memory, and no 
> "spark.another.property" at all
> spark = SparkSession.builder.config("spark.driver.memory", "3g").getOrCreate()
> spark.conf.get("spark.driver.memory")
> # prints '3g'
> spark.conf.get("spark.another.property")
> # prints 'hi'
> # check for JVM process
> {code}
> After the first call of {{.getOrCreate()}}, a JVM process with the following 
> options has been started:
> {code}
> maxim  76625 10.9  2.7 7975292 447068 pts/8  Sl+  16:59   0:11 
> /home/maxim/.sdkman/candidates/java/current/bin/java -cp 
> /home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/conf:/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/jars/*
>  -Xmx2g -XX:+IgnoreUnrecognizedVMOptions 
> --add-opens=java.base/java.lang=ALL-UNNAMED 
> --add-opens=java.base/java.lang.invoke=ALL-UNNAMED 
> --add-opens=java.base/java.lang.reflect=ALL-UNNAMED 
> --add-opens=java.base/java.io=ALL-UNNAMED 
> --add-opens=java.base/java.net=ALL-UNNAMED 
> --add-opens=java.base/java.nio=ALL-UNNAMED 
> --add-opens=java.base/java.util=ALL-UNNAMED 
> --add-opens=java.base/java.util.concurrent=ALL-UNNAMED 
> --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED 
> --add-opens=java.base/sun.nio.ch=ALL-UNNAMED 
> --add-opens=java.base/sun.nio.cs=ALL-UNNAMED 
> --add-opens=java.base/sun.security.action=ALL-UNNAMED 
> --add-opens=java.base/sun.util.calendar=ALL-UNNAMED 
> --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED 
> -Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit 
> --conf spark.driver.memory=2g --conf spark.another.property=hi pyspark-shell
> {code}
> But it has not been stopped by the {{.stop()}} method call. The new Spark 
> session reports 3 GB of driver memory but actually still uses the same 2 GB, 
> and it still sees config options from the old session, which should have 
> been closed at this point.
> Summarizing:
> * {{spark.stop()}} stops only the SparkContext, but not the JVM
> * {{spark.stop()}} does not clean up the session config
> This behavior has been observed for a long time, at least since 2.2.0, and is 
> still present in 3.5.1.
> This could be solved by stopping the JVM after the Spark session is stopped.
> But just calling {{spark._jvm.System.exit(0)}} will fail because the py4j 
> socket will be closed, and it will be impossible to start a new Spark session:
> {code:python}
> spark._jvm.System.exit(0)
> {code}
> {code}
> ERROR:root:Exception while sending command.
> Traceback (most recent call last):
>   File 
> "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py",
>  line 516, in send_command
> raise Py4JNetworkError("Answer from Java side is empty")
> py4j.protocol.Py4JNetworkError: Answer from Java side is empty
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py",
>  line 1038, in send_command
> response = connection.send_command(command)
>
>   File 
> "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py",
>  line 539, in send_command
> raise Py4JNetworkError(
> py4j.protocol.Py4JNetworkError: Error while sending or receiving
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py",
>  line 1322, in __call__
> return_value = get_return_value(
>^
>   File 
> "/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py",
>  line 179, in deco
> 

[jira] [Updated] (SPARK-45172) Upgrade commons-compress.version to 1.24.0

2024-04-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45172:
--
Parent: SPARK-47046
Issue Type: Sub-task  (was: Improvement)

> Upgrade commons-compress.version to 1.24.0
> --
>
> Key: SPARK-45172
> URL: https://issues.apache.org/jira/browse/SPARK-45172
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44441) Upgrade bcprov-jdk15on and bcpkix-jdk15on to 1.70

2024-04-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44441:
---
Labels: pull-request-available  (was: )

> Upgrade bcprov-jdk15on and bcpkix-jdk15on to 1.70
> -
>
> Key: SPARK-44441
> URL: https://issues.apache.org/jira/browse/SPARK-44441
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44393) Upgrade H2 from 2.1.214 to 2.2.220

2024-04-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44393:
--
Fix Version/s: 3.4.3

> Upgrade H2 from 2.1.214 to 2.2.220
> --
>
> Key: SPARK-44393
> URL: https://issues.apache.org/jira/browse/SPARK-44393
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0, 3.4.3
>
>
> [CVE-2022-45868|https://nvd.nist.gov/vuln/detail/CVE-2022-45868]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44393) Upgrade H2 from 2.1.214 to 2.2.220

2024-04-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44393:
---
Labels: pull-request-available  (was: )

> Upgrade H2 from 2.1.214 to 2.2.220
> --
>
> Key: SPARK-44393
> URL: https://issues.apache.org/jira/browse/SPARK-44393
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> [CVE-2022-45868|https://nvd.nist.gov/vuln/detail/CVE-2022-45868]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44279) Upgrade optionator to ^0.9.3

2024-04-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44279:
--
Fix Version/s: 3.5.0

> Upgrade optionator to ^0.9.3
> 
>
> Key: SPARK-44279
> URL: https://issues.apache.org/jira/browse/SPARK-44279
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> [Regular Expression Denial of Service (ReDoS) - 
> CVE-2023-26115|https://github.com/jonschlinkert/word-wrap/issues/32]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47741) Handle stack overflow when parsing query as part of Execute immediate

2024-04-05 Thread Milan Stefanovic (Jira)
Milan Stefanovic created SPARK-47741:


 Summary: Handle stack overflow when parsing query as part of 
Execute immediate
 Key: SPARK-47741
 URL: https://issues.apache.org/jira/browse/SPARK-47741
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.1
Reporter: Milan Stefanovic
 Fix For: 4.0.0


EXECUTE IMMEDIATE can generate complex queries which can lead to a stack 
overflow while parsing.

We need to catch this error and convert it into a proper AnalysisException 
with an error class.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47740) Stop JVM by calling SparkSession.stop from PySpark

2024-04-05 Thread Maxim Martynov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Martynov updated SPARK-47740:
---
Description: 
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", 
"2g").config("spark.another.property", "hi").getOrCreate()
spark.conf.get("spark.driver.memory")
# prints '2g'
spark.conf.get("spark.another.property")
# prints 'hi'

# check for JVM process before and after stop
spark.stop()

# try to recreate Spark session with different driver memory, and no 
"spark.another.property" at all
spark = SparkSession.builder.config("spark.driver.memory", "3g").getOrCreate()
spark.conf.get("spark.driver.memory")
# prints '3g'
spark.conf.get("spark.another.property")
# prints 'hi'

# check for JVM process
{code}

After the first call of {{.getOrCreate()}}, a JVM process with the following options has been started:
{code}
maxim  76625 10.9  2.7 7975292 447068 pts/8  Sl+  16:59   0:11 
/home/maxim/.sdkman/candidates/java/current/bin/java -cp 
/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/conf:/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/jars/*
 -Xmx2g -XX:+IgnoreUnrecognizedVMOptions 
--add-opens=java.base/java.lang=ALL-UNNAMED 
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED 
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED 
--add-opens=java.base/java.io=ALL-UNNAMED 
--add-opens=java.base/java.net=ALL-UNNAMED 
--add-opens=java.base/java.nio=ALL-UNNAMED 
--add-opens=java.base/java.util=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED 
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED 
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED 
--add-opens=java.base/sun.security.action=ALL-UNNAMED 
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED 
--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED 
-Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit 
--conf spark.driver.memory=2g --conf spark.another.property=hi pyspark-shell
{code}

But it has not been stopped by the {{.stop()}} method call. The new Spark session, which reports 3g of driver memory, actually uses the same 2g, and keeps config options from the old session, which should have been closed at this moment.

Summarizing:
* {{spark.stop()}} stops only SparkContext, but not JVM
* {{spark.stop()}} does not clean up session config

This behavior has been observed for a long time, at least since 2.2.0, and is still present in 3.5.1.
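
A minimal sketch to make the "check for JVM process" steps above concrete (it assumes the driver JVM was launched through spark-submit, as in the ps output above):

{code:python}
# Assumed helper, not part of the reproduction above: list driver JVMs started by
# spark-submit on this machine. Run it before and after spark.stop(); the same PID
# (e.g. 76625 in the ps output above) shows up both times.
import subprocess

def driver_jvm_pids():
    result = subprocess.run(
        ["pgrep", "-f", "org.apache.spark.deploy.SparkSubmit"],
        capture_output=True,
        text=True,
    )
    return result.stdout.split()

print(driver_jvm_pids())  # before spark.stop()
# spark.stop()
print(driver_jvm_pids())  # after spark.stop(): the JVM is still there
{code}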

This could be solved by stopping the JVM after the Spark session is stopped.
But just calling {{spark._jvm.System.exit(0)}} will fail because the py4j socket 
will be closed, and it will be impossible to start a new Spark session:
{code:python}
spark._jvm.System.exit(0)
{code}

{code}
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py",
 line 516, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py",
 line 1038, in send_command
response = connection.send_command(command)
   
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py",
 line 539, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py",
 line 1322, in __call__
return_value = get_return_value(
   ^
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py",
 line 179, in deco
return f(*a, **kw)
   ^^^
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/protocol.py",
 line 334, in get_return_value
raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling z:java.lang.System.exit
{code}

Here is an article describing the same issue, and also providing a safe way to 
stop JVM without breaking py4j (in Russian):
https://habr.com/ru/companies/sberbank/articles/805285/

{code:python}
from pyspark.sql import SparkSession
from pyspark import SparkContext
from threading import RLock
from py4j.protocol import CALL_COMMAND_NAME, END_COMMAND_PART


from pyspark.sql import SparkSession
from pyspark 

[jira] [Created] (SPARK-47740) Stop JVM by calling SparkSession.stop from PySpark

2024-04-05 Thread Maxim Martynov (Jira)
Maxim Martynov created SPARK-47740:
--

 Summary: Stop JVM by calling SparkSession.stop from PySpark
 Key: SPARK-47740
 URL: https://issues.apache.org/jira/browse/SPARK-47740
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.5.1
Reporter: Maxim Martynov


{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.driver.memory", 
"2g").config("spark.another.property", "hi").getOrCreate()
spark.conf.get("spark.driver.memory")
# prints '2g'
spark.conf.get("spark.another.property")
# prints 'hi'

# check for JVM process before and after stop
spark.stop()

# try to recreate Spark session with different driver memory, and no 
"spark.another.property" at all
spark = SparkSession.builder.config("spark.driver.memory", "3g").getOrCreate()
spark.conf.get("spark.driver.memory")
# prints '3g'
spark.conf.get("spark.another.property")
# prints 'hi'

# check for JVM process
{code}

After the first call of {{.getOrCreate()}}, a JVM process with the following options has been started:
{code}
maxim  76625 10.9  2.7 7975292 447068 pts/8  Sl+  16:59   0:11 
/home/maxim/.sdkman/candidates/java/current/bin/java -cp 
/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/conf:/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/jars/*
 -Xmx2g -XX:+IgnoreUnrecognizedVMOptions 
--add-opens=java.base/java.lang=ALL-UNNAMED 
--add-opens=java.base/java.lang.invoke=ALL-UNNAMED 
--add-opens=java.base/java.lang.reflect=ALL-UNNAMED 
--add-opens=java.base/java.io=ALL-UNNAMED 
--add-opens=java.base/java.net=ALL-UNNAMED 
--add-opens=java.base/java.nio=ALL-UNNAMED 
--add-opens=java.base/java.util=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent=ALL-UNNAMED 
--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED 
--add-opens=java.base/sun.nio.ch=ALL-UNNAMED 
--add-opens=java.base/sun.nio.cs=ALL-UNNAMED 
--add-opens=java.base/sun.security.action=ALL-UNNAMED 
--add-opens=java.base/sun.util.calendar=ALL-UNNAMED 
--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED 
-Djdk.reflect.useDirectMethodHandle=false org.apache.spark.deploy.SparkSubmit 
--conf spark.driver.memory=2g --conf spark.another.property=hi pyspark-shell
{code}

But it has not been stopped by the {{.stop()}} method call. The new Spark session, which reports 3g of driver memory, actually uses the same 2g, and keeps config options from the old session, which should have been closed at this moment.

Summarizing:
* {{spark.stop()}} stops only SparkContext, but not JVM
* {{spark.stop()}} does not clean up session config

This behavior has been observed for a long time, at least since 2.2.0, and is still present in 3.5.1.
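
One way to confirm this, assuming a classic (non-Connect) PySpark session, is to ask the driver JVM for its actual heap limit through py4j:

{code:python}
# Sketch: Runtime.maxMemory() reflects the -Xmx2g the JVM was actually started with,
# regardless of what the rebuilt session's config claims after the second getOrCreate().
heap_gib = spark._jvm.java.lang.Runtime.getRuntime().maxMemory() / 1024 ** 3
print(f"driver max heap: {heap_gib:.1f} GiB")   # roughly 2, not 3
print(spark.conf.get("spark.driver.memory"))    # still reports '3g'
{code}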

This could be solved by stopping the JVM after the Spark session is stopped.
But just calling {{spark._jvm.System.exit(0)}} will fail because the py4j socket 
will be closed, and it will be impossible to start a new Spark session:
{code:python}
spark._jvm.System.exit(0)
{code}

{code}
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py",
 line 516, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py",
 line 1038, in send_command
response = connection.send_command(command)
   
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/clientserver.py",
 line 539, in send_command
raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending or receiving
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/java_gateway.py",
 line 1322, in __call__
return_value = get_return_value(
   ^
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py",
 line 179, in deco
return f(*a, **kw)
   ^^^
  File 
"/home/maxim/Repo/devdocs/DataLineage/0-RnD/marquez/venv/lib/python3.11/site-packages/py4j/protocol.py",
 line 334, in get_return_value
raise Py4JError(
py4j.protocol.Py4JError: An error occurred while calling z:java.lang.System.exit
{code}

Here is a description of a safe way to stop JVM without breaking py4j (in 
Russian):
https://habr.com/ru/companies/sberbank/articles/805285/

{code:python}
from pyspark.sql import SparkSession
from pyspark import 

[jira] [Resolved] (SPARK-47735) Make pyspark.testing.connectutils compatible with pyspark-connect

2024-04-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47735.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45887
[https://github.com/apache/spark/pull/45887]

> Make pyspark.testing.connectutils compatible with pyspark-connect
> -
>
> Key: SPARK-47735
> URL: https://issues.apache.org/jira/browse/SPARK-47735
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47735) Make pyspark.testing.connectutils compatible with pyspark-connect

2024-04-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47735:
-

Assignee: Hyukjin Kwon

> Make pyspark.testing.connectutils compatible with pyspark-connect
> -
>
> Key: SPARK-47735
> URL: https://issues.apache.org/jira/browse/SPARK-47735
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46632) EquivalentExpressions throw IllegalStateException

2024-04-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46632:
---
Labels: pull-request-available  (was: )

> EquivalentExpressions throw IllegalStateException
> -
>
> Key: SPARK-46632
> URL: https://issues.apache.org/jira/browse/SPARK-46632
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core, SQL
>Affects Versions: 3.3.0, 3.4.0, 3.5.0
>Reporter: zhangzhenhao
>Priority: Major
>  Labels: pull-request-available
>
> EquivalentExpressions throws IllegalStateException with some IF expressions
> ```scala
> import org.apache.spark.sql.catalyst.dsl.expressions.DslExpression
> import org.apache.spark.sql.catalyst.expressions.{EquivalentExpressions, If, Literal}
> import org.apache.spark.sql.functions.col
> val one = Literal(1.0)
> val y = col("y").expr
> val e1 = If(
>   Literal(true),
>   y * one * one + one * one * y,
>   y * one * one + one * one * y
> )
> (new EquivalentExpressions).addExprTree(e1)
> ```
>  
> The result is:
> ```
> java.lang.IllegalStateException: Cannot update expression: (1.0 * 1.0) in 
> map: Map(ExpressionEquals(('y * 1.0)) -> ExpressionStats(('y * 1.0))) with 
> use count: -1
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprInMap(EquivalentExpressions.scala:85)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:198)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1(EquivalentExpressions.scala:200)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1$adapted(EquivalentExpressions.scala:200)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:200)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1(EquivalentExpressions.scala:200)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$1$adapted(EquivalentExpressions.scala:200)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:200)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateCommonExprs(EquivalentExpressions.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$3(EquivalentExpressions.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.$anonfun$updateExprTree$3$adapted(EquivalentExpressions.scala:201)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.updateExprTree(EquivalentExpressions.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.EquivalentExpressions.addExprTree(EquivalentExpressions.scala:188)
>   ... 49 elided
> ```



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1

2024-04-05 Thread Christos Karras (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834315#comment-17834315
 ] 

Christos Karras edited comment on SPARK-46143 at 4/5/24 1:47 PM:
-

I have the same issue too. The problem is that the squeeze parameter of 
read_excel has been deprecated since pandas 1.4 and completely removed in pandas 
2.0, but PySpark's implementation of read_excel keeps passing the squeeze 
parameter, even though the parameter has also been deprecated since PySpark 3.4.

Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not 
supported, a fix to avoid the need to stay with pandas 1.x could be to detect 
the pandas version and decide if the squeeze parameter should be passed 
depending on the pandas version. Also, if the squeeze parameter is passed in 
the Pyspark function, raise an error if a version of pandas that no longer 
supports this parameter is installed. This would be a transition solution until 
the squeeze parameter is also removed completely from Pyspark.

 

Modify pyspark\pandas\namespace.py:
 * Change the squeeze parameter of the read_excel function to be "squeeze: 
Optional[bool] = None"
 * Modify the nested pd_read_excel function to check for the pandas version and 
build a dict of arguments it will pass to pd.read_excel based on that version. 
Also consider if the squeeze parameter was passed by the caller or not. And if 
the caller specified that argument but a newer version of pandas that doesn't 
support it, raise an exception

 

Sample code that could implement this solution:
def pd_read_excel(
    io_or_bin: Any,
    sn: Union[str, int, List[Union[str, int]], None], sq: bool
) -> pd.DataFrame:

    read_excel_args: dict = {
        "io": BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin,
        "sheet_name": sn,
        "header": header,
        # TODO other args...,
        **kwds
    }

    if squeeze is not None:
        pandas_version_major = int(pd.__version__.split(".")[0])
        if pandas_version_major >= 2:
            raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x")

        read_excel_args["squeeze"] = squeeze
    return pd.read_excel(**read_excel_args)
 


was (Author: JIRAUSER304886):
I have the same issue too. The problem is because the squeeze parameter of 
read_excel has been deprecated since pandas version 1.4, and has been 
completely removed in pandas 2.0. But Pyspark's implementation of read_excel 
keeps passing the squeeze parameter, even though the parameter is also 
deprecated since pyspark 3.4.

Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not 
supported, a fix to avoid the need to stay with pandas 1.x could be to detect 
the pandas version and decide if the squeeze parameter should be passed 
depending on the pandas version. Also, if the squeeze parameter is passed in 
the Pyspark function, raise an error if a versio of pandas that no longer 
supports this parameter is installed. This would be a transition solution until 
the squeeze parameter is also removed completely from Pyspark.

 

Modify pyspark\pandas\namespace.py:
 * Change the squeeze parameter of the read_excel function to be "squeeze: 
Optional[bool] = None"
 * Modify the nested pd_read_excel function to check for the pandas version and 
build a dict of arguments it will pass to pd.read_excel based on that version. 
Also consider if the squeeze parameter was passed by the caller or not. And if 
the caller specified that argument but a newer version of pandas that doesn't 
support it, raise an exception

 

Sample code that could implement this solution:
def pd_read_excel(
    io_or_bin: Any,
    sn: Union[str, int, List[Union[str, int]], None], sq: bool
) -> pd.DataFrame:

    read_excel_args: dict = {
        "io": BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin,
        "sheet_name": sn,
        "header": header,
        # TODO other args...,
        **kwds
    }

    if squeeze is not None:
        if pandas_version >= 2:
            raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x")

        read_excel_args["squeeze"] = squeeze
    return pd.read_excel(**read_excel_args)
 

> pyspark.pandas read_excel implementation at version 3.4.1
> -
>
> Key: SPARK-46143
> URL: https://issues.apache.org/jira/browse/SPARK-46143
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.1
> Environment: pyspark 3.4.1.5.3 build 20230713.
> Running on Microsoft Fabric workspace at runtime 1.2.
> Tested the same scenario on a spark 3.4.1 standalone deployment on docker 
> documented at 

[jira] [Comment Edited] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1

2024-04-05 Thread Christos Karras (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834315#comment-17834315
 ] 

Christos Karras edited comment on SPARK-46143 at 4/5/24 1:42 PM:
-

I have the same issue too. The problem is because the squeeze parameter of 
read_excel has been deprecated since pandas version 1.4, and has been 
completely removed in pandas 2.0. But Pyspark's implementation of read_excel 
keeps passing the squeeze parameter, even though the parameter is also 
deprecated since pyspark 3.4.

Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not 
supported, a fix to avoid the need to stay with pandas 1.x could be to detect 
the pandas version and decide if the squeeze parameter should be passed 
depending on the pandas version. Also, if the squeeze parameter is passed in 
the Pyspark function, raise an error if a version of pandas that no longer 
supports this parameter is installed. This would be a transition solution until 
the squeeze parameter is also removed completely from Pyspark.

 

Modify pyspark\pandas\namespace.py:
 * Change the squeeze parameter of the read_excel function to be "squeeze: 
Optional[bool] = None"
 * Modify the nested pd_read_excel function to check for the pandas version and 
build a dict of arguments it will pass to pd.read_excel based on that version. 
Also consider if the squeeze parameter was passed by the caller or not. And if 
the caller specified that argument but a newer version of pandas that doesn't 
support it, raise an exception

 

Sample code that could implement this solution:
def pd_read_excel(
    io_or_bin: Any,
    sn: Union[str, int, List[Union[str, int]], None], sq: bool
) -> pd.DataFrame:

    read_excel_args: dict = {
        "io": BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin,
        "sheet_name": sn,
        "header": header,
        # TODO other args...,
        **kwds
    }

    if squeeze is not None:
        if pandas_version >= 2:
            raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x")

        read_excel_args["squeeze"] = squeeze
    return pd.read_excel(**read_excel_args)
 


was (Author: JIRAUSER304886):
I have the same issue too. The problem is because the squeeze parameter of 
read_excel has been deprecated since pandas version 1.4, and has been 
completely removed in pandas 2.0. But Pyspark's implementation of read_excel 
keeps passing the squeeze parameter, even though the parameter is also 
deprecated since pyspark 3.4.

Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not 
supported, a fix to avoid the need to stay with pandas 1.x could be to detect 
the pandas version and decide if the squeeze parameter should be passed 
depending on the pandas version. Also, if the squeeze parameter is passed in 
the Pyspark function, raise an error if a versio of pandas that no longer 
supports this parameter is installed. This would be a transition solution until 
the squeeze parameter is also removed completely from Pyspark.

 

Modify pyspark\pandas\namespace.py:
 * Change the squeeze parameter of the read_excel function to be "squeeze: 
Optional[bool] = None"
 * Modify the nested pd_read_excel function to check for the pandas version and 
build a dict of arguments it will pass to pd.read_excel based on that version. 
Also consider if the squeeze parameter was passed by the caller or not. And if 
the caller specified that argument but a newer version of pandas that doesn't 
support it, raise an exception

 

Sample code that could implement this solution:
def pd_read_excel(
    io_or_bin: Any,
    sn: Union[str, int, List[Union[str, int]], None], sq: bool   
) -> pd.DataFrame:

    read_excel_args: dict = {
        "io":BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) 
else io_or_bin,
        "sheet_name":sn,
        "header":header,
        # TODO other args...,
        **kwds
    }

    if squeeze is not None:
        if pandas_version >= 2:
            raise Exception("The squeeze parameter for read_excel is no longer 
available in pandas 2.x")
       
        read_excel_args["squeeze"] = squeeze
    return pd.read_excel(**read_excel_args)
 

> pyspark.pandas read_excel implementation at version 3.4.1
> -
>
> Key: SPARK-46143
> URL: https://issues.apache.org/jira/browse/SPARK-46143
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.1
> Environment: pyspark 3.4.1.5.3 build 20230713.
> Running on Microsoft Fabric workspace at runtime 1.2.
> Tested the same scenario on a spark 3.4.1 standalone deployment on docker 
> documented at https://github.com/mpavanetti/sparkenv
>  
>  
>Reporter: Matheus 

[jira] [Comment Edited] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1

2024-04-05 Thread Christos Karras (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834315#comment-17834315
 ] 

Christos Karras edited comment on SPARK-46143 at 4/5/24 1:42 PM:
-

I have the same issue too. The problem is because the squeeze parameter of 
read_excel has been deprecated since pandas version 1.4, and has been 
completely removed in pandas 2.0. But Pyspark's implementation of read_excel 
keeps passing the squeeze parameter, even though the parameter is also 
deprecated since pyspark 3.4.

Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not 
supported, a fix to avoid the need to stay with pandas 1.x could be to detect 
the pandas version and decide if the squeeze parameter should be passed 
depending on the pandas version. Also, if the squeeze parameter is passed in 
the Pyspark function, raise an error if a version of pandas that no longer 
supports this parameter is installed. This would be a transition solution until 
the squeeze parameter is also removed completely from Pyspark.

 

Modify pyspark\pandas\namespace.py:
 * Change the squeeze parameter of the read_excel function to be "squeeze: 
Optional[bool] = None"
 * Modify the nested pd_read_excel function to check for the pandas version and 
build a dict of arguments it will pass to pd.read_excel based on that version. 
Also consider if the squeeze parameter was passed by the caller or not. And if 
the caller specified that argument but a newer version of pandas that doesn't 
support it, raise an exception

 

Sample code that could implement this solution:
def pd_read_excel(
    io_or_bin: Any,
    sn: Union[str, int, List[Union[str, int]], None], sq: bool
) -> pd.DataFrame:

    read_excel_args: dict = {
        "io": BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin,
        "sheet_name": sn,
        "header": header,
        # TODO other args...,
        **kwds
    }

    if squeeze is not None:
        if pandas_version >= 2:
            raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x")

        read_excel_args["squeeze"] = squeeze
    return pd.read_excel(**read_excel_args)
 


was (Author: JIRAUSER304886):
I have the same issue too. The problem is because the squeeze parameter of 
read_excel has been deprecated since pandas version 1.4, and has been 
completely removed in pandas 2.0. But Pyspark's implementation of read_excel 
keeps passing the squeeze parameter, even though the parameter is also 
deprecated since pyspark 3.4.

Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not 
supported, a fix to avoid the need to stay with pandas 1.x could be to detect 
the pandas version and decide if the squeeze parameter should be passed 
depending on the pandas version. Also, if the squeeze parameter is passed in 
the Pyspark function, raise an error if a versio of pandas that no longer 
supports this parameter is installed. This would be a transition solution until 
the squeeze parameter is also removed completely from Pyspark.

 

Modify pyspark\pandas\namespace.py:
 * Change the squeeze parameter of the read_excel function to be "squeeze: 
Optional[bool] = None"
 * Modify the nested pd_read_excel function to check for the pandas version and 
build a dict of arguments it will pass to pd.read_excel based on that version. 
Also consider if the squeeze parameter was passed by the caller or not. And if 
the caller specified that argument but a newer version of pandas that doesn't 
support it, raise an exception

 

Sample code that could implement this solution:
def pd_read_excel(
    io_or_bin: Any,
    sn: Union[str, int, List[Union[str, int]], None], sq: bool
) -> pd.DataFrame:

    read_excel_args: dict = {
        "io": BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin,
        "sheet_name": sn,
        "header": header,
        # TODO other args...,
        **kwds
    }

    if squeeze is not None:
        if pandas_version >= 2:
            raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x")

        read_excel_args["squeeze"] = squeeze
    return pd.read_excel(**read_excel_args)
 

> pyspark.pandas read_excel implementation at version 3.4.1
> -
>
> Key: SPARK-46143
> URL: https://issues.apache.org/jira/browse/SPARK-46143
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.1
> Environment: pyspark 3.4.1.5.3 build 20230713.
> Running on Microsoft Fabric workspace at runtime 1.2.
> Tested the same scenario on a spark 3.4.1 standalone deployment on docker 
> documented at https://github.com/mpavanetti/sparkenv
>  
>  
>Reporter: Matheus 

[jira] [Comment Edited] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1

2024-04-05 Thread Christos Karras (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834315#comment-17834315
 ] 

Christos Karras edited comment on SPARK-46143 at 4/5/24 1:41 PM:
-

I have the same issue too. The problem is because the squeeze parameter of 
read_excel has been deprecated since pandas version 1.4, and has been 
completely removed in pandas 2.0. But Pyspark's implementation of read_excel 
keeps passing the squeeze parameter, even though the parameter is also 
deprecated since pyspark 3.4.

Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not 
supported, a fix to avoid the need to stay with pandas 1.x could be to detect 
the pandas version and decide if the squeeze parameter should be passed 
depending on the pandas version. Also, if the squeeze parameter is passed in 
the Pyspark function, raise an error if a version of pandas that no longer 
supports this parameter is installed. This would be a transition solution until 
the squeeze parameter is also removed completely from Pyspark.

 

Modify pyspark\pandas\namespace.py:
 * Change the squeeze parameter of the read_excel function to be "squeeze: 
Optional[bool] = None"
 * Modify the nested pd_read_excel function to check for the pandas version and 
build a dict of arguments it will pass to pd.read_excel based on that version. 
Also consider if the squeeze parameter was passed by the caller or not. And if 
the caller specified that argument but a newer version of pandas that doesn't 
support it, raise an exception

 

Sample code that could implement this solution:
def pd_read_excel(
    io_or_bin: Any,
    sn: Union[str, int, List[Union[str, int]], None], sq: bool   
) -> pd.DataFrame:

    read_excel_args: dict = {
        "io":BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) 
else io_or_bin,
        "sheet_name":sn,
        "header":header,
        # TODO other args...,
        **kwds
    }

    if squeeze is not None:
        if pandas_version >= 2:
            raise Exception("The squeeze parameter for read_excel is no longer 
available in pandas 2.x")
       
        read_excel_args["squeeze"] = squeeze
    return pd.read_excel(**read_excel_args)
 


was (Author: JIRAUSER304886):
I have the same issue too. The problem is because the squeeze parameter of 
read_excel has been deprecated since pandas version 1.4, and has been 
completely removed in pandas 2.0. But Pyspark's implementation of read_excel 
keeps passing the squeeze parameter, even though the parameter is also 
deprecated since pyspark 3.4.


Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not 
supported, a fix to avoid the need to stay with pandas 1.x could be to detect 
the pandas version and decide if the squeeze parameter should be passed 
depending on the pandas version. Also, if the squeeze parameter is passed in 
the Pyspark function, raise an error if a versio of pandas that no longer 
supports this parameter is installed. This would be a transition solution until 
the squeeze parameter is also removed completely from Pyspark.

 

Modify pyspark\pandas\namespace.py:
 * Change the squeeze parameter of the read_excel function to be "squeeze: 
Optional[bool] = None"
 * Modify the nested pd_read_excel function to check for the pandas version and 
build a dict of arguments it will pass to pd.read_excel based on that version. 
Also consider if the squeeze parameter was passed by the caller or not. And if 
the caller specified that argument but a newer version of pandas that doesn't 
support it, raise an exception:

def pd_read_excel(
    io_or_bin: Any, sn: Union[str, int, List[Union[str, int]], None], sq: bool
) -> pd.DataFrame:
    read_excel_args: dict = {
        "io": BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin,
        "sheet_name": sn,
        "header": header,
        ...,
        **kwds
    }

    if squeeze is not None:
        if pandas_version >= 2:
            raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x")

        read_excel_args["squeeze"] = squeeze
    return pd.read_excel(**read_excel_args)

> pyspark.pandas read_excel implementation at version 3.4.1
> -
>
> Key: SPARK-46143
> URL: https://issues.apache.org/jira/browse/SPARK-46143
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.1
> Environment: pyspark 3.4.1.5.3 build 20230713.
> Running on Microsoft Fabric workspace at runtime 1.2.
> Tested the same scenario on a spark 3.4.1 standalone deployment on docker 
> documented at https://github.com/mpavanetti/sparkenv
>  
>  
>Reporter: Matheus Pavanetti
>Priority: Major
> Attachments: 

[jira] [Commented] (SPARK-46143) pyspark.pandas read_excel implementation at version 3.4.1

2024-04-05 Thread Christos Karras (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-46143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834315#comment-17834315
 ] 

Christos Karras commented on SPARK-46143:
-

I have the same issue too. The problem is because the squeeze parameter of 
read_excel has been deprecated since pandas version 1.4, and has been 
completely removed in pandas 2.0. But Pyspark's implementation of read_excel 
keeps passing the squeeze parameter, even though the parameter is also 
deprecated since pyspark 3.4.


Since there's no version constraint in Pyspark that indicates Pandas 2.0 is not 
supported, a fix to avoid the need to stay with pandas 1.x could be to detect 
the pandas version and decide if the squeeze parameter should be passed 
depending on the pandas version. Also, if the squeeze parameter is passed in 
the Pyspark function, raise an error if a version of pandas that no longer 
supports this parameter is installed. This would be a transition solution until 
the squeeze parameter is also removed completely from Pyspark.

 

Modify pyspark\pandas\namespace.py:
 * Change the squeeze parameter of the read_excel function to be "squeeze: 
Optional[bool] = None"
 * Modify the nested pd_read_excel function to check for the pandas version and 
build a dict of arguments it will pass to pd.read_excel based on that version. 
Also consider if the squeeze parameter was passed by the caller or not. And if 
the caller specified that argument but a newer version of pandas that doesn't 
support it, raise an exception:

def pd_read_excel(
    io_or_bin: Any, sn: Union[str, int, List[Union[str, int]], None], sq: bool
) -> pd.DataFrame:
    read_excel_args: dict = {
        "io": BytesIO(io_or_bin) if isinstance(io_or_bin, (bytes, bytearray)) else io_or_bin,
        "sheet_name": sn,
        "header": header,
        ...,
        **kwds
    }

    if squeeze is not None:
        if pandas_version >= 2:
            raise Exception("The squeeze parameter for read_excel is no longer available in pandas 2.x")

        read_excel_args["squeeze"] = squeeze
    return pd.read_excel(**read_excel_args)

> pyspark.pandas read_excel implementation at version 3.4.1
> -
>
> Key: SPARK-46143
> URL: https://issues.apache.org/jira/browse/SPARK-46143
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.1
> Environment: pyspark 3.4.1.5.3 build 20230713.
> Running on Microsoft Fabric workspace at runtime 1.2.
> Tested the same scenario on a spark 3.4.1 standalone deployment on docker 
> documented at https://github.com/mpavanetti/sparkenv
>  
>  
>Reporter: Matheus Pavanetti
>Priority: Major
> Attachments: MicrosoftTeams-image.png, 
> image-2023-11-28-13-20-40-275.png, image-2023-11-28-13-20-51-291.png
>
>
> Hello, 
> I would like to report an issue with pyspark.pandas implementation on 
> read_excel function.
> Microsoft Fabric spark environment 1.2 (runtime) uses pyspark 3.4.1 which 
> potentially uses an older version of pandas on it's implementations of 
> pyspark.pandas.
> The function read_excel from pandas doesn't expect a parameter called 
> "squeeze" however it's implemented as part of pyspark.pandas and the 
> parameter "squeeze" is being passed to the pandas function.
>  
> !image-2023-11-28-13-20-40-275.png!
>  
> I've been digging into it for further investigation into pyspark 3.4.1 
> documentation
> [https://spark.apache.org/docs/3.4.1/api/python/_modules/pyspark/pandas/namespace.html#read_excel|https://mcas-proxyweb.mcas.ms/certificate-checker?login=false=https%3A%2F%2Fspark.apache.org.mcas.ms%2Fdocs%2F3.4.1%2Fapi%2Fpython%2F_modules%2Fpyspark%2Fpandas%2Fnamespace.html%3FMcasTsid%3D20893%23read_excel=92c0f0a0811f59386edd92fd5f3fcb0ac451ce363b3f2e01ed076f45e2b20500]
>  
> This is the point I found that "squeeze" parameter is being passed to pandas 
> read_excel function which is not expected.
> It seems like it was deprecated as part of pyspark 3.4.0 but still being used 
> in the implementation.
>  
> !image-2023-11-28-13-20-51-291.png!
>  
> I believe this is an issue with the pyspark 3.4.1 implementation, not necessarily 
> with fabric. However fabric uses this version as its 1.2 build.
>  
> I am able to work around that for now by downloading the excel from the OneLake 
> to the spark driver, loading it into memory with pandas and then converting it to 
> a spark dataframe, or I made it work by downgrading the build.
> I downloaded the pyspark build 20230713 to my local, made the changes and 
> re-compiled it and it worked locally. So it means that is related to the 
> implementation and they would have to fix or I do a downgrade to older 
> version like 3.3.3 or try the latest 3.5.0 which is not the case for fabric
>  
>  
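
A sketch of that workaround, with a placeholder path and sheet name (an Excel engine such as openpyxl is assumed to be installed): read the workbook with plain pandas on the driver and convert the result, bypassing pyspark.pandas.read_excel and its removed squeeze argument.

{code:python}
# Workaround sketch with placeholder inputs: read with pandas 2.x directly on the
# driver, then convert to a Spark DataFrame.
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pdf = pd.read_excel("/tmp/report.xlsx", sheet_name="Sheet1")  # no squeeze involved
sdf = spark.createDataFrame(pdf)
sdf.show()
{code}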



--
This message was sent 

[jira] [Updated] (SPARK-47737) Bump PyArrow to 15.0.2

2024-04-05 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47737:
---
Labels: pull-request-available  (was: )

> Bump PyArrow to 15.0.2
> --
>
> Key: SPARK-47737
> URL: https://issues.apache.org/jira/browse/SPARK-47737
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Use the latest version for stability.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47736) Add support for AbstractArrayType(StringTypeCollated)

2024-04-05 Thread Mihailo Milosevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihailo Milosevic updated SPARK-47736:
--
Summary: Add support for AbstractArrayType(StringTypeCollated)  (was: Add 
support for ArrayType(StringTypeAnyCollation))

> Add support for AbstractArrayType(StringTypeCollated)
> -
>
> Key: SPARK-47736
> URL: https://issues.apache.org/jira/browse/SPARK-47736
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47736) Add support for ArrayType(StringTypeAnyCollation)

2024-04-05 Thread Mihailo Milosevic (Jira)
Mihailo Milosevic created SPARK-47736:
-

 Summary: Add support for ArrayType(StringTypeAnyCollation)
 Key: SPARK-47736
 URL: https://issues.apache.org/jira/browse/SPARK-47736
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Mihailo Milosevic






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org