[jira] [Commented] (SPARK-40421) Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603892#comment-17603892
 ] 

Apache Spark commented on SPARK-40421:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37874
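For context, `DataFrame.corr` here is the pandas-on-Spark API (the `ps` component), which is intended to follow the pandas signature. A minimal sketch of the intended usage once missing values and `min_periods` are supported (assumed to mirror pandas; not taken from the PR itself):

{code:python}
# Hedged sketch: expected behavior, mirroring pandas' DataFrame.corr.
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1.0, 2.0, None, 4.0],
                     "b": [4.0, None, 2.0, 1.0]})

# Spearman correlation that skips missing values pairwise and yields NaN for
# column pairs with fewer than `min_periods` complete observations.
print(psdf.corr(method="spearman", min_periods=3))
{code}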

> Make `spearman` correlation in `DataFrame.corr` support missing values and 
> `min_periods`
> 
>
> Key: SPARK-40421
> URL: https://issues.apache.org/jira/browse/SPARK-40421
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40421) Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40421:


Assignee: Apache Spark

> Make `spearman` correlation in `DataFrame.corr` support missing values and 
> `min_periods`
> 
>
> Key: SPARK-40421
> URL: https://issues.apache.org/jira/browse/SPARK-40421
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40421) Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40421:


Assignee: (was: Apache Spark)

> Make `spearman` correlation in `DataFrame.corr` support missing values and 
> `min_periods`
> 
>
> Key: SPARK-40421
> URL: https://issues.apache.org/jira/browse/SPARK-40421
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40421) Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603891#comment-17603891
 ] 

Apache Spark commented on SPARK-40421:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37874

> Make `spearman` correlation in `DataFrame.corr` support missing values and 
> `min_periods`
> 
>
> Key: SPARK-40421
> URL: https://issues.apache.org/jira/browse/SPARK-40421
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40421) Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`

2022-09-13 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-40421:
-

 Summary: Make `spearman` correlation in `DataFrame.corr` support 
missing values and `min_periods`
 Key: SPARK-40421
 URL: https://issues.apache.org/jira/browse/SPARK-40421
 Project: Spark
  Issue Type: Sub-task
  Components: ps
Affects Versions: 3.4.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40420) Sort message parameters in the JSON formats

2022-09-13 Thread Max Gekk (Jira)
Max Gekk created SPARK-40420:


 Summary: Sort message parameters in the JSON formats
 Key: SPARK-40420
 URL: https://issues.apache.org/jira/browse/SPARK-40420
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk


Sort the message parameters by name in the MINIMAL and STANDARD error-message 
formats. Currently, the order depends on the internal implementation of Map, 
and as a consequence the output is not stable.
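The change itself lives in Spark's Scala error-message framework, but the underlying idea is simply deterministic key ordering before serialization. A minimal Python illustration of the concept (not Spark's actual code):

{code:python}
import json

# Whatever the insertion order of the underlying map, sorting the keys makes
# the serialized output stable across runs.
message_parameters = {"proposal": "`v`", "objectName": "`t`"}
print(json.dumps(message_parameters, sort_keys=True))
# {"objectName": "`t`", "proposal": "`v`"}
{code}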



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-40303) The performance will be worse after codegen

2022-09-13 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603883#comment-17603883
 ] 

Yang Jie edited comment on SPARK-40303 at 9/14/22 5:24 AM:
---

I did a simple experiment to compare the following scenarios:
 # A method with 127 input parameters
 # Encapsulate the input parameters of the above method as a specific type, 
including 127 fields, and create a new parameter object before each call
 # Encapsulate the input parameters of the above method as a specific type, 
including 127 fields, reuse one parameter object and reset + refill parameter 
data before each call

 

I confirmed that the JIT compilation failure mentioned above occurs in test 
scenarios 1 and 2. The test results are as follows:

 

Java 8

 
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test sum:                                 Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative

Use multiple parameters method                    35772          36161         
550          0.3        3577.2       1.0X
Use TestParameters create new                     38701          38783         
115          0.3        3870.1       0.9X
Use TestParameters reuse                          17986          18125         
196          0.6        1798.6       2.0X {code}
Java 11

 
{code:java}
OpenJDK 64-Bit Server VM 11.0.16+8-LTS on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test sum:                                 Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative

Use multiple parameters method                    12253          12286          
46          0.8        1225.3       1.0X
Use TestParameters create new                     13644          13665          
30          0.7        1364.4       0.9X
Use TestParameters reuse                          13188          13219          
44          0.8        1318.8       0.9X {code}
Java 17

 
{code:java}
OpenJDK 64-Bit Server VM 17.0.4+8-LTS on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test sum:                                 Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative

Use multiple parameters method                    14044          14128         
119          0.7        1404.4       1.0X
Use TestParameters create new                     16174          16289         
162          0.6        1617.4       0.9X
Use TestParameters reuse                          15633          15638          
 8          0.6        1563.3       0.9X {code}
From the test results, encapsulating and reusing a specific parameter type only 
alleviates the problem on Java 8; running the program on Java 11 or Java 17 
seems to be a simpler and more effective way.

 

So for the current issue, I suggest upgrading the Java runtime environment to 
solve the problem. [~yumwang] 

 

I have uploaded the test program as an attachment.


was (Author: luciferyang):
I did a simple experiment to compare the following scenarios:
 # A method with 127 input parameters
 # Encapsulate the input parameters of the above method as a specific type, 
including 127 fields, and create a new parameter object before each call
 # Encapsulate the input parameters of the above method as a specific type, 
including 127 fields, reuse one parameter object and reset + refill parameter 
data before each call

 

I confirmed that the JIT compilation failure mentioned above will occur in the 
1&2 test scenarios, the test result as follows:

 

Java 8

 
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test sum:                                 Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative

Use multiple parameters method                    35772          36161         
550          0.3        3577.2       1.0X
Use TestParameters create new                     38701          38783         
115          0.3        3870.1       0.9X
Use TestParameters reuse                          17986          18125         
196          0.6        1798.6       2.0X {code}
Java 11

 
{code:java}
OpenJDK 64-Bit Server VM 11.0.16+8-LTS on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test sum:                                 

[jira] [Commented] (SPARK-40303) The performance will be worse after codegen

2022-09-13 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603883#comment-17603883
 ] 

Yang Jie commented on SPARK-40303:
--

I did a simple experiment to compare the following scenarios:
 # A method with 127 input parameters
 # Encapsulate the input parameters of the above method as a specific type, 
including 127 fields, and create a new parameter object before each call
 # Encapsulate the input parameters of the above method as a specific type, 
including 127 fields, reuse one parameter object and reset + refill parameter 
data before each call

 

I confirmed that the JIT compilation failure mentioned above occurs in test 
scenarios 1 and 2. The test results are as follows:

 

Java 8

 
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_345-b01 on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Test sum:                                 Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative

Use multiple parameters method                    35772          36161         
550          0.3        3577.2       1.0X
Use TestParameters create new                     38701          38783         
115          0.3        3870.1       0.9X
Use TestParameters reuse                          17986          18125         
196          0.6        1798.6       2.0X {code}
Java 11

 
{code:java}
OpenJDK 64-Bit Server VM 11.0.16+8-LTS on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
Test sum:                                 Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative

Use multiple parameters method                    12253          12286          
46          0.8        1225.3       1.0X
Use TestParameters create new                     13644          13665          
30          0.7        1364.4       0.9X
Use TestParameters reuse                          13188          13219          
44          0.8        1318.8       0.9X {code}
Java 17

 
{code:java}
OpenJDK 64-Bit Server VM 17.0.4+8-LTS on Linux 5.15.0-1019-azure
Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
Test sum:                                 Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative

Use multiple parameters method                    14044          14128         
119          0.7        1404.4       1.0X
Use TestParameters create new                     16174          16289         
162          0.6        1617.4       0.9X
Use TestParameters reuse                          15633          15638          
 8          0.6        1563.3       0.9X {code}
From the test results, encapsulating and reusing a specific parameter type only 
alleviates the problem on Java 8; running the program on Java 11 or Java 17 
seems to be a simpler and more effective way.

 

So for the current issue, I suggest upgrading the Java runtime environment to 
solve the problem.

 

I have uploaded the test program as an attachment.

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: TestApiBenchmark.scala, TestApis.java, 
> TestParameters.java
>
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach{ cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => 
> s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 
> 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "true",
>   "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "false",
>   "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> 

[jira] [Updated] (SPARK-40303) The performance will be worse after codegen

2022-09-13 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-40303:
-
Attachment: TestApiBenchmark.scala
TestApis.java
TestParameters.java

> The performance will be worse after codegen
> ---
>
> Key: SPARK-40303
> URL: https://issues.apache.org/jira/browse/SPARK-40303
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: TestApiBenchmark.scala, TestApis.java, 
> TestParameters.java
>
>
> {code:scala}
> import org.apache.spark.benchmark.Benchmark
> val dir = "/tmp/spark/benchmark"
> val N = 200
> val columns = Range(0, 100).map(i => s"id % $i AS id$i")
> spark.range(N).selectExpr(columns: _*).write.mode("Overwrite").parquet(dir)
> // Seq(1, 2, 5, 10, 15, 25, 40, 60, 100)
> Seq(60).foreach{ cnt =>
>   val selectExps = columns.take(cnt).map(_.split(" ").last).map(c => 
> s"count(distinct $c)")
>   val benchmark = new Benchmark("Benchmark count distinct", N, minNumIters = 
> 1)
>   benchmark.addCase(s"$cnt count distinct with codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "true",
>   "spark.sql.codegen.factoryMode" -> "FALLBACK") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.addCase(s"$cnt count distinct without codegen") { _ =>
> withSQLConf(
>   "spark.sql.codegen.wholeStage" -> "false",
>   "spark.sql.codegen.factoryMode" -> "NO_CODEGEN") {
>   spark.read.parquet(dir).selectExpr(selectExps: 
> _*).write.format("noop").mode("Overwrite").save()
> }
>   }
>   benchmark.run()
> }
> {code}
> {noformat}
> Java HotSpot(TM) 64-Bit Server VM 1.8.0_281-b09 on Mac OS X 10.15.7
> Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
> Benchmark count distinct: Best Time(ms)   Avg Time(ms)   
> Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
> 
> 60 count distinct with codegen   628146 628146
>0  0.0  314072.8   1.0X
> 60 count distinct without codegen147635 147635
>0  0.0   73817.5   4.3X
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40414) Fix PythonArrowInput and PythonArrowOutput to be more generic to handle complicated type/data

2022-09-13 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-40414.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37864
[https://github.com/apache/spark/pull/37864]

> Fix PythonArrowInput and PythonArrowOutput to be more generic to handle 
> complicated type/data
> -
>
> Key: SPARK-40414
> URL: https://issues.apache.org/jira/browse/SPARK-40414
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.4.0
>
>
> During the work on flatMapGroupsWithState in PySpark, we figured out that we 
> cannot reuse PythonArrowInput and PythonArrowOutput, because both traits are 
> tied to one specific kind of input data (rows) and output data.
> To reuse the implementations, we should make these traits more general so 
> they can handle other kinds of data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40414) Fix PythonArrowInput and PythonArrowOutput to be more generic to handle complicated type/data

2022-09-13 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-40414:


Assignee: Jungtaek Lim

> Fix PythonArrowInput and PythonArrowOutput to be more generic to handle 
> complicated type/data
> -
>
> Key: SPARK-40414
> URL: https://issues.apache.org/jira/browse/SPARK-40414
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
>
> During the work on flatMapGroupsWithState in PySpark, we figured out that we 
> cannot reuse PythonArrowInput and PythonArrowOutput, because both traits are 
> tied to one specific kind of input data (rows) and output data.
> To reuse the implementations, we should make these traits more general so 
> they can handle other kinds of data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603880#comment-17603880
 ] 

Apache Spark commented on SPARK-40419:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/37873

> Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
> -
>
> Key: SPARK-40419
> URL: https://issues.apache.org/jira/browse/SPARK-40419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We ported Python UDF, Scala UDF and Scalar Pandas UDF into SQL test cases 
> from SPARK-27921, but Grouped Aggregate Pandas UDF is not tested from SQL at 
> all.
> We should also leverage this to test pandas aggregate UDFs too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40419:


Assignee: (was: Apache Spark)

> Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
> -
>
> Key: SPARK-40419
> URL: https://issues.apache.org/jira/browse/SPARK-40419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We ported Python UDF, Scala UDF and Scalar Pandas UDF into SQL test cases 
> from SPARK-27921, but Grouped Aggregate Pandas UDF is not tested from SQL at 
> all.
> We should also leverage this to test pandas aggregate UDFs too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40419:


Assignee: Apache Spark

> Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
> -
>
> Key: SPARK-40419
> URL: https://issues.apache.org/jira/browse/SPARK-40419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> We ported Python UDF, Scala UDF and Scalar Pandas UDF into SQL test cases 
> from SPARK-27921, but Grouped Aggregate Pandas UDF is not tested from SQL at 
> all.
> We should also leverage this to test pandas aggregate UDFs too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40419) Integrate Grouped Aggregate Pandas UDFs into *.sql test cases

2022-09-13 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-40419:

Summary: Integrate Grouped Aggregate Pandas UDFs into *.sql test cases  
(was: Integrate aggregate pandas UDFs into *.sql test cases)

> Integrate Grouped Aggregate Pandas UDFs into *.sql test cases
> -
>
> Key: SPARK-40419
> URL: https://issues.apache.org/jira/browse/SPARK-40419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We ported Python UDF, Scala UDF and Scalar Pandas UDF into SQL test cases 
> from SPARK-27921, but Grouped Aggregate Pandas UDF is not tested from SQL at 
> all.
> We should also leverage this to test pandas aggregate UDFs too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40419) Integrate aggregate pandas UDFs into *.sql test cases

2022-09-13 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-40419:
---

 Summary: Integrate aggregate pandas UDFs into *.sql test cases
 Key: SPARK-40419
 URL: https://issues.apache.org/jira/browse/SPARK-40419
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 3.4.0
Reporter: Haejoon Lee


We ported Python UDF, Scala UDF and Scalar Pandas UDF into SQL test cases from 
SPARK-27921, but Grouped Aggregate Pandas UDF is not tested from SQL at all.

We should also leverage this to test pandas aggregate UDFs too.
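For reference, a grouped aggregate pandas UDF can be registered and invoked from SQL roughly as in the hedged sketch below (the names and the temp view are illustrative, not the PR's actual test fixtures, and `spark` is an active SparkSession):

{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def pandas_mean(v: pd.Series) -> float:   # Series -> scalar => grouped aggregate UDF
    return v.mean()

spark.udf.register("pandas_mean", pandas_mean)
spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"]) \
    .createOrReplaceTempView("tbl")
spark.sql("SELECT id, pandas_mean(v) FROM tbl GROUP BY id").show()
{code}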



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40390) Spark Master UI - SSL implementation

2022-09-13 Thread Rhajvijay Manoharan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603867#comment-17603867
 ] 

Rhajvijay Manoharan commented on SPARK-40390:
-

[~hyukjin.kwon] We will try using the 3.X version. Thank you.

> Spark Master UI - SSL implementation
> 
>
> Key: SPARK-40390
> URL: https://issues.apache.org/jira/browse/SPARK-40390
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8
>Reporter: Rhajvijay Manoharan
>Priority: Major
>
> While trying to enable SSL for the Master UI, we got the error below with 
> spark-core (spark-core_2.11-{*}2.4.8{*}.jar):
> 22/09/08 03:45:03 ERROR MasterWebUI: Failed to bind MasterWebUI
> java.lang.IllegalStateException: KeyStores with multiple certificates are not 
> supported on the base class 
> org.spark_project.jetty.util.ssl.SslContextFactory. (Use 
> org.spark_project.jetty.util.ssl.SslContextFactory$Server or 
> org.spark_project.jetty.util.ssl.SslContextFactory$Client instead)
>         at 
> org.spark_project.jetty.util.ssl.SslContextFactory.newSniX509ExtendedKeyManager(SslContextFactory.java:1283)
>         at 
> org.spark_project.jetty.util.ssl.SslContextFactory.getKeyManagers(SslContextFactory.java:1265)
>         ...
>         at java.lang.Thread.run(Thread.java:745)
>  
> With spark-core (spark-core_2.11-{*}2.4.3{*}.jar), however, we do not see 
> this issue.
> Please suggest how we can mitigate this issue on the latest 2.4.x release, 
> spark-core_2.11-2.4.8.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40390) Spark Master UI - SSL implementation

2022-09-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603862#comment-17603862
 ] 

Hyukjin Kwon commented on SPARK-40390:
--

Is this still an issue in Spark 3.1+? Spark 2.x is EOL

> Spark Master UI - SSL implementation
> 
>
> Key: SPARK-40390
> URL: https://issues.apache.org/jira/browse/SPARK-40390
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8
>Reporter: Rhajvijay Manoharan
>Priority: Major
>
> While trying to enable SSL for the Master UI, we got the error below with 
> spark-core (spark-core_2.11-{*}2.4.8{*}.jar):
> 22/09/08 03:45:03 ERROR MasterWebUI: Failed to bind MasterWebUI
> java.lang.IllegalStateException: KeyStores with multiple certificates are not 
> supported on the base class 
> org.spark_project.jetty.util.ssl.SslContextFactory. (Use 
> org.spark_project.jetty.util.ssl.SslContextFactory$Server or 
> org.spark_project.jetty.util.ssl.SslContextFactory$Client instead)
>         at 
> org.spark_project.jetty.util.ssl.SslContextFactory.newSniX509ExtendedKeyManager(SslContextFactory.java:1283)
>         at 
> org.spark_project.jetty.util.ssl.SslContextFactory.getKeyManagers(SslContextFactory.java:1265)
>         ...
>         at java.lang.Thread.run(Thread.java:745)
>  
> With spark-core (spark-core_2.11-{*}2.4.3{*}.jar), however, we do not see 
> this issue.
> Please suggest how we can mitigate this issue on the latest 2.4.x release, 
> spark-core_2.11-2.4.8.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40405) sparksql throws exception while reading by jdbc

2022-09-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603861#comment-17603861
 ] 

Hyukjin Kwon commented on SPARK-40405:
--

[~ghsea] This looks like a classpath problem. How do you reproduce this issue?
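One quick way to check whether the runtime really matches the expected Spark version (and which extra jars were pulled onto the classpath) is sketched below; it is illustrative only, with `spark` being the active session:

{code:python}
# Illustrative diagnostic: confirm the Spark version actually running and the
# user-supplied jars/packages that might bring in Spark-2.x-era classes.
print(spark.version)                                            # expect 3.2.1
print(spark.sparkContext.getConf().get("spark.jars", ""))
print(spark.sparkContext.getConf().get("spark.jars.packages", ""))
{code}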

> sparksql throws exception while reading by jdbc
> ---
>
> Key: SPARK-40405
> URL: https://issues.apache.org/jira/browse/SPARK-40405
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: ghsea
>Priority: Major
>
> The sample 
> code (https://spark.apache.org/docs/3.2.1/sql-data-sources-jdbc.html) throws 
> an exception while reading data via JDBC:
> Dataset<Row> jdbcDF = spark.read()
>   .format("jdbc")
>   .option("url", "jdbc:postgresql:dbserver")
>   .option("dbtable", "schema.tablename")
>   .option("user", "username")
>   .option("password", "password")
>   .load();
> Exception:
> java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/DataSourceV2
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
>   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:406)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:406)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
>   at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
>   at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
>   ... 47 elided
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.sql.sources.v2.DataSourceV2
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
>   ... 83 more



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40418) Increase default initialNumPartitions to 10

2022-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40418:
-
Component/s: SQL

> Increase default initialNumPartitions to 10
> ---
>
> Key: SPARK-40418
> URL: https://issues.apache.org/jira/browse/SPARK-40418
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.4.0
>Reporter: Ziqi Liu
>Priority: Major
>
> This is actually a follow-up to SPARK-40211.
> The previous default value for initialNumPartitions is 1, which is way too 
> small. Changing it to 10 might be a reasonable middle-ground trade-off and 
> will be beneficial in most cases (unless the partition count is very small, 
> but in that case neither initialNumPartitions nor scaleUpFactor has a 
> significant effect).
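Roughly speaking, the limit/take path scans a first batch of initialNumPartitions partitions and, if it has not yet collected enough rows, launches another job over a batch that grows by up to scaleUpFactor each round. A simplified, illustrative Python sketch of that loop (not Spark's implementation) shows why a default of 1 can cost several extra jobs on sparse data:

{code:python}
# Illustrative only: count how many jobs (rounds) a take()-style scan needs
# when each round scans the next batch of partitions.
def rounds_needed(total_partitions, initial_num_partitions=1, scale_up_factor=4,
                  rows_per_partition=0, rows_needed=1):
    scanned, collected, rounds, batch = 0, 0, 0, initial_num_partitions
    while collected < rows_needed and scanned < total_partitions:
        batch = min(batch, total_partitions - scanned)
        scanned += batch
        collected += batch * rows_per_partition
        rounds += 1
        batch = scanned * scale_up_factor   # next batch grows multiplicatively
    return rounds

# Worst case (no matching rows at all): a larger starting batch saves the
# first few tiny jobs.
print(rounds_needed(1000, initial_num_partitions=1))    # 6 rounds
print(rounds_needed(1000, initial_num_partitions=10))   # 4 rounds
{code}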



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40405) sparksql throws exception while reading by jdbc

2022-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40405:
-
Priority: Major  (was: Critical)

> sparksql throws exception while reading by jdbc
> ---
>
> Key: SPARK-40405
> URL: https://issues.apache.org/jira/browse/SPARK-40405
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: ghsea
>Priority: Major
>
> The sample 
> code (https://spark.apache.org/docs/3.2.1/sql-data-sources-jdbc.html) throws 
> an exception while reading data via JDBC:
> Dataset<Row> jdbcDF = spark.read()
>   .format("jdbc")
>   .option("url", "jdbc:postgresql:dbserver")
>   .option("dbtable", "schema.tablename")
>   .option("user", "username")
>   .option("password", "password")
>   .load();
> Exception:
> java.lang.NoClassDefFoundError: org/apache/spark/sql/sources/v2/DataSourceV2
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
>   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:406)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:406)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
>   at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
>   at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
>   at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
>   at scala.collection.Iterator.foreach(Iterator.scala:943)
>   at scala.collection.Iterator.foreach$(Iterator.scala:943)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
>   ... 47 elided
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.sql.sources.v2.DataSourceV2
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
>   ... 83 more



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40403) Negative size in error message when unsafe array is too big

2022-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40403:


Assignee: Bruce Robbins

> Negative size in error message when unsafe array is too big
> ---
>
> Key: SPARK-40403
> URL: https://issues.apache.org/jira/browse/SPARK-40403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
>
> When initializing an overly large unsafe array via 
> {{UnsafeArrayWriter#initialize}}, {{BufferHolder#grow}} may report an error 
> message with a negative size, e.g.:
> {noformat}
> java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 
> -2115263656 because the size is negative
> {noformat}
> (Note: This is not related to SPARK-39608, as far as I can tell, despite 
> having the same symptom).
> When calculating the initial size in bytes needed for the array, 
> {{UnsafeArrayWriter#initialize}} uses an int expression, which can overflow. 
> The initialize method then passes the negative size to {{BufferHolder#grow}}, 
> which complains about the negative size.
> Example (the following will run just fine on a 16GB laptop, despite the large 
> driver size setting):
> {noformat}
> bin/spark-sql --driver-memory 22g --master "local[1]"
> create or replace temp view data1 as
> select 0 as key, id as val
> from range(0, 268271216);
> create or replace temp view data2 as
> select key as lkey, collect_list(val) as bigarray
> from data1
> group by key;
> -- the below cache forces Spark to create unsafe rows
> cache lazy table data2;
> select count(*) from data2;
> {noformat}
> After a few minutes, {{BufferHolder#grow}} will throw the following exception:
> {noformat}
> java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 
> -2115263656 because the size is negative
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter.initialize(UnsafeArrayWriter.java:61)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Collect.serialize(collect.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Collect.serialize(collect.scala:37)
> {noformat}
> This query was going to fail anyway, but the message makes it look like a 
> bug in Spark rather than a user problem. {{UnsafeArrayWriter#initialize}} 
> should calculate using a long expression and fail if the size exceeds 
> {{Integer.MAX_VALUE}}, showing the actual initial size in the error message.
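As a sanity check on the numbers above: assuming the usual unsafe-array layout of an 8-byte length word plus an 8-byte-per-64-elements null bitmap, followed by 8 bytes per (long) element, the expected allocation for 268,271,216 elements overflows a 32-bit int to exactly the reported value. An assumption-laden illustrative computation, not Spark code:

{code:python}
num_elements = 268271216
header_bytes = 8 + ((num_elements + 63) // 64) * 8   # length word + null bitmap (assumed layout)
data_bytes = 8 * num_elements                        # 8-byte long elements
total = header_bytes + data_bytes                    # 2179703640 > Integer.MAX_VALUE (2147483647)

# The same sum evaluated in wrapping 32-bit int arithmetic:
as_int32 = (total + 2**31) % 2**32 - 2**31
print(total, as_int32)                               # 2179703640 -2115263656
{code}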



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40403) Negative size in error message when unsafe array is too big

2022-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40403.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37852
[https://github.com/apache/spark/pull/37852]

> Negative size in error message when unsafe array is too big
> ---
>
> Key: SPARK-40403
> URL: https://issues.apache.org/jira/browse/SPARK-40403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 3.4.0
>
>
> When initializing an overly large unsafe array via 
> {{UnsafeArrayWriter#initialize}}, {{BufferHolder#grow}} may report an error 
> message with a negative size, e.g.:
> {noformat}
> java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 
> -2115263656 because the size is negative
> {noformat}
> (Note: This is not related to SPARK-39608, as far as I can tell, despite 
> having the same symptom).
> When calculating the initial size in bytes needed for the array, 
> {{UnsafeArrayWriter#initialize}} uses an int expression, which can overflow. 
> The initialize method then passes the negative size to {{BufferHolder#grow}}, 
> which complains about the negative size.
> Example (the following will run just fine on a 16GB laptop, despite the large 
> driver size setting):
> {noformat}
> bin/spark-sql --driver-memory 22g --master "local[1]"
> create or replace temp view data1 as
> select 0 as key, id as val
> from range(0, 268271216);
> create or replace temp view data2 as
> select key as lkey, collect_list(val) as bigarray
> from data1
> group by key;
> -- the below cache forces Spark to create unsafe rows
> cache lazy table data2;
> select count(*) from data2;
> {noformat}
> After a few minutes, {{BufferHolder#grow}} will throw the following exception:
> {noformat}
> java.lang.IllegalArgumentException: Cannot grow BufferHolder by size 
> -2115263656 because the size is negative
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter.initialize(UnsafeArrayWriter.java:61)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Collect.serialize(collect.scala:73)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.Collect.serialize(collect.scala:37)
> {noformat}
> This query was going to fail anyway, but the message makes it look like a 
> bug in Spark rather than a user problem. {{UnsafeArrayWriter#initialize}} 
> should calculate using a long expression and fail if the size exceeds 
> {{Integer.MAX_VALUE}}, showing the actual initial size in the error message.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40286) Load Data from S3 deletes data source file

2022-09-13 Thread Drew (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603826#comment-17603826
 ] 

Drew commented on SPARK-40286:
--

Hi [~ste...@apache.org],

Yeah, is there anything significant there that I should be looking for? When 
doing this with the same criteria, I get the same results, and nothing in the 
logs raises any suspicion to me.

> Load Data from S3 deletes data source file
> --
>
> Key: SPARK-40286
> URL: https://issues.apache.org/jira/browse/SPARK-40286
> Project: Spark
>  Issue Type: Question
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello, 
> I'm using Spark to [load 
> data|https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html] into 
> a Hive table through PySpark, and when I load data from a path in Amazon S3, 
> the original file gets wiped from the directory. The file is found and does 
> populate the table with data. I also tried adding the `LOCAL` clause, but 
> that throws an error when looking for the file. The documentation doesn't 
> explicitly state that this is the intended behavior.
> Thanks in advance!
> {code:java}
> spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
> spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE 
> src"){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40287) Load Data using Spark by a single partition moves entire dataset under same location in S3

2022-09-13 Thread Drew (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603819#comment-17603819
 ] 

Drew commented on SPARK-40287:
--

Hey [~ste...@apache.org], 

Yes, I get the same behavior with these criteria as well. It looks like the 
data is moved to the new table location again.

> Load Data using Spark by a single partition moves entire dataset under same 
> location in S3
> --
>
> Key: SPARK-40287
> URL: https://issues.apache.org/jira/browse/SPARK-40287
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.2.1
>Reporter: Drew
>Priority: Major
>
> Hello,
> I'm experiencing an issue in PySpark when creating a Hive table and loading 
> data into it. I'm using an Amazon S3 bucket as the data location, creating a 
> Parquet table, and trying to load data into that table for a single 
> partition, and I'm seeing some odd behavior. When I select the S3 location 
> of a Parquet dataset to load into my table, all of the data is moved into 
> the location specified in my CREATE TABLE command, including the partitions 
> I didn't specify in the LOAD DATA command. For example:
> {code:java}
> # create a data frame in pyspark with partitions
> df = spark.createDataFrame([("a", 1, "x"), ("b", 2, "y"), ("c", 3, "y")], 
> ["c1", "c2", "p"])
> # save it to S3
> df.write.format("parquet").mode("overwrite").partitionBy("p").save("s3://bucket/data/")
> {code}
> In the current state S3 should have a new folder `data` with two folders 
> which contain a parquet file in each partition. 
>   
>  - s3://bucket/data/p=x/
>     - part-1.snappy.parquet
>  - s3://bucket/data/p=y/
>     - part-2.snappy.parquet
>     - part-3.snappy.parquet
>  
> {code:java}
> # create new table
> spark.sql("create table src (c1 string,c2 int) PARTITIONED BY (p string) 
> STORED AS parquet LOCATION 's3://bucket/new/'")
> # load the saved table data from s3 specifying single partition value x
> spark.sql("LOAD DATA INPATH 's3://bucket/data/'INTO TABLE src PARTITION 
> (p='x')")
> spark.sql("select * from src").show()
> # output: 
> # +---+---+---+
> # | c1| c2|  p|
> # +---+---+---+
> # +---+---+---+
> {code}
> After running the `LOAD DATA` command and looking at the table, I'm left with 
> no data loaded in. When checking S3, the source data we saved earlier has 
> been moved under `s3://bucket/new/`; oddly enough, it also brought over the 
> other partitions along with it (directory structure listed below). 
> - s3://bucket/new/
>     - p=x/
>         - p=x/
>             - part-1.snappy.parquet
>         - p=y/
>             - part-2.snappy.parquet
>             - part-3.snappy.parquet
> Is this the intended behavior when loading data from a partitioned Parquet 
> file? Is the source file supposed to be moved/deleted from the source 
> directory? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-40309) Introduce sql_conf context manager for pyspark.sql

2022-09-13 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng deleted SPARK-40309:
-


> Introduce sql_conf context manager for pyspark.sql
> --
>
> Key: SPARK-40309
> URL: https://issues.apache.org/jira/browse/SPARK-40309
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Xinrong Meng
>Priority: Major
>  Labels: release-notes
>
> That would simplify the control of Spark SQL configuration as below
> from
> {code:java}
> original_value = spark.conf.get("key")
> spark.conf.set("key", "value")
> ...
> spark.conf.set("key", original_value){code}
> to
> {code:java}
> with sql_conf({"key": "value"}):
> ...
> {code}
> [Here|https://github.com/apache/spark/blob/master/python/pyspark/pandas/utils.py#L490]
>  is such a context manager in the Pandas API on Spark.
> We should introduce one in `pyspark.sql` and deduplicate code if possible.
>  
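A minimal sketch of such a helper (hypothetical; the pandas-on-Spark utility linked above is the existing reference implementation):

{code:python}
from contextlib import contextmanager

@contextmanager
def sql_conf(pairs, *, spark):
    keys = list(pairs)
    # Remember the current values so they can be restored afterwards.
    old_values = [spark.conf.get(key, None) for key in keys]
    try:
        for key, value in pairs.items():
            spark.conf.set(key, value)
        yield
    finally:
        for key, old in zip(keys, old_values):
            if old is None:
                spark.conf.unset(key)   # the key was not set before
            else:
                spark.conf.set(key, old)

# Usage:
# with sql_conf({"spark.sql.shuffle.partitions": "5"}, spark=spark):
#     ...
{code}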



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40384:


Assignee: Yikun Jiang  (was: Apache Spark)

> Do base image real in time build only when infra dockerfile is changed
> --
>
> Key: SPARK-40384
> URL: https://issues.apache.org/jira/browse/SPARK-40384
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40384:


Assignee: Apache Spark  (was: Yikun Jiang)

> Do base image real in time build only when infra dockerfile is changed
> --
>
> Key: SPARK-40384
> URL: https://issues.apache.org/jira/browse/SPARK-40384
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed

2022-09-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603798#comment-17603798
 ] 

Hyukjin Kwon commented on SPARK-40384:
--

Reverted at 
https://github.com/apache/spark/commit/4dd153055fdd7ab0f38cd022e653e4ecd9404e54

> Do base image real in time build only when infra dockerfile is changed
> --
>
> Key: SPARK-40384
> URL: https://issues.apache.org/jira/browse/SPARK-40384
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed

2022-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-40384:
--

> Do base image real in time build only when infra dockerfile is changed
> --
>
> Key: SPARK-40384
> URL: https://issues.apache.org/jira/browse/SPARK-40384
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed

2022-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40384:
-
Fix Version/s: (was: 3.4.0)

> Do base image real in time build only when infra dockerfile is changed
> --
>
> Key: SPARK-40384
> URL: https://issues.apache.org/jira/browse/SPARK-40384
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40417) Use YuniKorn v1.1+

2022-09-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40417.
---
Fix Version/s: 3.3.1
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37872
[https://github.com/apache/spark/pull/37872]

> Use YuniKorn v1.1+
> --
>
> Key: SPARK-40417
> URL: https://issues.apache.org/jira/browse/SPARK-40417
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Kubernetes
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
>
> YuniKorn 1.1.0 starts to support multi-arch officially.
> [https://yunikorn.apache.org/release-announce/1.1.0]
> {code:java}
> $ docker inspect apache/yunikorn:scheduler-1.0.0 | grep Architecture
>         "Architecture": "amd64",
> $ docker inspect apache/yunikorn:scheduler-1.1.0 | grep Architecture
>         "Architecture": "arm64", {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40417) Use YuniKorn v1.1+

2022-09-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-40417:
-

Assignee: Dongjoon Hyun

> Use YuniKorn v1.1+
> --
>
> Key: SPARK-40417
> URL: https://issues.apache.org/jira/browse/SPARK-40417
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Kubernetes
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> YuniKorn 1.1.0 starts to support multi-arch officially.
> [https://yunikorn.apache.org/release-announce/1.1.0]
> {code:java}
> $ docker inspect apache/yunikorn:scheduler-1.0.0 | grep Architecture
>         "Architecture": "amd64",
> $ docker inspect apache/yunikorn:scheduler-1.1.0 | grep Architecture
>         "Architecture": "arm64", {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-40362:
-

Assignee: Peter Toth

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Asif
>Assignee: Peter Toth
>Priority: Major
>  Labels: spark-sql
> Fix For: 3.3.1
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code, which is now done in two stages, canonicalization 
> involving commutative operators is broken if they are subexpressions of 
> certain types of expressions which override precanonicalize, for example 
> BinaryComparison.
> Consider the following expression:
> a + b > 10
>          GT
>             |
> a + b          10
> The BinaryComparison operator, in precanonicalize, first precanonicalizes its 
> children and then may swap its operands based on the left/right hashCode inequality.
> Let's say Add(a + b).hashCode is > 10.hashCode; as a result GT is converted 
> to LT.
> But if the same tree is created as
>            GT
>             |
>  b + a      10
> the hashCode of Add(b, a) is not the same as Add(a, b), thus it is possible that 
> for this tree
> Add(b + a).hashCode is < 10.hashCode, in which case GT remains as is.
> Thus two similar trees result in different canonicalizations, one having GT and 
> the other having LT.
>  
> The problem occurs because for commutative expressions the canonicalization 
> normalizes the expression with a consistent hashCode, which is not the case with 
> precanonicalize, as the hashCodes of a commutative expression's precanonicalized 
> and canonicalized forms are different.
>  
>  
> The test:
> {quote}test("bug X") {
>   val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>   val y = tr1.where('a.attr + 'c.attr > 10).analyze
>   val fullCond = y.asInstanceOf[Filter].condition.clone()
>   val addExpr = (fullCond match {
>     // ... (match arms not preserved in the original description)
>   }).clone().asInstanceOf[Add]
>   val canonicalizedFullCond = fullCond.canonicalized
>   // swap the operands of Add
>   val newAddExpr = Add(addExpr.right, addExpr.left)
>   // build a new condition which is the same as the previous one, but with the
>   // operands of Add reversed
>   val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized
>   assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix which I propose is that for commutative expressions, precanonicalize 
> should be overridden and Canonicalize.reorderCommutativeOperators should be 
> invoked on the expression there instead of in canonicalize; effectively, for 
> commutative operators (Add, Or, Multiply, And, etc.) canonicalize and 
> precanonicalize should be the same.
> PR:
> [https://github.com/apache/spark/pull/37824]
>  
>  
> I am also trying a better fix, whereby the idea is that for commutative 
> expressions the murmur hashCode is calculated using unorderedHash so that 
> it is order-independent (i.e. symmetric).
> The above approach works fine, but in the case of Least & Greatest, the 
> Product's element is a Seq, and that messes with the consistency of hashCode.
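
To make the unorderedHash idea above concrete, here is a minimal, self-contained 
sketch (the classes below are toy stand-ins, not Spark's Catalyst expressions) 
showing how an order-independent hash makes Add(a, b) and Add(b, a) hash 
identically, so any comparison that orders operands by hashCode treats both 
trees consistently:

{code:scala}
import scala.util.hashing.MurmurHash3

// Toy expression tree, only to illustrate the hashing idea.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr {
  // Order-independent (symmetric) hash over the commutative children.
  override def hashCode(): Int =
    MurmurHash3.unorderedHash(Seq(left, right), "Add".hashCode)
}

object CommutativeHashDemo extends App {
  val a = Attr("a")
  val b = Attr("b")
  // Both lines print the same value.
  println(Add(a, b).hashCode())
  println(Add(b, a).hashCode())
}
{code}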



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40362.
---
Fix Version/s: 3.3.1
   Resolution: Fixed

Issue resolved by pull request 37866
[https://github.com/apache/spark/pull/37866]

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Asif
>Priority: Major
>  Labels: spark-sql
> Fix For: 3.3.1
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code, which is now done in two stages, canonicalization 
> involving commutative operators is broken if they are subexpressions of 
> certain types of expressions which override precanonicalize, for example 
> BinaryComparison.
> Consider the following expression:
> a + b > 10
>          GT
>             |
> a + b          10
> The BinaryComparison operator, in precanonicalize, first precanonicalizes its 
> children and then may swap its operands based on the left/right hashCode inequality.
> Let's say Add(a + b).hashCode is > 10.hashCode; as a result GT is converted 
> to LT.
> But if the same tree is created as
>            GT
>             |
>  b + a      10
> the hashCode of Add(b, a) is not the same as Add(a, b), thus it is possible that 
> for this tree
> Add(b + a).hashCode is < 10.hashCode, in which case GT remains as is.
> Thus two similar trees result in different canonicalizations, one having GT and 
> the other having LT.
>  
> The problem occurs because for commutative expressions the canonicalization 
> normalizes the expression with a consistent hashCode, which is not the case with 
> precanonicalize, as the hashCodes of a commutative expression's precanonicalized 
> and canonicalized forms are different.
>  
>  
> The test:
> {quote}test("bug X") {
>   val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>   val y = tr1.where('a.attr + 'c.attr > 10).analyze
>   val fullCond = y.asInstanceOf[Filter].condition.clone()
>   val addExpr = (fullCond match {
>     // ... (match arms not preserved in the original description)
>   }).clone().asInstanceOf[Add]
>   val canonicalizedFullCond = fullCond.canonicalized
>   // swap the operands of Add
>   val newAddExpr = Add(addExpr.right, addExpr.left)
>   // build a new condition which is the same as the previous one, but with the
>   // operands of Add reversed
>   val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized
>   assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix which I propose is that for commutative expressions, precanonicalize 
> should be overridden and Canonicalize.reorderCommutativeOperators should be 
> invoked on the expression there instead of in canonicalize; effectively, for 
> commutative operators (Add, Or, Multiply, And, etc.) canonicalize and 
> precanonicalize should be the same.
> PR:
> [https://github.com/apache/spark/pull/37824]
>  
>  
> I am also trying a better fix, whereby the idea is that for commutative 
> expressions the murmur hashCode is calculated using unorderedHash so that 
> it is order-independent (i.e. symmetric).
> The above approach works fine, but in the case of Least & Greatest, the 
> Product's element is a Seq, and that messes with the consistency of hashCode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40418) Increase default initialNumPartitions to 10

2022-09-13 Thread Ziqi Liu (Jira)
Ziqi Liu created SPARK-40418:


 Summary: Increase default initialNumPartitions to 10
 Key: SPARK-40418
 URL: https://issues.apache.org/jira/browse/SPARK-40418
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.4.0
Reporter: Ziqi Liu


It's actually a follow-up to SPARK-40211.

The previous default value for initialNumPartitions is 1, which is way too 
small. Changing it to 10 should be a reasonable middle-ground trade-off and 
will be beneficial in most cases (unless the partition count is very small, 
but in that case neither initialNumPartitions nor scaleUpFactor has a 
significant effect).
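
The sketch below (not Spark's actual implementation; the method and parameter 
names are illustrative) simulates the worst case of the scale-up strategy behind 
take()-style collection: scan initialNumPartitions partitions first, then grow 
the batch by scaleUpFactor until enough rows are found or all partitions are 
covered. It shows why a default of 1 can cost several extra job rounds compared 
to 10:

{code:scala}
object TakeScanSimulation {
  // Returns how many partitions each round scans in the worst case
  // (i.e. when the limit is never satisfied early).
  def partitionsScannedPerRound(
      totalPartitions: Int,
      initialNumPartitions: Int,
      scaleUpFactor: Int): Seq[Int] = {
    val rounds = scala.collection.mutable.ArrayBuffer.empty[Int]
    var scanned = 0
    var batch = initialNumPartitions
    while (scanned < totalPartitions) {
      val thisRound = math.min(batch, totalPartitions - scanned)
      rounds += thisRound
      scanned += thisRound
      batch *= scaleUpFactor
    }
    rounds.toSeq
  }

  def main(args: Array[String]): Unit = {
    // 200 partitions, scaleUpFactor = 4:
    // initial = 1  -> rounds of 1, 4, 16, 64, 115  (5 jobs)
    // initial = 10 -> rounds of 10, 40, 150        (3 jobs)
    println(partitionsScannedPerRound(200, 1, 4))
    println(partitionsScannedPerRound(200, 10, 4))
  }
}
{code}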



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40417) Use YuniKorn v1.1+

2022-09-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40417:
--
Issue Type: Documentation  (was: Improvement)

> Use YuniKorn v1.1+
> --
>
> Key: SPARK-40417
> URL: https://issues.apache.org/jira/browse/SPARK-40417
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Kubernetes
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> YuniKorn 1.1.0 starts to support multi-arch officially.
> [https://yunikorn.apache.org/release-announce/1.1.0]
> {code:java}
> $ docker inspect apache/yunikorn:scheduler-1.0.0 | grep Architecture
>         "Architecture": "amd64",
> $ docker inspect apache/yunikorn:scheduler-1.1.0 | grep Architecture
>         "Architecture": "arm64", {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40417) Use YuniKorn v1.1+

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40417:


Assignee: (was: Apache Spark)

> Use YuniKorn v1.1+
> --
>
> Key: SPARK-40417
> URL: https://issues.apache.org/jira/browse/SPARK-40417
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> YuniKorn 1.1.0 starts to support multi-arch officially.
> [https://yunikorn.apache.org/release-announce/1.1.0]
> {code:java}
> $ docker inspect apache/yunikorn:scheduler-1.0.0 | grep Architecture
>         "Architecture": "amd64",
> $ docker inspect apache/yunikorn:scheduler-1.1.0 | grep Architecture
>         "Architecture": "arm64", {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40417) Use YuniKorn v1.1+

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40417:


Assignee: Apache Spark

> Use YuniKorn v1.1+
> --
>
> Key: SPARK-40417
> URL: https://issues.apache.org/jira/browse/SPARK-40417
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> YuniKorn 1.1.0 starts to support multi-arch officially.
> [https://yunikorn.apache.org/release-announce/1.1.0]
> {code:java}
> $ docker inspect apache/yunikorn:scheduler-1.0.0 | grep Architecture
>         "Architecture": "amd64",
> $ docker inspect apache/yunikorn:scheduler-1.1.0 | grep Architecture
>         "Architecture": "arm64", {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40417) Use YuniKorn v1.1+

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603710#comment-17603710
 ] 

Apache Spark commented on SPARK-40417:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37872

> Use YuniKorn v1.1+
> --
>
> Key: SPARK-40417
> URL: https://issues.apache.org/jira/browse/SPARK-40417
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes
>Affects Versions: 3.4.0, 3.3.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> YuniKorn 1.1.0 starts to support multi-arch officially.
> [https://yunikorn.apache.org/release-announce/1.1.0]
> {code:java}
> $ docker inspect apache/yunikorn:scheduler-1.0.0 | grep Architecture
>         "Architecture": "amd64",
> $ docker inspect apache/yunikorn:scheduler-1.1.0 | grep Architecture
>         "Architecture": "arm64", {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40417) Use YuniKorn v1.1+

2022-09-13 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-40417:
-

 Summary: Use YuniKorn v1.1+
 Key: SPARK-40417
 URL: https://issues.apache.org/jira/browse/SPARK-40417
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Kubernetes
Affects Versions: 3.4.0, 3.3.1
Reporter: Dongjoon Hyun


YuniKorn 1.1.0 starts to support multi-arch officially.

[https://yunikorn.apache.org/release-announce/1.1.0]
{code:java}
$ docker inspect apache/yunikorn:scheduler-1.0.0 | grep Architecture
        "Architecture": "amd64",
$ docker inspect apache/yunikorn:scheduler-1.1.0 | grep Architecture
        "Architecture": "arm64", {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40416) Add error classes for subquery expression CheckAnalysis failures

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40416:


Assignee: (was: Apache Spark)

> Add error classes for subquery expression CheckAnalysis failures
> 
>
> Key: SPARK-40416
> URL: https://issues.apache.org/jira/browse/SPARK-40416
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40416) Add error classes for subquery expression CheckAnalysis failures

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603682#comment-17603682
 ] 

Apache Spark commented on SPARK-40416:
--

User 'dtenedor' has created a pull request for this issue:
https://github.com/apache/spark/pull/37840

> Add error classes for subquery expression CheckAnalysis failures
> 
>
> Key: SPARK-40416
> URL: https://issues.apache.org/jira/browse/SPARK-40416
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40416) Add error classes for subquery expression CheckAnalysis failures

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40416:


Assignee: Apache Spark

> Add error classes for subquery expression CheckAnalysis failures
> 
>
> Key: SPARK-40416
> URL: https://issues.apache.org/jira/browse/SPARK-40416
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40416) Add error classes for subquery expression CheckAnalysis failures

2022-09-13 Thread Daniel (Jira)
Daniel created SPARK-40416:
--

 Summary: Add error classes for subquery expression CheckAnalysis 
failures
 Key: SPARK-40416
 URL: https://issues.apache.org/jira/browse/SPARK-40416
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Daniel






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40394) Move CheckAnalysis error messages to use the new error framework

2022-09-13 Thread Daniel (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel updated SPARK-40394:
---
Summary: Move CheckAnalysis error messages to use the new error framework  
(was: Move subquery expression CheckAnalysis error messages to use the new 
error framework)

> Move CheckAnalysis error messages to use the new error framework
> 
>
> Key: SPARK-40394
> URL: https://issues.apache.org/jira/browse/SPARK-40394
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40415) Wrong version of okio in spark-deps file

2022-09-13 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-40415.
--
Resolution: Won't Fix

> Wrong version of okio in spark-deps file
> 
>
> Key: SPARK-40415
> URL: https://issues.apache.org/jira/browse/SPARK-40415
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: tree-before.txt
>
>
> kubernetes-client transitively depends on okio 1.15.0, which is a compile-scope 
> dependency for Spark.
> selenium-java depends on okio 1.14.0, which is a test-scope dependency for 
> Spark.
>  
> But Spark Project Assembly chooses okio 1.14.0 as the compile-scope dependency, 
> and the version in the spark-deps file is also 1.14.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2022-09-13 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-33152:
-
   Shepherd: Wenchen Fan  (was: Arnaud Doucet)
Description: 
h2. Q1. What are you trying to do? Articulate your objectives using absolutely 
no jargon.

Proposing a new algorithm to create, store and use constraints for removing 
redundant filters and inferring new filters.
The current algorithm has subpar performance in complex expression scenarios 
involving aliases (with certain use cases the compilation time can go into 
hours), has the potential to cause OOM, may miss removing redundant filters in 
different scenarios, may miss creating IsNotNull constraints in different 
scenarios, and does not push compound predicates in Join.
 # If not fixed, this issue can cause OutOfMemory errors or unacceptable query 
compilation times.
Have added a test "plan equivalence with case statements and performance 
comparison with benefit of more than 10x conservatively" in 
org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. *With 
this PR the compilation time is 247 ms vs 13958 ms without the change.*
 # It is more effective in filter pruning, as is evident in some of the tests in 
org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite, where 
the current code is not able to identify the redundant filter in some cases.
 # It is able to generate a better optimized plan for join queries as it can 
push compound predicates.
 # The current logic can miss a lot of possible cases of removing redundant 
predicates, as it fails to take into account whether the same attribute or its 
aliases are repeated multiple times in a complex expression.
 # There are cases where some of the optimizer rules involving removal of 
redundant predicates fail to remove them on the basis of constraint data. In some 
cases the rule works just by virtue of previous rules helping it out to 
cover the inaccuracy. That the ConstraintPropagation rule, and its function of 
removing redundant filters and adding newly inferred filters, depends on how 
some other unrelated, earlier optimizer rules behave is indicative of issues.
 # It does away with all the EqualNullSafe constraints, as this logic does not 
need those constraints to be created.
 # There is at least one test in the existing ConstraintPropagationSuite which is 
missing an IsNotNull constraint because the code incorrectly generated an 
EqualNullSafe constraint instead of an EqualTo constraint when using the existing 
Constraints code. With these changes, the test correctly creates an EqualTo 
constraint, resulting in an inferred IsNotNull constraint.
 # It does away with the current combinatorial logic of evaluating all the 
constraints, which can cause compilation to run into hours or cause OOM. The 
number of constraints stored is exactly the same as the number of filters 
encountered.

h2. Q2. What problem is this proposal NOT designed to solve?

It mainly focuses on compile-time performance, but in some cases it can benefit 
run-time characteristics too, such as inferring an IsNotNull filter or pushing 
down compound predicates on the join, which currently may get missed or does not 
happen, respectively, with the present code.
h2. Q3. How is it done today, and what are the limits of current practice?

The current ConstraintsPropagation code pessimistically tries to generate all the 
possible combinations of constraints based on the aliases (even then it may 
miss a lot of combinations if the expression is a complex expression involving the 
same attribute repeated multiple times within the expression and there are many 
aliases to that column). There are query plans in our production env which can 
result in the intermediate number of constraints going into hundreds of thousands, 
causing OOM or taking time running into hours. Also, there are cases where it 
incorrectly generates an EqualNullSafe constraint instead of an EqualTo constraint, 
thus missing a possible IsNotNull constraint on a column. 
Also, it only pushes single-column predicates to the other side of the join.
The constraints generated are, in some cases, missing the required ones, and 
the plan apparently behaves correctly only due to a preceding unrelated 
optimizer rule. There is a test which shows that with the bare minimum rules 
containing RemoveRedundantPredicate, it misses the removal of a redundant predicate.
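
To make the alias scenario concrete, the following is a small, hypothetical 
example of the kind of redundant filter this proposal targets (the data and 
column names are made up; whether the second filter is actually removed depends 
on the constraint-propagation behaviour discussed above):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ConstraintDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("constraint-demo")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, 2), (7, 8), (20, 30)).toDF("a", "b")

    val plan = df
      .filter(col("a") + col("b") > 10)                 // original predicate
      .select((col("a") + col("b")).as("c"), col("a"))
      .filter(col("c") > 10)                            // implied via the alias c = a + b

    // Inspect the optimized plan to see whether the second, redundant filter survives.
    plan.explain(true)
    spark.stop()
  }
}
{code}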
h2. Q4. What is new in your approach and why do you think it will be successful?

It solves all the above mentioned issues.
 # The number of constraints created is the same as the number of filters. No 
combinatorial creation of constraints. No need for EqualNullSafe constraints on 
aliases.
 # Can remove redundant predicates on any expression involving aliases, 
irrespective of the number of repeat occurrences, in all possible combinations.
 # Brings down query compilation time to a few minutes from hours.
 # Can push compound predicates on 

[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603664#comment-17603664
 ] 

Apache Spark commented on SPARK-33152:
--

User 'ahshahid' has created a pull request for this issue:
https://github.com/apache/spark/pull/37870

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> ---
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.1, 3.1.2
>Reporter: Asif
>Priority: Major
>  Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Proposing a new algorithm to create, store and use constraints for removing 
> redundant filters and inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios 
> involving aliases (with certain use cases the compilation time can go into 
> hours), has the potential to cause OOM, may miss removing redundant filters in 
> different scenarios, may miss creating IsNotNull constraints in different 
> scenarios, and does not push compound predicates in Join.
> # If not fixed, this issue can cause OutOfMemory errors or unacceptable query 
> compilation times.
> Have added a test "plan equivalence with case statements and performance 
> comparison with benefit of more than 10x conservatively" in 
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change.*
> # It is more effective in filter pruning, as is evident in some of the tests 
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite, 
> where the current code is not able to identify the redundant filter in some cases.
> # It is able to generate a better optimized plan for join queries as it can 
> push compound predicates.
> # The current logic can miss a lot of possible cases of removing redundant 
> predicates, as it fails to take into account whether the same attribute or its 
> aliases are repeated multiple times in a complex expression.
> # There are cases where some of the optimizer rules involving removal of 
> redundant predicates fail to remove them on the basis of constraint data. In 
> some cases the rule works just by virtue of previous rules helping it out to 
> cover the inaccuracy. That the ConstraintPropagation rule, and its function of 
> removing redundant filters and adding newly inferred filters, depends on how 
> some other unrelated, earlier optimizer rules behave is indicative of issues.
> # It does away with all the EqualNullSafe constraints, as this logic does not 
> need those constraints to be created.
> # There is at least one test in the existing ConstraintPropagationSuite which is 
> missing an IsNotNull constraint because the code incorrectly generated an 
> EqualNullSafe constraint instead of an EqualTo constraint when using the 
> existing Constraints code. With these changes, the test correctly creates an 
> EqualTo constraint, resulting in an inferred IsNotNull constraint.
> # It does away with the current combinatorial logic of evaluating all the 
> constraints, which can cause compilation to run into hours or cause OOM. The 
> number of constraints stored is exactly the same as the number of filters 
> encountered.
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile-time performance, but in some cases it can benefit 
> run-time characteristics too, such as inferring an IsNotNull filter or pushing 
> down compound predicates on the join, which currently may get missed or does not 
> happen, respectively, with the present code.
> h2. Q3. How is it done today, and what are the limits of current practice?
> The current ConstraintsPropagation code pessimistically tries to generate all 
> the possible combinations of constraints based on the aliases (even then 
> it may miss a lot of combinations if the expression is a complex expression 
> involving the same attribute repeated multiple times within the expression and 
> there are many aliases to that column). There are query plans in our 
> production env which can result in the intermediate number of constraints going 
> into hundreds of thousands, causing OOM or taking time running into hours. 
> Also, there are cases where it incorrectly generates an EqualNullSafe 
> constraint instead of an EqualTo constraint, thus missing a possible IsNotNull 
> constraint on a column. 
> Also, it only pushes single-column predicates to the other side of the join.
> The constraints generated are, in some cases, missing the required 

[jira] [Commented] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603663#comment-17603663
 ] 

Apache Spark commented on SPARK-33152:
--

User 'ahshahid' has created a pull request for this issue:
https://github.com/apache/spark/pull/37870

> SPIP: Constraint Propagation code causes OOM issues or increasing compilation 
> time to hours
> ---
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.1, 3.1.2
>Reporter: Asif
>Priority: Major
>  Labels: SPIP
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Proposing a new algorithm to create, store and use constraints for removing 
> redundant filters and inferring new filters.
> The current algorithm has subpar performance in complex expression scenarios 
> involving aliases (with certain use cases the compilation time can go into 
> hours), has the potential to cause OOM, may miss removing redundant filters in 
> different scenarios, may miss creating IsNotNull constraints in different 
> scenarios, and does not push compound predicates in Join.
> # If not fixed, this issue can cause OutOfMemory errors or unacceptable query 
> compilation times.
> Have added a test "plan equivalence with case statements and performance 
> comparison with benefit of more than 10x conservatively" in 
> org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. 
> *With this PR the compilation time is 247 ms vs 13958 ms without the change.*
> # It is more effective in filter pruning, as is evident in some of the tests 
> in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite, 
> where the current code is not able to identify the redundant filter in some cases.
> # It is able to generate a better optimized plan for join queries as it can 
> push compound predicates.
> # The current logic can miss a lot of possible cases of removing redundant 
> predicates, as it fails to take into account whether the same attribute or its 
> aliases are repeated multiple times in a complex expression.
> # There are cases where some of the optimizer rules involving removal of 
> redundant predicates fail to remove them on the basis of constraint data. In 
> some cases the rule works just by virtue of previous rules helping it out to 
> cover the inaccuracy. That the ConstraintPropagation rule, and its function of 
> removing redundant filters and adding newly inferred filters, depends on how 
> some other unrelated, earlier optimizer rules behave is indicative of issues.
> # It does away with all the EqualNullSafe constraints, as this logic does not 
> need those constraints to be created.
> # There is at least one test in the existing ConstraintPropagationSuite which is 
> missing an IsNotNull constraint because the code incorrectly generated an 
> EqualNullSafe constraint instead of an EqualTo constraint when using the 
> existing Constraints code. With these changes, the test correctly creates an 
> EqualTo constraint, resulting in an inferred IsNotNull constraint.
> # It does away with the current combinatorial logic of evaluating all the 
> constraints, which can cause compilation to run into hours or cause OOM. The 
> number of constraints stored is exactly the same as the number of filters 
> encountered.
> h2. Q2. What problem is this proposal NOT designed to solve?
> It mainly focuses on compile-time performance, but in some cases it can benefit 
> run-time characteristics too, such as inferring an IsNotNull filter or pushing 
> down compound predicates on the join, which currently may get missed or does not 
> happen, respectively, with the present code.
> h2. Q3. How is it done today, and what are the limits of current practice?
> The current ConstraintsPropagation code pessimistically tries to generate all 
> the possible combinations of constraints based on the aliases (even then 
> it may miss a lot of combinations if the expression is a complex expression 
> involving the same attribute repeated multiple times within the expression and 
> there are many aliases to that column). There are query plans in our 
> production env which can result in the intermediate number of constraints going 
> into hundreds of thousands, causing OOM or taking time running into hours. 
> Also, there are cases where it incorrectly generates an EqualNullSafe 
> constraint instead of an EqualTo constraint, thus missing a possible IsNotNull 
> constraint on a column. 
> Also, it only pushes single-column predicates to the other side of the join.
> The constraints generated are, in some cases, missing the required 

[jira] [Commented] (SPARK-40397) Migrate selenium-java from 3.1 to 4.2 and upgrade org.scalatestplus:selenium to 3.2.13.0

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603652#comment-17603652
 ] 

Apache Spark commented on SPARK-40397:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37868

> Migrate selenium-java from 3.1 to 4.2 and upgrade org.scalatestplus:selenium 
> to 3.2.13.0
> 
>
> Key: SPARK-40397
> URL: https://issues.apache.org/jira/browse/SPARK-40397
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40397) Migrate selenium-java from 3.1 to 4.2 and upgrade org.scalatestplus:selenium to 3.2.13.0

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40397:


Assignee: (was: Apache Spark)

> Migrate selenium-java from 3.1 to 4.2 and upgrade org.scalatestplus:selenium 
> to 3.2.13.0
> 
>
> Key: SPARK-40397
> URL: https://issues.apache.org/jira/browse/SPARK-40397
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40397) Migrate selenium-java from 3.1 to 4.2 and upgrade org.scalatestplus:selenium to 3.2.13.0

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40397:


Assignee: Apache Spark

> Migrate selenium-java from 3.1 to 4.2 and upgrade org.scalatestplus:selenium 
> to 3.2.13.0
> 
>
> Key: SPARK-40397
> URL: https://issues.apache.org/jira/browse/SPARK-40397
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40397) Migrate selenium-java from 3.1 to 4.2 and upgrade org.scalatestplus:selenium to 3.2.13.0

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603650#comment-17603650
 ] 

Apache Spark commented on SPARK-40397:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37868

> Migrate selenium-java from 3.1 to 4.2 and upgrade org.scalatestplus:selenium 
> to 3.2.13.0
> 
>
> Key: SPARK-40397
> URL: https://issues.apache.org/jira/browse/SPARK-40397
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40415) Wrong version of okio in spark-deps file

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40415:


Assignee: Apache Spark

> Wrong version of okio in spark-deps file
> 
>
> Key: SPARK-40415
> URL: https://issues.apache.org/jira/browse/SPARK-40415
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
> Attachments: tree-before.txt
>
>
> kubernetes-client transitively depends on okio 1.15.0, which is a compile-scope 
> dependency for Spark.
> selenium-java depends on okio 1.14.0, which is a test-scope dependency for 
> Spark.
>  
> But Spark Project Assembly chooses okio 1.14.0 as the compile-scope dependency, 
> and the version in the spark-deps file is also 1.14.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40415) Wrong version of okio in spark-deps file

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40415:


Assignee: (was: Apache Spark)

> Wrong version of okio in spark-deps file
> 
>
> Key: SPARK-40415
> URL: https://issues.apache.org/jira/browse/SPARK-40415
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: tree-before.txt
>
>
> kubernetes-client transitively depends on okio 1.15.0, which is a compile-scope 
> dependency for Spark.
> selenium-java depends on okio 1.14.0, which is a test-scope dependency for 
> Spark.
>  
> But Spark Project Assembly chooses okio 1.14.0 as the compile-scope dependency, 
> and the version in the spark-deps file is also 1.14.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40415) Wrong version of okio in spark-deps file

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603631#comment-17603631
 ] 

Apache Spark commented on SPARK-40415:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37867

> Wrong version of okio in spark-deps file
> 
>
> Key: SPARK-40415
> URL: https://issues.apache.org/jira/browse/SPARK-40415
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: tree-before.txt
>
>
> kubernetes-client transitively depends on okio 1.15.0, which is a compile-scope 
> dependency for Spark.
> selenium-java depends on okio 1.14.0, which is a test-scope dependency for 
> Spark.
>  
> But Spark Project Assembly chooses okio 1.14.0 as the compile-scope dependency, 
> and the version in the spark-deps file is also 1.14.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40415) Wrong version of okio in spark-deps file

2022-09-13 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-40415:
-
Attachment: tree-before.txt

> Wrong version of okio in spark-deps file
> 
>
> Key: SPARK-40415
> URL: https://issues.apache.org/jira/browse/SPARK-40415
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
> Attachments: tree-before.txt
>
>
> kubernetes-client transitively depends on okio 1.15.0, which is a compile-scope 
> dependency for Spark.
> selenium-java depends on okio 1.14.0, which is a test-scope dependency for 
> Spark.
>  
> But Spark Project Assembly chooses okio 1.14.0 as the compile-scope dependency, 
> and the version in the spark-deps file is also 1.14.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40415) Wrong version of okio in spark-deps file

2022-09-13 Thread Yang Jie (Jira)
Yang Jie created SPARK-40415:


 Summary: Wrong version of okio in spark-deps file
 Key: SPARK-40415
 URL: https://issues.apache.org/jira/browse/SPARK-40415
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.4.0
Reporter: Yang Jie


kubernetes-client transitively depends on okio 1.15.0, which is a compile-scope 
dependency for Spark.

selenium-java depends on okio 1.14.0, which is a test-scope dependency for Spark.

 

But Spark Project Assembly chooses okio 1.14.0 as the compile-scope dependency, and 
the version in the spark-deps file is also 1.14.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603619#comment-17603619
 ] 

Apache Spark commented on SPARK-40362:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/37866

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Asif
>Priority: Major
>  Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code, which is now done in two stages, canonicalization 
> involving commutative operators is broken if they are subexpressions of 
> certain types of expressions which override precanonicalize, for example 
> BinaryComparison.
> Consider the following expression:
> a + b > 10
>          GT
>             |
> a + b          10
> The BinaryComparison operator, in precanonicalize, first precanonicalizes its 
> children and then may swap its operands based on the left/right hashCode inequality.
> Let's say Add(a + b).hashCode is > 10.hashCode; as a result GT is converted 
> to LT.
> But if the same tree is created as
>            GT
>             |
>  b + a      10
> the hashCode of Add(b, a) is not the same as Add(a, b), thus it is possible that 
> for this tree
> Add(b + a).hashCode is < 10.hashCode, in which case GT remains as is.
> Thus two similar trees result in different canonicalizations, one having GT and 
> the other having LT.
>  
> The problem occurs because for commutative expressions the canonicalization 
> normalizes the expression with a consistent hashCode, which is not the case with 
> precanonicalize, as the hashCodes of a commutative expression's precanonicalized 
> and canonicalized forms are different.
>  
>  
> The test:
> {quote}test("bug X") {
>   val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>   val y = tr1.where('a.attr + 'c.attr > 10).analyze
>   val fullCond = y.asInstanceOf[Filter].condition.clone()
>   val addExpr = (fullCond match {
>     // ... (match arms not preserved in the original description)
>   }).clone().asInstanceOf[Add]
>   val canonicalizedFullCond = fullCond.canonicalized
>   // swap the operands of Add
>   val newAddExpr = Add(addExpr.right, addExpr.left)
>   // build a new condition which is the same as the previous one, but with the
>   // operands of Add reversed
>   val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized
>   assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {quote}
> This test fails.
> The fix which I propose is that for commutative expressions, precanonicalize 
> should be overridden and Canonicalize.reorderCommutativeOperators should be 
> invoked on the expression there instead of in canonicalize; effectively, for 
> commutative operators (Add, Or, Multiply, And, etc.) canonicalize and 
> precanonicalize should be the same.
> PR:
> [https://github.com/apache/spark/pull/37824]
>  
>  
> I am also trying a better fix, whereby the idea is that for commutative 
> expressions the murmur hashCode is calculated using unorderedHash so that 
> it is order-independent (i.e. symmetric).
> The above approach works fine, but in the case of Least & Greatest, the 
> Product's element is a Seq, and that messes with the consistency of hashCode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40334) Implement `GroupBy.prod`.

2022-09-13 Thread Haejoon Lee (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603611#comment-17603611
 ] 

Haejoon Lee commented on SPARK-40334:
-

[~ayudovin] No worries! Please keep working on it!

FYI: You can leave a comment like "I'm working on this" before you start to 
avoid conflicts :)

> Implement `GroupBy.prod`.
> -
>
> Key: SPARK-40334
> URL: https://issues.apache.org/jira/browse/SPARK-40334
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> We should implement `GroupBy.prod` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.prod.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603603#comment-17603603
 ] 

Apache Spark commented on SPARK-40384:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37865

> Do base image real in time build only when infra dockerfile is changed
> --
>
> Key: SPARK-40384
> URL: https://issues.apache.org/jira/browse/SPARK-40384
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40400) Pass error message parameters to exceptions as a map

2022-09-13 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-40400.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37834
[https://github.com/apache/spark/pull/37834]

> Pass error message parameters to exceptions as a map
> 
>
> Key: SPARK-40400
> URL: https://issues.apache.org/jira/browse/SPARK-40400
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Modify Spark exceptions to pass message parameters as a map, not an array. At 
> the moment, we still depend on the order of parameters in error-classes.json, 
> so we can change the text of error messages but not the order of 
> parameters. For example, pass Map[String, String] instead of Array[String] 
> in exceptions like:
> {code:scala}
> private[spark] class SparkRuntimeException(
> errorClass: String,
> errorSubClass: Option[String] = None,
> messageParameters: Array[String]
> ...)
> {code}
> It should be replaced by:
> {code:scala}
> new SparkRuntimeException(
>   errorClass = "UNSUPPORTED_FEATURE",
>   errorSubClass = "LITERAL_TYPE",
>   messageParameters = Map(
> "value" -> v.toString,
> "type" -> v.getClass.toString))
> {code}
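
A small illustration (not Spark's actual error framework; the placeholder syntax 
and names below are assumed for the example) of why a named-parameter map 
decouples the message text from parameter order:

{code:scala}
object ErrorMessageDemo {
  // Substitute named placeholders of the form <name> with values from the map.
  // With named parameters the template can reorder or reuse parameters freely,
  // which a position-dependent Array-based scheme cannot do.
  def format(template: String, params: Map[String, String]): String =
    params.foldLeft(template) { case (msg, (k, v)) => msg.replace(s"<$k>", v) }

  def main(args: Array[String]): Unit = {
    val params = Map(
      "value" -> "DATE '2022-02-29'",
      "type" -> "class java.time.LocalDate")
    // Two templates with different parameter order, both rendered from the same map.
    println(format("The literal <value> of type <type> is not supported.", params))
    println(format("Type <type> is not supported for literal <value>.", params))
  }
}
{code}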



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20624) SPIP: Add better handling for node shutdown

2022-09-13 Thread Juliusz Sompolski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603577#comment-17603577
 ] 

Juliusz Sompolski commented on SPARK-20624:
---

[~holden] Are these new APIs documented? I can't seem to find them in the 
official Spark documentation.
Should they be mentioned e.g. in 
https://spark.apache.org/docs/latest/job-scheduling.html#graceful-decommission-of-executors
 ?

> SPIP: Add better handling for node shutdown
> ---
>
> Key: SPARK-20624
> URL: https://issues.apache.org/jira/browse/SPARK-20624
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Holden Karau
>Priority: Major
>
> While we've done some good work with better handling when Spark is choosing 
> to decommission nodes (SPARK-7955), it might make sense in environments where 
> we get preempted without our own choice (e.g. YARN over-commit, EC2 spot 
> instances, GCE Preemptiable instances, etc.) to do something for the data on 
> the node (or at least not schedule any new tasks).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed

2022-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40384.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37828
[https://github.com/apache/spark/pull/37828]

> Do base image real in time build only when infra dockerfile is changed
> --
>
> Key: SPARK-40384
> URL: https://issues.apache.org/jira/browse/SPARK-40384
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40384) Do base image real in time build only when infra dockerfile is changed

2022-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40384:


Assignee: Yikun Jiang

> Do base image real in time build only when infra dockerfile is changed
> --
>
> Key: SPARK-40384
> URL: https://issues.apache.org/jira/browse/SPARK-40384
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40414) Fix PythonArrowInput and PythonArrowOutput to be more generic to handle complicated type/data

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40414:


Assignee: Apache Spark

> Fix PythonArrowInput and PythonArrowOutput to be more generic to handle 
> complicated type/data
> -
>
> Key: SPARK-40414
> URL: https://issues.apache.org/jira/browse/SPARK-40414
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> During the work of flatMapGroupsWithState in PySpark, we figured out that we 
> are unable to reuse PythonArrowInput and PythonArrowOutput, as 
> PythonArrowInput and PythonArrowOutput are too specific to the strict input 
> data (row) and output data.
> To reuse the implementations, we should make these traits more general so they 
> can handle more generic types of data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40414) Fix PythonArrowInput and PythonArrowOutput to be more generic to handle complicated type/data

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603571#comment-17603571
 ] 

Apache Spark commented on SPARK-40414:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/37864

> Fix PythonArrowInput and PythonArrowOutput to be more generic to handle 
> complicated type/data
> -
>
> Key: SPARK-40414
> URL: https://issues.apache.org/jira/browse/SPARK-40414
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> During the work of flatMapGroupsWithState in PySpark, we figured out that we 
> are unable to reuse PythonArrowInput and PythonArrowOutput, as 
> PythonArrowInput and PythonArrowOutput are too specific to the strict input 
> data (row) and output data.
> To reuse the implementations, we should make these traits more general so they 
> can handle more generic types of data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40414) Fix PythonArrowInput and PythonArrowOutput to be more generic to handle complicated type/data

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40414:


Assignee: (was: Apache Spark)

> Fix PythonArrowInput and PythonArrowOutput to be more generic to handle 
> complicated type/data
> -
>
> Key: SPARK-40414
> URL: https://issues.apache.org/jira/browse/SPARK-40414
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> During the work of flatMapGroupsWithState in PySpark, we figured out that we 
> are unable to reuse PythonArrowInput and PythonArrowOutput, as 
> PythonArrowInput and PythonArrowOutput are too specific to the strict input 
> data (row) and output data.
> To reuse the implementations, we should make these traits more general so they 
> can handle more generic types of data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40414) Fix PythonArrowInput and PythonArrowOutput to be more generic to handle complicated type/data

2022-09-13 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603559#comment-17603559
 ] 

Jungtaek Lim commented on SPARK-40414:
--

Will submit a PR soon.

> Fix PythonArrowInput and PythonArrowOutput to be more generic to handle 
> complicated type/data
> -
>
> Key: SPARK-40414
> URL: https://issues.apache.org/jira/browse/SPARK-40414
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> While working on flatMapGroupsWithState in PySpark, we found that we cannot 
> reuse PythonArrowInput and PythonArrowOutput, because both traits are tied 
> too closely to the specific row-based input and output data they currently 
> handle.
> To reuse these implementations, we should make the traits more general so 
> they can handle more generic types of data.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40414) Fix PythonArrowInput and PythonArrowOutput to be more generic to handle complicated type/data

2022-09-13 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-40414:


 Summary: Fix PythonArrowInput and PythonArrowOutput to be more 
generic to handle complicated type/data
 Key: SPARK-40414
 URL: https://issues.apache.org/jira/browse/SPARK-40414
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Jungtaek Lim


While working on flatMapGroupsWithState in PySpark, we found that we cannot reuse 
PythonArrowInput and PythonArrowOutput, because both traits are tied too closely 
to the specific row-based input and output data they currently handle.

To reuse these implementations, we should make the traits more general so they 
can handle more generic types of data.
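
As a purely illustrative sketch of what "making the traits more generic" can look 
like: the trait and method names below are invented for illustration and do not 
reflect Spark's actual PythonArrowInput/PythonArrowOutput definitions; the actual 
change is in the pull request referenced earlier in this thread.

{code:scala}
// Hypothetical names only. The idea: replace a trait hard-wired to one element
// type with a trait parameterized by IN, so the same driving loop can be reused
// for grouped or otherwise more complex inputs.
import java.io.DataOutputStream

// Before: tied to a single, concrete element representation.
trait RowOnlyArrowInput {
  def writeRow(row: Array[Any], out: DataOutputStream): Unit
}

// After: generic in the element type. Implementations decide how one element is
// serialized, while the shared driving loop stays in the trait.
trait GenericArrowInput[IN] {
  protected def writeElement(element: IN, out: DataOutputStream): Unit

  final def writeAll(input: Iterator[IN], out: DataOutputStream): Unit =
    input.foreach(writeElement(_, out))
}
{code}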



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40412) limit(x,y) + subquery causes data loss and out-of-order results

2022-09-13 Thread FengJia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603485#comment-17603485
 ] 

FengJia commented on SPARK-40412:
-

Huawei Cloud's solution is to add an ORDER BY clause.

> limit(x,y) + subquery causes data loss and out-of-order results
> 
>
> Key: SPARK-40412
> URL: https://issues.apache.org/jira/browse/SPARK-40412
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.5
> Environment: hive on spark
> hive 3.1.0
> spark 2.4.5
>Reporter: FengJia
>Priority: Major
>  Labels: hiveonspark, limit
>
> select * 
> from(
> select * from
> table
> limit 10,20
> )
> The result has only 10 rows; they are not rows 11 to 20, and the order is wrong.
>  
> select * from
> table
> limit 10,20
> The result has 20 rows, ordered from row 11 to row 30.
> select * 
> from(
> select * from
> table
> order by id
> limit 10,20
> )
> The result has 20 rows, also ordered from row 11 to row 30.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40412) limit(x,y) + subquery causes data loss and out-of-order results

2022-09-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-40412.
-
Resolution: Invalid

Spark SQL does not support {{limit n, m}}. Please contact Huawei Cloud.
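
As an illustrative aside: rows coming out of a subquery carry no guaranteed order, 
so offset-style pagination needs an explicit ordering. Below is a minimal sketch of 
a portable way to get "rows 11 to 30" in Spark SQL using a row_number() window over 
the sort key; the table and column names are placeholders.

{code:scala}
// Minimal sketch; the table/column names are placeholders.
import org.apache.spark.sql.SparkSession

object PaginationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pagination").getOrCreate()
    spark.range(100).toDF("id").createOrReplaceTempView("t")

    // Deterministic analogue of `limit 10,20`: rows 11 through 30 ordered by id.
    spark.sql(
      """
        |SELECT id
        |FROM (
        |  SELECT id, row_number() OVER (ORDER BY id) AS rn
        |  FROM t
        |) tmp
        |WHERE rn > 10 AND rn <= 30
        |""".stripMargin).show(30)

    spark.stop()
  }
}
{code}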

> limit(x,y) + subquery causes data loss and out-of-order results
> 
>
> Key: SPARK-40412
> URL: https://issues.apache.org/jira/browse/SPARK-40412
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.5
> Environment: hive on spark
> hive 3.1.0
> spark 2.4.5
>Reporter: FengJia
>Priority: Major
>  Labels: hiveonspark, limit
>
> select * 
> from(
> select * from
> table
> limit 10,20
> )
> The result has only 10 rows; they are not rows 11 to 20, and the order is wrong.
>  
> select * from
> table
> limit 10,20
> The result has 20 rows, ordered from row 11 to row 30.
> select * 
> from(
> select * from
> table
> order by id
> limit 10,20
> )
> The result has 20 rows, also ordered from row 11 to row 30.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40412) limit(x,y) + subquery causes data loss and out-of-order results

2022-09-13 Thread FengJia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603473#comment-17603473
 ] 

FengJia commented on SPARK-40412:
-

The Huawei Cloud service I use does not let me change the Spark version. Did you 
make a mistake when running my code?

> limit(x,y) + subquery causes data loss and out-of-order results
> 
>
> Key: SPARK-40412
> URL: https://issues.apache.org/jira/browse/SPARK-40412
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.5
> Environment: hive on spark
> hive 3.1.0
> spark 2.4.5
>Reporter: FengJia
>Priority: Major
>  Labels: hiveonspark, limit
>
> select * 
> from(
> select * from
> table
> limit 10,20
> )
> The result has only 10 rows; they are not rows 11 to 20, and the order is wrong.
>  
> select * from
> table
> limit 10,20
> The result has 20 rows, ordered from row 11 to row 30.
> select * 
> from(
> select * from
> table
> order by id
> limit 10,20
> )
> The result has 20 rows, also ordered from row 11 to row 30.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40334) Implement `GroupBy.prod`.

2022-09-13 Thread Artsiom Yudovin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603463#comment-17603463
 ] 

Artsiom Yudovin commented on SPARK-40334:
-

[~itholic], Hi, I started working on this ticket 2 days ago. Does it make sense 
to continue, or should I choose another ticket? 

> Implement `GroupBy.prod`.
> -
>
> Key: SPARK-40334
> URL: https://issues.apache.org/jira/browse/SPARK-40334
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> We should implement `GroupBy.prod` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.prod.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40324) Provide a query context of ParseException

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603462#comment-17603462
 ] 

Apache Spark commented on SPARK-40324:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37861

> Provide a query context of ParseException
> -
>
> Key: SPARK-40324
> URL: https://issues.apache.org/jira/browse/SPARK-40324
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Extend the exception ParseException and add a queryContext to it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40178) Rebalance/Repartition Hints Not Working in PySpark

2022-09-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40178:

Target Version/s:   (was: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.4.0, 3.3.1)

> Rebalance/Repartition Hints Not Working in PySpark
> --
>
> Key: SPARK-40178
> URL: https://issues.apache.org/jira/browse/SPARK-40178
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
> Environment: Mac OSX 11.4 Big Sur
> Python 3.9.7
> Spark version >= 3.2.0 (perhaps before as well).
>Reporter: Maxwell Conradt
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Partitioning hints in PySpark do not work because the column parameters are 
> not converted to Catalyst `Expression` instances before being passed to the 
> hint resolver.
> The behavior of the hints is documented 
> [here|https://spark.apache.org/docs/3.3.0/sql-ref-syntax-qry-select-hints.html#partitioning-hints-types].
> Example:
>  
> {code:java}
> >>> df = spark.range(1024)
> >>> 
> >>> df
> DataFrame[id: bigint]
> >>> df.hint("rebalance", "id")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 
> 980, in hint
>     jdf = self._jdf.hint(name, self._jseq(parameters))
>   File 
> "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>  line 1322, in __call__
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, 
> in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: REBALANCE Hint parameter should include 
> columns, but id found
> >>> df.hint("repartition", "id")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 
> 980, in hint
>     jdf = self._jdf.hint(name, self._jseq(parameters))
>   File 
> "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>  line 1322, in __call__
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, 
> in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: REPARTITION Hint parameter should 
> include columns, but id found {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40178) Rebalance/Repartition Hints Not Working in PySpark

2022-09-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40178:

Fix Version/s: (was: 3.2.0)
   (was: 3.3.0)
   (was: 3.2.1)
   (was: 3.2.2)
   (was: 3.4.0)
   (was: 3.3.1)

> Rebalance/Repartition Hints Not Working in PySpark
> --
>
> Key: SPARK-40178
> URL: https://issues.apache.org/jira/browse/SPARK-40178
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
> Environment: Mac OSX 11.4 Big Sur
> Python 3.9.7
> Spark version >= 3.2.0 (perhaps before as well).
>Reporter: Maxwell Conradt
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Partitioning hints in PySpark do not work because the column parameters are 
> not converted to Catalyst `Expression` instances before being passed to the 
> hint resolver.
> The behavior of the hints is documented 
> [here|https://spark.apache.org/docs/3.3.0/sql-ref-syntax-qry-select-hints.html#partitioning-hints-types].
> Example:
>  
> {code:java}
> >>> df = spark.range(1024)
> >>> 
> >>> df
> DataFrame[id: bigint]
> >>> df.hint("rebalance", "id")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 
> 980, in hint
>     jdf = self._jdf.hint(name, self._jseq(parameters))
>   File 
> "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>  line 1322, in __call__
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, 
> in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: REBALANCE Hint parameter should include 
> columns, but id found
> >>> df.hint("repartition", "id")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/dataframe.py", line 
> 980, in hint
>     jdf = self._jdf.hint(name, self._jseq(parameters))
>   File 
> "/Users/maxwellconradt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
>  line 1322, in __call__
>   File "/Users/maxwellconradt/spark/python/pyspark/sql/utils.py", line 196, 
> in deco
>     raise converted from None
> pyspark.sql.utils.AnalysisException: REPARTITION Hint parameter should 
> include columns, but id found {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40177) Simplify join condition of form (a==b) || (a==null && b==null) to a<=>b

2022-09-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40177:

Target Version/s:   (was: 3.3.1)

> Simplify join condition of form (a==b) || (a==null && b==null) to a<=>b
> -
>
> Key: SPARK-40177
> URL: https://issues.apache.org/jira/browse/SPARK-40177
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Ayushi Agarwal
>Priority: Major
>
> If the join condition is of the form key1==key2 || (key1==null && key2==null), 
> the join is executed as a Broadcast Nested Loop Join because this condition does 
> not qualify as an equi-join condition. BNLJ takes more time than a sort-merge or 
> broadcast join. The condition can be converted to key1<=>key2 so that the join 
> is planned as a broadcast or sort-merge join.
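
A small self-contained sketch of the equivalence behind the proposed rewrite (the 
data and names below are made up for illustration): Spark's null-safe equality 
operator <=> already means "equal, or both NULL", so the disjunctive condition can 
be expressed as a single null-safe equi-join key that the planner can turn into a 
sort-merge or broadcast join.

{code:scala}
// Illustrative only; left/right and their columns are invented sample data.
import org.apache.spark.sql.SparkSession

object NullSafeJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nullSafeJoin").getOrCreate()
    import spark.implicits._

    val left  = Seq(Some(1), None, Some(3)).toDF("key1")
    val right = Seq(Some(1), None, Some(4)).toDF("key2")

    // Disjunctive form: not recognized as an equi-join condition, so it is
    // planned as a Broadcast Nested Loop Join.
    val slow = left.join(right,
      (left("key1") === right("key2")) ||
        (left("key1").isNull && right("key2").isNull))

    // Logically equivalent null-safe equi-join condition.
    val fast = left.join(right, left("key1") <=> right("key2"))

    slow.show()
    fast.show()
    spark.stop()
  }
}
{code}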



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40177) Simplify join condition of form (a==b) || (a==null && b==null) to a<=>b

2022-09-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40177:

Fix Version/s: (was: 3.3.1)

> Simplify join condition of form (a==b) || (a==null && b==null) to a<=>b
> -
>
> Key: SPARK-40177
> URL: https://issues.apache.org/jira/browse/SPARK-40177
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Ayushi Agarwal
>Priority: Major
>
> If the join condition is of the form key1==key2 || (key1==null && key2==null), 
> the join is executed as a Broadcast Nested Loop Join because this condition does 
> not qualify as an equi-join condition. BNLJ takes more time than a sort-merge or 
> broadcast join. The condition can be converted to key1<=>key2 so that the join 
> is planned as a broadcast or sort-merge join.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40413) Column.isin produces non-boolean results

2022-09-13 Thread Andreas Franz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Franz updated SPARK-40413:
--
Description: 
I observed an inconsistent behaviour using the Column.isin function. The 
[documentation|https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Column.html#isin(list:Any*):org.apache.spark.sql.Column]
 states that an "up-cast" takes place when different data types are involved. 
When working with _null_ values, the results are confusing to me.

I prepared a small example demonstrating the issue
{code:java}
package example

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.functions._

object Test {


def main(args: Array[String]): Unit = {

val spark = SparkSession.builder()
.appName("App")
.master("local[*]")
.config("spark.driver.host", "localhost")
.config("spark.ui.enabled", "false")
.getOrCreate()

val schema = StructType(
Array(
StructField("name", StringType, nullable = true)
)
)

val data = Seq(
Row("a"),
Row("b"),
Row("c"),
Row(""),
Row(null)
).toList

val list1 = Array("a", "d", "")
val list2 = Array("a", "d", "", null)

val dataFrame = 
spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

dataFrame
.withColumn("name_is_in_list_1", col("name").isin(list1: _*))
.show(10, truncate = false)

/*
++-+
|name|name_is_in_list_1|
++-+
|a   |true |
|b   |false|
|c   |false|
||true |
|null|null | // check value null is not contained in 
list1, why is null returned here? Expected result: false
++-+
 */

dataFrame
.withColumn("name_is_in_list_2", col("name").isin(list2: _*))
.show(10, truncate = false)

/*
++-+
|name|name_is_in_list_2|
++-+
|a   |true |
|b   |null | // check value "b" is not contained in 
list2, why is null returned here? Expected result: false
|c   |null | // check value "c" is not contained in 
list2, why is null returned here? Expected result: false
||true |
|null|null | // check value null is in list2, why is 
null returned here? Expected result: true
++-+
 */


val data2 = Seq(
Row("a"),
Row("b"),
Row("c"),
Row(""),
).toList

val dataFrame2 = 
spark.createDataFrame(spark.sparkContext.parallelize(data2), schema)

dataFrame2
.withColumn("name_is_in_list_2", col("name").isin(list2: _*))
.show(10, truncate = false)

/*
++-+
|name|name_is_in_list_2|
++-+
|a   |true |
|b   |null | // check value "b" is not contained in 
list2, why is null returned here? Expected result: false
|c   |null | // check value "c" is not contained in 
list2, why is null returned here? Expected result: false
||true |
++-+
 */
}
}{code}
 

  was:
I observed an inconsistent behaviour using the Column.isin function. The 
[documentation|https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Column.html#isin(list:Any*):org.apache.spark.sql.Column]
 states that an "up-cast" takes place when different data types are involved. 
When working with _null_ values, the results are confusing to me.

I prepared a small example demonstrating the issue
{code:java}
package example

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.functions._

object Test {


def main(args: Array[String]): Unit = {

val spark = SparkSession.builder()
.appName("App")
.master("local[*]")
.config("spark.driver.host", "localhost")
.config("spark.ui.enabled", "false")
.getOrCreate()

val schema = StructType(
Array(
StructField("name", StringType, nullable = true)
)
)

val data = Seq(
Row("a"),
Row("b"),
Row("c"),
Row(""),
 

[jira] [Resolved] (SPARK-38734) Test the error class: INDEX_OUT_OF_BOUNDS

2022-09-13 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38734.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37857
[https://github.com/apache/spark/pull/37857]

> Test the error class: INDEX_OUT_OF_BOUNDS
> -
>
> Key: SPARK-38734
> URL: https://issues.apache.org/jira/browse/SPARK-38734
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Minor
>  Labels: starter
> Fix For: 3.4.0
>
>
> Add at least one test for the error class *INDEX_OUT_OF_BOUNDS* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def indexOutOfBoundsOfArrayDataError(idx: Int): Throwable = {
> new SparkIndexOutOfBoundsException(errorClass = "INDEX_OUT_OF_BOUNDS", 
> Array(idx.toString))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38734) Test the error class: INDEX_OUT_OF_BOUNDS

2022-09-13 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38734:


Assignee: Max Gekk

> Test the error class: INDEX_OUT_OF_BOUNDS
> -
>
> Key: SPARK-38734
> URL: https://issues.apache.org/jira/browse/SPARK-38734
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Minor
>  Labels: starter
>
> Add at least one test for the error class *INDEX_OUT_OF_BOUNDS* to 
> QueryExecutionErrorsSuite. The test should cover the exception thrown in 
> QueryExecutionErrors:
> {code:scala}
>   def indexOutOfBoundsOfArrayDataError(idx: Int): Throwable = {
> new SparkIndexOutOfBoundsException(errorClass = "INDEX_OUT_OF_BOUNDS", 
> Array(idx.toString))
>   }
> {code}
> For example, here is a test for the error class *UNSUPPORTED_FEATURE*: 
> https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170
> +The test must have a check of:+
> # the entire error message
> # sqlState if it is defined in the error-classes.json file
> # the error class



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40413) Column.isin produces non-boolean results

2022-09-13 Thread Andreas Franz (Jira)
Andreas Franz created SPARK-40413:
-

 Summary: Column.isin produces non-boolean results
 Key: SPARK-40413
 URL: https://issues.apache.org/jira/browse/SPARK-40413
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Andreas Franz


I observed an inconsistent behaviour using the Column.isin function. The 
[documentation|https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Column.html#isin(list:Any*):org.apache.spark.sql.Column]
 states that an "up-cast" takes place when different data types are involved. 
When working with _null_ values, the results are confusing to me.

I prepared a small example demonstrating the issue
{code:java}
package example

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.functions._

object Test {


def main(args: Array[String]): Unit = {

val spark = SparkSession.builder()
.appName("App")
.master("local[*]")
.config("spark.driver.host", "localhost")
.config("spark.ui.enabled", "false")
.getOrCreate()

val schema = StructType(
Array(
StructField("name", StringType, nullable = true)
)
)

val data = Seq(
Row("a"),
Row("b"),
Row("c"),
Row(""),
Row(null)
).toList

val list1 = Array("a", "d", "")
val list2 = Array("a", "d", "", null)

val dataFrame = 
spark.createDataFrame(spark.sparkContext.parallelize(data), schema)

dataFrame
.withColumn("name_is_in_list_1", col("name").isin(list1: _*))
.show(10, truncate = false)

/*
++-+
|name|name_is_in_list_1|
++-+
|a   |true |
|b   |false|
|c   |false|
||true |
|null|null | // check value null is not contained in 
list1, why is null returned here? Expected result: false
++-+
 */

dataFrame
.withColumn("name_is_in_list_2", col("name").isin(list2: _*))
.show(10, truncate = false)

/*
++-+
|name|name_is_in_list_2|
++-+
|a   |true |
|b   |null | // check value "b" is not contained in 
list2, why is null returned here? Expected result: false
|c   |null | // check value "c" is not contained in 
list2, why is null returned here? Expected result: false
||true |
|null|null | // check value null is in list2, why is 
null returned here? Expected result: true
++-+
 */


val data2 = Seq(
Row("a"),
Row("b"),
Row("c"),
Row(""),
).toList

val dataFrame2 = 
spark.createDataFrame(spark.sparkContext.parallelize(data2), schema)

dataFrame2
.withColumn("name_is_in_list_2", col("name").isin(list2: _*))
.show(10, truncate = false)

/*
++-+
|name|name_is_in_list_2|
++-+
|a   |true |
|b   |null | // check value "b" is not contained in 
list2, why is null returned here? Expected result: false
|c   |null | // check value "c" is not contained in 
list2, why is null returned here? Expected result: false
||true |
++-+
 */
}
}{code}
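
A hedged note rather than an official answer: Column.isin follows SQL's 
three-valued IN semantics, so a NULL on either side of the comparison yields NULL 
rather than false, which matches the output above. If a strict boolean is needed, 
the NULL can be coerced explicitly; the sketch below reuses dataFrame and list2 
from the example and treats NULL as a non-match (so it still returns false for the 
NULL row, unlike the "expected: true" case noted above).

{code:scala}
// Sketch only, reusing dataFrame/list2 from the example above. coalesce turns the
// NULL produced by three-valued IN logic into false, so the result is always boolean.
import org.apache.spark.sql.functions.{coalesce, col, lit}

dataFrame
  .withColumn("name_is_in_list_2_strict", coalesce(col("name").isin(list2: _*), lit(false)))
  .show(10, truncate = false)
{code}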
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40412) limit(x,y) + subquery causes data loss and out-of-order results

2022-09-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40412:

Fix Version/s: (was: 2.4.5)

> limit(x,y) + subquery causes data loss and out-of-order results
> 
>
> Key: SPARK-40412
> URL: https://issues.apache.org/jira/browse/SPARK-40412
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.5
> Environment: hive on spark
> hive 3.1.0
> spark 2.4.5
>Reporter: FengJia
>Priority: Major
>  Labels: hiveonspark, limit
>
> select * 
> from(
> select * from
> table
> limit 10,20
> )
> The result has only 10 rows; they are not rows 11 to 20, and the order is wrong.
>  
> select * from
> table
> limit 10,20
> The result has 20 rows, ordered from row 11 to row 30.
> select * 
> from(
> select * from
> table
> order by id
> limit 10,20
> )
> The result has 20 rows, also ordered from row 11 to row 30.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40412) limit(x,y) + subquery causes data loss and out-of-order results

2022-09-13 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603436#comment-17603436
 ] 

Yuming Wang commented on SPARK-40412:
-

Could you test the latest Spark?

> limit(x,y) + subquery causes data loss and out-of-order results
> 
>
> Key: SPARK-40412
> URL: https://issues.apache.org/jira/browse/SPARK-40412
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.5
> Environment: hive on spark
> hive 3.1.0
> spark 2.4.5
>Reporter: FengJia
>Priority: Major
>  Labels: hiveonspark, limit
>
> select * 
> from(
> select * from
> table
> limit 10,20
> )
> The result has only 10 rows; they are not rows 11 to 20, and the order is wrong.
>  
> select * from
> table
> limit 10,20
> The result has 20 rows, ordered from row 11 to row 30.
> select * 
> from(
> select * from
> table
> order by id
> limit 10,20
> )
> The result has 20 rows, also ordered from row 11 to row 30.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40412) limit(x,y) + subquery causes data loss and out-of-order results

2022-09-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-40412:

Target Version/s:   (was: 2.4.5)

> limit(x,y) + subquery causes data loss and out-of-order results
> 
>
> Key: SPARK-40412
> URL: https://issues.apache.org/jira/browse/SPARK-40412
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.5
> Environment: hive on spark
> hive 3.1.0
> spark 2.4.5
>Reporter: FengJia
>Priority: Major
>  Labels: hiveonspark, limit
>
> select * 
> from(
> select * from
> table
> limit 10,20
> )
> The result has only 10 rows; they are not rows 11 to 20, and the order is wrong.
>  
> select * from
> table
> limit 10,20
> The result has 20 rows, ordered from row 11 to row 30.
> select * 
> from(
> select * from
> table
> order by id
> limit 10,20
> )
> The result has 20 rows, also ordered from row 11 to row 30.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40342) Implement `Rolling.quantile`.

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603430#comment-17603430
 ] 

Apache Spark commented on SPARK-40342:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37836

> Implement `Rolling.quantile`.
> -
>
> Key: SPARK-40342
> URL: https://issues.apache.org/jira/browse/SPARK-40342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `Rolling.quantile` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40342) Implement `Rolling.quantile`.

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40342:


Assignee: Apache Spark

> Implement `Rolling.quantile`.
> -
>
> Key: SPARK-40342
> URL: https://issues.apache.org/jira/browse/SPARK-40342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> We should implement `Rolling.quantile` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40342) Implement `Rolling.quantile`.

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603429#comment-17603429
 ] 

Apache Spark commented on SPARK-40342:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37836

> Implement `Rolling.quantile`.
> -
>
> Key: SPARK-40342
> URL: https://issues.apache.org/jira/browse/SPARK-40342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `Rolling.quantile` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40342) Implement `Rolling.quantile`.

2022-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40342:


Assignee: (was: Apache Spark)

> Implement `Rolling.quantile`.
> -
>
> Key: SPARK-40342
> URL: https://issues.apache.org/jira/browse/SPARK-40342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `Rolling.quantile` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

2022-09-13 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603426#comment-17603426
 ] 

pralabhkumar commented on SPARK-33782:
--

[~dongjoon] Please review the PR.

> Place spark.files, spark.jars and spark.files under the current working 
> directory on the driver in K8S cluster mode
> ---
>
> Key: SPARK-33782
> URL: https://issues.apache.org/jira/browse/SPARK-33782
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In YARN cluster mode, the passed files can be accessed in the 
> current working directory. It looks like this is not the case in Kubernetes 
> cluster mode.
> By doing this, users can, for example, leverage PEX to manage Python 
> dependencies in Apache Spark:
> {code}
> pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
> PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
> {code}
> See also https://github.com/apache/spark/pull/30735/files#r540935585.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40347) Implement `RollingGroupby.median`.

2022-09-13 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40347:
-

Assignee: Yikun Jiang

> Implement `RollingGroupby.median`.
> --
>
> Key: SPARK-40347
> URL: https://issues.apache.org/jira/browse/SPARK-40347
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
>
> We should implement `RollingGroupby.median` for increasing pandas API 
> coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40330) Implement `Series.searchsorted`.

2022-09-13 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40330:
-

Assignee: Ruifeng Zheng

> Implement `Series.searchsorted`.
> 
>
> Key: SPARK-40330
> URL: https://issues.apache.org/jira/browse/SPARK-40330
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Ruifeng Zheng
>Priority: Major
>
> We should implement `Series.searchsorted` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.Series.searchsorted.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40344) Implement `ExpandingGroupby.median`.

2022-09-13 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40344:
-

Assignee: Yikun Jiang

> Implement `ExpandingGroupby.median`.
> 
>
> Key: SPARK-40344
> URL: https://issues.apache.org/jira/browse/SPARK-40344
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
>
> We should implement `ExpandingGroupby.median` for increasing pandas API 
> coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40341) Implement `Rolling.median`.

2022-09-13 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40341:
-

Assignee: Yikun Jiang

> Implement `Rolling.median`.
> ---
>
> Key: SPARK-40341
> URL: https://issues.apache.org/jira/browse/SPARK-40341
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
>
> We should implement `Rolling.median` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.median.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40399) Make `pearson` correlation in `DataFrame.corr` support missing values and `min_periods`

2022-09-13 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40399.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37845
[https://github.com/apache/spark/pull/37845]

> Make `pearson` correlation in `DataFrame.corr` support missing values and 
> `min_periods`
> ---
>
> Key: SPARK-40399
> URL: https://issues.apache.org/jira/browse/SPARK-40399
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40399) Make `pearson` correlation in `DataFrame.corr` support missing values and `min_periods`

2022-09-13 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40399:
-

Assignee: Ruifeng Zheng

> Make `pearson` correlation in `DataFrame.corr` support missing values and 
> `min_periods`
> ---
>
> Key: SPARK-40399
> URL: https://issues.apache.org/jira/browse/SPARK-40399
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40348) Implement `RollingGroupby.quantile`.

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603383#comment-17603383
 ] 

Apache Spark commented on SPARK-40348:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37836

> Implement `RollingGroupby.quantile`.
> 
>
> Key: SPARK-40348
> URL: https://issues.apache.org/jira/browse/SPARK-40348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `RollingGroupby.quantile` for increasing pandas API 
> coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40348) Implement `RollingGroupby.quantile`.

2022-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17603384#comment-17603384
 ] 

Apache Spark commented on SPARK-40348:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/37836

> Implement `RollingGroupby.quantile`.
> 
>
> Key: SPARK-40348
> URL: https://issues.apache.org/jira/browse/SPARK-40348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should implement `RollingGroupby.quantile` for increasing pandas API 
> coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


